dataset_registry

Dataset registry: load and resolve dataset definitions from YAML.

Map human-readable dataset names to access strategies, variable specifications, and derivation rules. The registry is the single source of truth for "what datasets exist and how to access them." Pipeline stage 2 (stage2_resolve_datasets) consults the registry to resolve user-requested datasets and variables into concrete access instructions.

The registry supports five access strategies (stac_cog, local_tiff, nhgf_stac, climr_cat, and native_zarr/converted_zarr) and four variable types (direct VariableSpec, terrain-derived DerivedVariableSpec, multi-source categorical DerivedCategoricalSpec, and multi-source continuous DerivedContinuousSpec).

References

[1] docs/design.md, section 6.6 -- Dataset registry schema design.

[2] docs/design.md, section 11.3 -- Registry YAML conventions.

See Also

hydro_param.config : Pipeline configuration schema (DatasetRequest).

hydro_param.data_access : Functions that use registry entries to fetch data.

VALID_CATEGORIES module-attribute

VALID_CATEGORIES: frozenset[str] = frozenset(
    {
        "climate",
        "geology",
        "hydrography",
        "land_cover",
        "snow",
        "soils",
        "topography",
        "water_bodies",
    }
)

Valid dataset registry categories.

These correspond to the per-category YAML files bundled in hydro_param/data/datasets/. Used by hydro_param.config.PipelineConfig to validate category keys in the datasets: config section.
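
A minimal sketch of how such a category check might look. The helper name `unknown_categories`, the shape of the parsed `datasets:` section, and the dataset name "polaris" are assumptions made for illustration; the actual validation lives in PipelineConfig and may differ.

```python
# Copy of the documented constant so the example is self-contained.
VALID_CATEGORIES = frozenset(
    {
        "climate",
        "geology",
        "hydrography",
        "land_cover",
        "snow",
        "soils",
        "topography",
        "water_bodies",
    }
)


def unknown_categories(datasets_section: dict) -> list:
    """Return category keys that are not in VALID_CATEGORIES, sorted."""
    return sorted(set(datasets_section) - VALID_CATEGORIES)


# Hypothetical parsed `datasets:` section with one misspelled key ("soil").
config_section = {"topography": ["dem_3dep_10m"], "soil": ["polaris"]}
bad_keys = unknown_categories(config_section)
```

Using a frozenset keeps the membership test cheap and prevents accidental mutation of the allowed set at runtime.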

VariableSpec

Bases: BaseModel

Describe a variable available directly in a source dataset.

Each VariableSpec maps a logical variable name to its location within a source dataset (band number, STAC asset key, or file path override) and carries metadata for SIR normalization (units, long name, categorical flag).

ATTRIBUTE DESCRIPTION
name

Logical variable name used throughout the pipeline (e.g., "elevation", "land_cover", "ksat").

TYPE: str

band

Raster band number for multi-band GeoTIFFs. Default 1.

TYPE: int

units

Source data units (e.g., "m", "log10(cm/hr)", "%"). Empty string for dimensionless quantities.

TYPE: str

long_name

Human-readable description for NetCDF attributes and documentation.

TYPE: str

native_name

Variable name in the source data (e.g., OPeNDAP/CF name like "daily_mean_temperature_2m"). Required for temporal datasets to map gdptools output back to logical names.

TYPE: str

categorical

True for land-cover or other classification variables. Categorical variables produce per-class fraction columns in zonal statistics rather than continuous summary statistics.

TYPE: bool

asset_key

Per-variable STAC asset key override (e.g., "mukey" for gNATSGO). When None, uses the dataset-level asset_key.

TYPE: str or None

source_override

Per-variable source path or URL override (e.g., individual POLARIS VRT files). When None, uses the dataset-level source.

TYPE: str or None

scale_factor

Multiplicative scale factor for integer-encoded rasters (e.g., 0.01 for values stored as value × 100). Follows CF-conventions scale_factor semantics. When None, no scaling is applied. The pipeline applies this factor after zonal statistics so the SIR contains physically meaningful values.

TYPE: float or None
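
The scale_factor behaviour described above can be sketched as follows. This is an illustration only: `apply_scale_factor` is a hypothetical helper, not a function from hydro_param, and the raw value is made up.

```python
from __future__ import annotations


def apply_scale_factor(stat_value: float, scale_factor: float | None) -> float:
    """Scale a zonal statistic; None means the value is already physical."""
    if scale_factor is None:
        return stat_value
    return stat_value * scale_factor


# Hypothetical zonal mean of a raster stored as value x 100.
raw_mean = 2350.0
scaled_mean = apply_scale_factor(raw_mean, 0.01)
unscaled_mean = apply_scale_factor(raw_mean, None)
```

For a linear statistic such as the mean, scaling the aggregated value gives the same result as scaling every pixel first, which is presumably why the factor can be applied after zonal statistics.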

DerivedVariableSpec

Bases: BaseModel

Describe a variable derived from another variable in the same dataset.

Derived variables are computed from a source variable using a named method (e.g., slope and aspect from elevation via terrain analysis). They are resolved alongside direct variables in stage 2 and processed in stage 4.

ATTRIBUTE DESCRIPTION
name

Logical name for the derived variable (e.g., "slope", "aspect").

TYPE: str

source

Name of the source VariableSpec this is derived from (e.g., "elevation").

TYPE: str

method

Derivation method passed to the derivation function (e.g., "horn" for Horn 1981 finite-difference terrain analysis). The derivation function is selected by variable name via hydro_param.data_access.DERIVATION_FUNCTIONS.

TYPE: str

units

Units of the derived variable (e.g., "degrees").

TYPE: str

long_name

Human-readable description for metadata.

TYPE: str
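
As an illustration of a name-keyed derivation lookup with a method argument, the sketch below computes slope for a single 3x3 elevation window using the Horn (1981) finite-difference formula. The table `DERIVATION_FUNCTIONS_SKETCH` and the function `slope_degrees` are hypothetical stand-ins; the signatures in hydro_param.data_access are not shown in this page and may differ.

```python
import math


def slope_degrees(window, cell_size, method="horn"):
    """Slope in degrees at the centre cell of a 3x3 elevation window.

    Rows run north to south, columns west to east.
    """
    if method != "horn":
        raise ValueError(f"Unsupported method: {method!r}")
    (a, b, c), (d, _centre, f), (g, h, i) = window
    # Horn's weighted differences: the middle row/column counts twice.
    dz_dx = ((c + 2 * f + i) - (a + 2 * d + g)) / (8 * cell_size)
    dz_dy = ((g + 2 * h + i) - (a + 2 * b + c)) / (8 * cell_size)
    return math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))


# Hypothetical stand-in for a lookup keyed by derived variable name.
DERIVATION_FUNCTIONS_SKETCH = {"slope": slope_degrees}

# A plane rising 10 m per 10 m cell toward the east.
window = [
    [0.0, 10.0, 20.0],
    [0.0, 10.0, 20.0],
    [0.0, 10.0, 20.0],
]
slope = DERIVATION_FUNCTIONS_SKETCH["slope"](window, cell_size=10.0, method="horn")
```

The plane has a gradient of 1 in x and 0 in y, so the computed slope is 45 degrees up to floating-point rounding.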

DerivedCategoricalSpec

Bases: BaseModel

Describe a categorical variable derived from multiple source variables.

Multi-source categorical derivations classify pixels by combining two or more source bands (e.g., USDA texture triangle from sand/silt/clay percentages). The result is a single-band categorical raster processed with categorical zonal statistics to produce per-class fraction columns.

Unlike DerivedVariableSpec (single source, continuous output), this always produces categorical output with per-class fractions.

ATTRIBUTE DESCRIPTION
name

Logical name for the derived variable (e.g., "soil_texture").

TYPE: str

sources

Names of the source VariableSpec entries this is derived from (e.g., ["sand", "silt", "clay"]). Must contain at least 2 entries.

TYPE: list[str]

method

Classification method key used to look up the derivation function via hydro_param.data_access.CATEGORICAL_DERIVATION_FUNCTIONS.

TYPE: str

units

Units of the derived variable (typically "class").

TYPE: str

long_name

Human-readable description for metadata.

TYPE: str
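
The classify-then-aggregate flow described for DerivedCategoricalSpec can be sketched in plain Python. The classification rule below is deliberately simplified and is not the USDA texture triangle; both function names and the pixel values are hypothetical.

```python
from collections import Counter


def toy_texture_class(sand: float, silt: float, clay: float) -> str:
    """Toy classifier (NOT the USDA texture triangle)."""
    if sand >= 85.0:
        return "sand"
    if clay >= 40.0:
        return "clay"
    return "loam"


def class_fractions(classes) -> dict:
    """Fraction of pixels in each class, as categorical zonal stats would report."""
    counts = Counter(classes)
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}


# Four pixels inside one hypothetical polygon, given as percentages.
sand = [90.0, 40.0, 20.0, 88.0]
silt = [5.0, 40.0, 30.0, 7.0]
clay = [5.0, 20.0, 50.0, 5.0]

# Step 1: combine the source bands into one categorical value per pixel.
classified = [toy_texture_class(*pixel) for pixel in zip(sand, silt, clay)]
# Step 2: summarise the single categorical band as per-class fractions.
fractions = class_fractions(classified)
```

Classifying per pixel before aggregating matters: averaging sand, silt, and clay over the polygon and classifying the means would yield a single class and hide the mixture.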

DerivedContinuousSpec

Bases: BaseModel

Describe a continuous variable derived from pixel-level arithmetic on multiple sources.

Multi-source continuous derivations apply an arithmetic operation (multiply, divide, add, subtract) to two or more aligned source rasters before zonal statistics. This preserves within-HRU spatial correlation that would be lost by aggregating each raster independently and combining the results.

Unlike DerivedCategoricalSpec (multi-source, categorical output), this always produces continuous output processed with standard zonal statistics (mean, median, etc.).

ATTRIBUTE DESCRIPTION
name

Logical name for the derived variable (e.g., "soil_moist_product").

TYPE: str

sources

Names of the source VariableSpec entries to combine. Must contain at least 2 entries. All sources must belong to the same dataset.

TYPE: list[str]

operation

Arithmetic operation applied left-to-right across sources via functools.reduce.

TYPE: {'multiply', 'divide', 'add', 'subtract'}

align_to

Name of the source whose grid (resolution, extent, CRS) is used as the resampling template. Must be one of sources.

TYPE: str

units

Units of the derived variable after the operation.

TYPE: str

long_name

Human-readable description for metadata.

TYPE: str

scale_factor

Multiplicative factor applied to zonal statistics output (e.g., 0.01 to convert from percent to fraction).

TYPE: float or None

resampling_method

Rasterio resampling method name for aligning non-template sources (default "nearest").

TYPE: str
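
A small sketch of the left-to-right reduction and of why the order of operations matters. It assumes the sources are already aligned to a common grid and flattened to equal-length lists; `combine_pixels`, `OPERATIONS`, and the sample values are hypothetical and not part of hydro_param.

```python
import operator
from functools import reduce
from statistics import mean

# Map the documented operation names to binary operators.
OPERATIONS = {
    "multiply": operator.mul,
    "divide": operator.truediv,
    "add": operator.add,
    "subtract": operator.sub,
}


def combine_pixels(sources, operation):
    """Apply the operation left-to-right across sources, pixel by pixel."""
    op = OPERATIONS[operation]
    return [reduce(op, pixel_values) for pixel_values in zip(*sources)]


# Two hypothetical aligned rasters whose values are spatially correlated.
porosity = [0.4, 0.5, 0.2, 0.1]
depth = [2.0, 2.0, 1.0, 1.0]

# Combine first, then aggregate (the approach described above).
mean_of_product = mean(combine_pixels([porosity, depth], "multiply"))
# Aggregate first, then combine (loses the within-zone correlation).
product_of_means = mean(porosity) * mean(depth)
```

With these values the pixel-level product averages to about 0.525, while multiplying the two zonal means gives about 0.45. For non-commutative operations the source order is significant: three sources under "subtract" evaluate as (a - b) - c.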

DownloadFile

Bases: BaseModel

Describe a single downloadable file in a multi-file dataset.

Used by the hydro-param datasets download CLI command to stage local data files required by the local_tiff access strategy.

ATTRIBUTE DESCRIPTION
year

Calendar year this file covers.

TYPE: int

variable

Variable name this file provides (e.g., "ksat", "clay").

TYPE: str

url

Direct download URL for the file.

TYPE: str

size_gb

Approximate file size in gigabytes for progress reporting. None if unknown.

TYPE: float or None

DownloadInfo

Bases: BaseModel

Describe download provenance for datasets requiring local staging.

Some datasets (e.g., POLARIS soil data, GFv1.1 rasters) cannot be accessed through STAC or OPeNDAP and must be downloaded to local disk before processing. DownloadInfo records where to get the data, how large it is, and whether requester-pays access is needed.

Supports two modes: explicit files (a fixed list of DownloadFile entries) and template mode (a URL template expanded over year_range x variables_available).

ATTRIBUTE DESCRIPTION
url

Single-file download URL (mutually exclusive with files and url_template).

TYPE: str

size_gb

Approximate total download size in gigabytes.

TYPE: float or None

format

File format description (e.g., "GeoTIFF", "VRT").

TYPE: str

notes

Human-readable notes about access requirements.

TYPE: str

files

Explicit list of downloadable files (multi-file datasets).

TYPE: list[DownloadFile]

url_template

Python format string with {variable} and {year} placeholders (e.g., "https://example.com/{variable}_{year}.tif").

TYPE: str

year_range

Two-element [start, end] list for template expansion. Required when url_template is set.

TYPE: list[int]

variables_available

Variable names available for template expansion. Required when url_template is set.

TYPE: list[str]

requester_pays

True if the data source requires requester-pays access (e.g., s3://usgs-landcover).

TYPE: bool

RAISES DESCRIPTION
ValueError

If none of url, files, or url_template is provided, or if url_template is set without valid year_range and variables_available.
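
The validation rule above might be sketched with a stdlib dataclass standing in for the Pydantic model. Field defaults, the exact meaning of a "valid" year_range, and the error messages are assumptions; the mutual-exclusivity rule for url is not modelled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DownloadInfoSketch:
    """Hypothetical stand-in for DownloadInfo with only the mode fields."""

    url: str = ""
    files: list = field(default_factory=list)
    url_template: str = ""
    year_range: list = field(default_factory=list)
    variables_available: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # At least one download mode must be configured.
        if not (self.url or self.files or self.url_template):
            raise ValueError("Provide one of url, files, or url_template.")
        if self.url_template:
            # Assumed reading of "valid": two elements with start <= end.
            if len(self.year_range) != 2 or self.year_range[0] > self.year_range[1]:
                raise ValueError("url_template requires year_range [start, end].")
            if not self.variables_available:
                raise ValueError("url_template requires variables_available.")


info = DownloadInfoSketch(
    url_template="https://example.com/{variable}_{year}.tif",
    year_range=[2019, 2021],
    variables_available=["ksat"],
)
```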

expand_files

expand_files(
    *,
    years: set[int] | None = None,
    variables: set[str] | None = None,
) -> list[DownloadFile]

Expand download sources into a concrete list of files.

For template mode, iterate year_range x variables_available and format the url_template with {variable} and {year} placeholders. For explicit files mode, return the files list. In both modes, optional years and variables filters restrict the output.

PARAMETER DESCRIPTION
years

If given, only include files matching these calendar years.

TYPE: set[int] or None DEFAULT: None

variables

If given, only include files matching these variable names.

TYPE: set[str] or None DEFAULT: None

RETURNS DESCRIPTION
list[DownloadFile]

Expanded and filtered list of downloadable files.
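
To make the template-mode expansion concrete, here is a standalone sketch that mirrors the documented behaviour but returns plain tuples instead of DownloadFile objects. `expand_template` is a hypothetical function written for this example, and the URL is the placeholder used in the DownloadInfo description.

```python
def expand_template(url_template, year_range, variables_available,
                    *, years=None, variables=None):
    """Expand a URL template over years x variables, with optional filters."""
    start, end = year_range
    return [
        (yr, var, url_template.format(variable=var, year=yr))
        for yr in range(start, end + 1)  # the end year is inclusive
        if years is None or yr in years
        for var in variables_available
        if variables is None or var in variables
    ]


expanded = expand_template(
    "https://example.com/{variable}_{year}.tif",
    [2019, 2021],
    ["ksat", "clay"],
    years={2020, 2021},
    variables={"clay"},
)
```

Here three years and two variables would give six files without filters; restricting to two years and one variable leaves two entries, ordered by year and then by variable.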

Source code in src/hydro_param/dataset_registry.py
def expand_files(
    self,
    *,
    years: set[int] | None = None,
    variables: set[str] | None = None,
) -> list[DownloadFile]:
    """Expand download sources into a concrete list of files.

    For **template mode**, iterate ``year_range x variables_available``
    and format the ``url_template`` with ``{variable}`` and ``{year}``
    placeholders.  For **explicit files mode**, return the ``files``
    list.  In both modes, optional ``years`` and ``variables`` filters
    restrict the output.

    Parameters
    ----------
    years : set[int] or None
        If given, only include files matching these calendar years.
    variables : set[str] or None
        If given, only include files matching these variable names.

    Returns
    -------
    list[DownloadFile]
        Expanded and filtered list of downloadable files.
    """
    if self.url_template:
        start, end = self.year_range
        result = []
        for yr in range(start, end + 1):
            if years is not None and yr not in years:
                continue
            for var in self.variables_available:
                if variables is not None and var not in variables:
                    continue
                url = self.url_template.format(variable=var, year=yr)
                result.append(DownloadFile(year=yr, variable=var, url=url))
        return result

    result = list(self.files)
    if years is not None:
        result = [f for f in result if f.year in years]
    if variables is not None:
        result = [f for f in result if f.variable in variables]
    return result

DatasetEntry

Bases: BaseModel

Describe a single dataset in the registry.

Each entry captures everything needed to access, process, and normalize a source dataset: the access strategy, connection parameters (STAC catalog URL, collection, asset key, etc.), coordinate system, and the list of available variables.

The strategy field determines which data access pathway is used:

  • "stac_cog" -- STAC COG via Planetary Computer (3DEP, gNATSGO).
  • "local_tiff" -- local GeoTIFF files (POLARIS, GFv1.1).
  • "nhgf_stac" -- NHGF STAC catalog (NLCD Annual on OSN).
  • "climr_cat" -- ClimateR-Catalog via OPeNDAP (gridMET).
  • "native_zarr" / "converted_zarr" -- Zarr stores (planned).
ATTRIBUTE DESCRIPTION
description

Human-readable dataset description.

TYPE: str

strategy

Data access strategy identifier.

TYPE: str

catalog_url

STAC catalog URL (required for stac_cog).

TYPE: str or None

collection

STAC collection name (required for stac_cog and nhgf_stac).

TYPE: str or None

asset_key

Default STAC asset key. Default "data".

TYPE: str

gsd

Ground sample distance in metres (STAC COG spatial resolution).

TYPE: int or None

sign

STAC signing method (e.g., "planetary-computer").

TYPE: str or None

source

Local file path or remote URL for Zarr/local_tiff datasets.

TYPE: str or None

download

Download provenance for datasets requiring local staging.

TYPE: DownloadInfo or None

catalog_id

ClimateR-Catalog identifier (required for climr_cat).

TYPE: str or None

crs

Coordinate reference system as an EPSG string. Default "EPSG:4326".

TYPE: str

x_coord

Name of the x/longitude coordinate. Default "x".

TYPE: str

y_coord

Name of the y/latitude coordinate. Default "y".

TYPE: str

t_coord

Name of the time coordinate (required for temporal datasets).

TYPE: str or None

variables

Variables directly available in this dataset.

TYPE: list[VariableSpec]

derived_variables

Variables computed from other variables in this dataset.

TYPE: list[DerivedVariableSpec]

derived_categorical_variables

Categorical variables derived from multiple source variables in this dataset. Searched by DatasetRegistry.resolve_variable after derived_variables.

TYPE: list[DerivedCategoricalSpec]

derived_continuous_variables

Continuous variables derived from multiple source variables in this dataset. Searched by DatasetRegistry.resolve_variable last.

TYPE: list[DerivedContinuousSpec]

category

Dataset category for grouping (e.g., "topography", "soils", "land_cover").

TYPE: str

temporal

True for time-indexed datasets (e.g., gridMET, SNODAS).

TYPE: bool

time_step

Temporal resolution of the dataset. Required when temporal is True. None for static datasets.

TYPE: {'daily', 'monthly'} or None

year_range

Two-element [start, end] list of available calendar years. Must satisfy start <= end.

TYPE: list[int] or None

RAISES DESCRIPTION
ValueError

If required strategy-specific fields are missing, or if constraints are violated (e.g., temporal without t_coord).
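
A rough sketch of the strategy-specific checks, based only on the "required for" notes in the attribute list above. The class is a stdlib stand-in, not the real Pydantic model; the actual validator may enforce more or different constraints, and the example URL and collection name are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass

# Fields each strategy needs, taken from the attribute descriptions above.
REQUIRED_BY_STRATEGY = {
    "stac_cog": ("catalog_url", "collection"),
    "nhgf_stac": ("collection",),
    "climr_cat": ("catalog_id",),
}


@dataclass
class DatasetEntrySketch:
    """Hypothetical stand-in carrying only the validated fields."""

    strategy: str
    catalog_url: str | None = None
    collection: str | None = None
    catalog_id: str | None = None
    temporal: bool = False
    t_coord: str | None = None
    time_step: str | None = None
    year_range: list | None = None

    def __post_init__(self) -> None:
        missing = [
            name
            for name in REQUIRED_BY_STRATEGY.get(self.strategy, ())
            if getattr(self, name) is None
        ]
        if missing:
            raise ValueError(f"{self.strategy} requires: {', '.join(missing)}")
        if self.temporal and (self.t_coord is None or self.time_step is None):
            raise ValueError("temporal datasets require t_coord and time_step")
        if self.year_range is not None:
            start, end = self.year_range
            if start > end:
                raise ValueError("year_range must satisfy start <= end")


ok_entry = DatasetEntrySketch(
    strategy="stac_cog",
    catalog_url="https://example.com/stac",  # placeholder URL
    collection="example-collection",  # placeholder collection name
)
```

Failing at construction time means a malformed YAML entry is reported when the registry is loaded rather than later, during data access.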

DatasetRegistry

Bases: BaseModel

Contain and query all registered datasets.

Provides lookup by name and variable resolution across the full set of loaded datasets. Typically created by load_registry() from one or more YAML files.

ATTRIBUTE DESCRIPTION
datasets

Mapping of dataset name to entry. Names must be unique across all registry files.

TYPE: dict[str, DatasetEntry]

See Also

load_registry : Load a registry from YAML file(s).

get

get(name: str) -> DatasetEntry

Look up a dataset by name.

PARAMETER DESCRIPTION
name

Dataset name as it appears in the registry YAML (e.g., "3dep", "gnatsgo", "gridmet").

TYPE: str

RETURNS DESCRIPTION
DatasetEntry

The matching dataset entry.

RAISES DESCRIPTION
KeyError

If name is not found. The error message lists all available dataset names for debugging.

Source code in src/hydro_param/dataset_registry.py
def get(self, name: str) -> DatasetEntry:
    """Look up a dataset by name.

    Parameters
    ----------
    name : str
        Dataset name as it appears in the registry YAML (e.g.,
        ``"3dep"``, ``"gnatsgo"``, ``"gridmet"``).

    Returns
    -------
    DatasetEntry
        The matching dataset entry.

    Raises
    ------
    KeyError
        If ``name`` is not found.  The error message lists all
        available dataset names for debugging.
    """
    if name not in self.datasets:
        available = ", ".join(sorted(self.datasets.keys()))
        raise KeyError(f"Dataset '{name}' not found in registry. Available: {available}")
    return self.datasets[name]

resolve_variable

resolve_variable(
    dataset_name: str, variable_name: str
) -> (
    VariableSpec
    | DerivedVariableSpec
    | DerivedCategoricalSpec
    | DerivedContinuousSpec
)

Resolve a variable name to its specification within a dataset.

Search direct variables, derived variables, derived categorical variables, and derived continuous variables in the named dataset. Direct variables are checked first.

PARAMETER DESCRIPTION
dataset_name

Dataset name in the registry (e.g., "3dep").

TYPE: str

variable_name

Variable name to look up (e.g., "elevation", "slope").

TYPE: str

RETURNS DESCRIPTION
VariableSpec or DerivedVariableSpec or DerivedCategoricalSpec or DerivedContinuousSpec

The matching variable specification. Direct variables are checked first, then derived, then derived categorical, then derived continuous.

RAISES DESCRIPTION
KeyError

If the dataset is not found in the registry, or the variable is not found in the dataset. The error message lists all available variable names for debugging.
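
The lookup order can be illustrated with plain dictionaries standing in for a DatasetEntry. The function `resolve_group` is hypothetical and returns the name of the group that matched rather than a spec object, which makes the precedence easy to see.

```python
# Groups in the order the documentation says they are searched.
SEARCH_ORDER = (
    "variables",
    "derived_variables",
    "derived_categorical_variables",
    "derived_continuous_variables",
)


def resolve_group(entry: dict, variable_name: str) -> str:
    """Return the first group (in SEARCH_ORDER) that contains the name."""
    for group in SEARCH_ORDER:
        if variable_name in entry.get(group, []):
            return group
    available = [name for group in SEARCH_ORDER for name in entry.get(group, [])]
    raise KeyError(
        f"Variable '{variable_name}' not found. Available: {', '.join(available)}"
    )


# Hypothetical entry: one direct variable and two terrain derivatives.
entry = {"variables": ["elevation"], "derived_variables": ["slope", "aspect"]}
slope_group = resolve_group(entry, "slope")
```

Because direct variables are searched first, a derived variable that reuses the name of a direct variable would never be returned.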

Source code in src/hydro_param/dataset_registry.py
def resolve_variable(
    self, dataset_name: str, variable_name: str
) -> VariableSpec | DerivedVariableSpec | DerivedCategoricalSpec | DerivedContinuousSpec:
    """Resolve a variable name to its specification within a dataset.

    Search direct variables, derived variables, derived categorical
    variables, and derived continuous variables in the named dataset.
    Direct variables are checked first.

    Parameters
    ----------
    dataset_name : str
        Dataset name in the registry (e.g., ``"3dep"``).
    variable_name : str
        Variable name to look up (e.g., ``"elevation"``, ``"slope"``).

    Returns
    -------
    VariableSpec or DerivedVariableSpec or DerivedCategoricalSpec or DerivedContinuousSpec
        The matching variable specification.  Direct variables are
        checked first, then derived, then derived categorical, then
        derived continuous.

    Raises
    ------
    KeyError
        If the dataset is not found in the registry, or the variable
        is not found in the dataset.  The error message lists all
        available variable names for debugging.
    """
    entry = self.get(dataset_name)
    for v in entry.variables:
        if v.name == variable_name:
            return v
    for dv in entry.derived_variables:
        if dv.name == variable_name:
            return dv
    for dcv in entry.derived_categorical_variables:
        if dcv.name == variable_name:
            return dcv
    for dcont in entry.derived_continuous_variables:
        if dcont.name == variable_name:
            return dcont
    available = (
        [v.name for v in entry.variables]
        + [dv.name for dv in entry.derived_variables]
        + [dcv.name for dcv in entry.derived_categorical_variables]
        + [dcont.name for dcont in entry.derived_continuous_variables]
    )
    raise KeyError(
        f"Variable '{variable_name}' not found in dataset '{dataset_name}'. "
        f"Available: {', '.join(available)}"
    )

get_all_dataset_names

get_all_dataset_names(
    registry: DatasetRegistry,
) -> set[str]

Return the set of all dataset names in the registry.

PARAMETER DESCRIPTION
registry

A loaded dataset registry.

TYPE: DatasetRegistry

RETURNS DESCRIPTION
set[str]

All dataset names (e.g., {"dem_3dep_10m", "gridmet", ...}).

Source code in src/hydro_param/dataset_registry.py
def get_all_dataset_names(registry: DatasetRegistry) -> set[str]:
    """Return the set of all dataset names in the registry.

    Parameters
    ----------
    registry : DatasetRegistry
        A loaded dataset registry.

    Returns
    -------
    set[str]
        All dataset names (e.g., ``{"dem_3dep_10m", "gridmet", ...}``).
    """
    return set(registry.datasets.keys())

load_registry

load_registry(
    path: str | Path,
    *,
    overlay_dirs: list[Path] | None = None,
) -> DatasetRegistry

Load a dataset registry from YAML file(s), with optional overlays.

When path is a directory, all *.yml and *.yaml files are loaded and merged into a single registry. Dataset names must be unique across all files -- duplicates raise ValueError.

Overlay directories (e.g., ~/.hydro-param/datasets/) are scanned after the primary registry. Overlay entries are merged into the result; on name collision, the overlay entry replaces the primary entry (no partial merge). Non-existent or empty overlay directories are silently skipped.

PARAMETER DESCRIPTION
path

Path to a single registry YAML file, or a directory containing per-category YAML files (e.g., the bundled hydro_param.data.datasets). Each file must have a top-level datasets: key mapping dataset names to entries.

TYPE: str or Path

overlay_dirs

Optional list of directories containing user-local registry overlays. Each directory is scanned for *.yml/*.yaml files. Later directories take precedence over earlier ones.

TYPE: list[Path] or None DEFAULT: None

RETURNS DESCRIPTION
DatasetRegistry

Merged registry containing all datasets found.

RAISES DESCRIPTION
FileNotFoundError

If path does not exist, is neither a file nor a directory, or the directory contains no YAML files with datasets.

ValueError

If a dataset name appears in more than one YAML file within the primary registry directory. Overlay collisions with the primary registry are resolved silently (overlay wins).

Examples:

>>> from hydro_param.pipeline import DEFAULT_REGISTRY
>>> registry = load_registry(DEFAULT_REGISTRY)
>>> entry = registry.get("dem_3dep_10m")
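
The merge semantics described above (duplicates within the primary registry are errors, overlays replace whole entries, later overlays win) can be sketched with plain dictionaries. The helper names, file names, and entry contents are hypothetical; the private helpers `_load_registry_dir` and `_merge_overlays` are not shown on this page and may work differently.

```python
def merge_primary(files: dict) -> dict:
    """Merge per-file dataset mappings; a repeated name is an error."""
    merged = {}
    for filename, datasets in files.items():
        for name, entry in datasets.items():
            if name in merged:
                raise ValueError(f"Duplicate dataset '{name}' in {filename}")
            merged[name] = entry
    return merged


def apply_overlays(primary: dict, overlays: list) -> dict:
    """Apply overlays in order; each match replaces the whole entry."""
    result = dict(primary)
    for overlay in overlays:
        result.update(overlay)  # no field-level (partial) merge
    return result


# Hypothetical parsed contents of two bundled registry files.
primary = merge_primary(
    {
        "topography.yml": {"dem": {"strategy": "stac_cog", "crs": "EPSG:4326"}},
        "soils.yml": {"polaris": {"strategy": "local_tiff"}},
    }
)
# Two overlay directories, each redefining "dem"; the later one wins.
registry = apply_overlays(
    primary,
    [{"dem": {"strategy": "local_tiff"}}, {"dem": {"strategy": "native_zarr"}}],
)
```

Note that the overlaid "dem" entry no longer carries the crs field from the primary entry: replacement is all-or-nothing, so an overlay must restate every field it needs.
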
Source code in src/hydro_param/dataset_registry.py
def load_registry(
    path: str | Path,
    *,
    overlay_dirs: list[Path] | None = None,
) -> DatasetRegistry:
    """Load a dataset registry from YAML file(s), with optional overlays.

    When ``path`` is a directory, all ``*.yml`` and ``*.yaml`` files are
    loaded and merged into a single registry.  Dataset names must be
    unique across all files -- duplicates raise ``ValueError``.

    Overlay directories (e.g., ``~/.hydro-param/datasets/``) are scanned
    after the primary registry.  Overlay entries are merged into the
    result; on name collision, the overlay entry replaces the primary
    entry (no partial merge).  Non-existent or empty overlay directories
    are silently skipped.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to a single registry YAML file, or a directory containing
        per-category YAML files (e.g., the bundled
        ``hydro_param.data.datasets``).  Each file must have a top-level
        ``datasets:`` key mapping dataset names to entries.
    overlay_dirs : list[Path] or None
        Optional list of directories containing user-local registry
        overlays.  Each directory is scanned for ``*.yml``/``*.yaml``
        files.  Later directories take precedence over earlier ones.

    Returns
    -------
    DatasetRegistry
        Merged registry containing all datasets found.

    Raises
    ------
    FileNotFoundError
        If ``path`` does not exist, is neither file nor directory, or
        the directory contains no YAML files with datasets.
    ValueError
        If a dataset name appears in more than one YAML file within
        the *primary* registry directory.  Overlay collisions with the
        primary registry are resolved silently (overlay wins).

    Examples
    --------
    >>> from hydro_param.pipeline import DEFAULT_REGISTRY
    >>> registry = load_registry(DEFAULT_REGISTRY)
    >>> entry = registry.get("dem_3dep_10m")
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Registry path does not exist: {path}")
    if path.is_file():
        registry = _load_registry_file(path)
    elif path.is_dir():
        registry = _load_registry_dir(path)
    else:
        raise FileNotFoundError(f"Registry path is neither a file nor directory: {path}")

    if overlay_dirs:
        registry = _merge_overlays(registry, overlay_dirs)

    return registry