dataset_registry¶

Dataset registry: load and resolve dataset definitions from YAML.

Map human-readable dataset names to access strategies, variable specifications, and derivation rules. The registry is the single source of truth for "what datasets exist and how to access them." Pipeline stage 2 (stage2_resolve_datasets) consults the registry to resolve user-requested datasets and variables into concrete access instructions.

The registry supports five access strategies (stac_cog, local_tiff, nhgf_stac, climr_cat, native_zarr/converted_zarr) and four variable types (direct VariableSpec, terrain-derived DerivedVariableSpec, multi-source categorical DerivedCategoricalSpec, and multi-source continuous DerivedContinuousSpec).

References

.. [1] docs/design.md, section 6.6 -- Dataset registry schema design.
.. [2] docs/design.md, section 11.3 -- Registry YAML conventions.

VALID_CATEGORIES `module-attribute` ¶

VALID_CATEGORIES: frozenset[str] = frozenset(
    {
        "climate",
        "geology",
        "hydrography",
        "land_cover",
        "snow",
        "soils",
        "topography",
        "water_bodies",
    }
)

Valid dataset registry categories.

These correspond to the per-category YAML files bundled in hydro_param/data/datasets/. Used by hydro_param.config.PipelineConfig to validate category keys in the datasets: config section.

VariableSpec ¶

Bases: BaseModel

Describe a variable available directly in a source dataset.

Each VariableSpec maps a logical variable name to its location within a source dataset (band number, STAC asset key, or file path override) and carries metadata for SIR normalization (units, long name, categorical flag).

ATTRIBUTE	DESCRIPTION
`name`	Logical variable name used throughout the pipeline (e.g., `"elevation"`, `"land_cover"`, `"ksat"`). TYPE: `str`
`band`	Raster band number for multi-band GeoTIFFs. Default `1`. TYPE: `int`
`units`	Source data units (e.g., `"m"`, `"log10(cm/hr)"`, `"%"`). Empty string for dimensionless quantities. TYPE: `str`
`long_name`	Human-readable description for NetCDF attributes and documentation. TYPE: `str`
`native_name`	Variable name in the source data (e.g., OPeNDAP/CF name like `"daily_mean_temperature_2m"`). Required for temporal datasets to map gdptools output back to logical names. TYPE: `str`
`categorical`	`True` for land-cover or other classification variables. Categorical variables produce per-class fraction columns in zonal statistics rather than continuous summary statistics. TYPE: `bool`
`asset_key`	Per-variable STAC asset key override (e.g., `"mukey"` for gNATSGO). When `None`, uses the dataset-level `asset_key`. TYPE: `str or None`
`source_override`	Per-variable source path or URL override (e.g., individual POLARIS VRT files). When `None`, uses the dataset-level `source`. TYPE: `str or None`
`scale_factor`	Multiplicative scale factor for integer-encoded rasters (e.g., `0.01` for values stored as `value × 100`). Follows CF-conventions `scale_factor` semantics. When `None`, no scaling is applied. The pipeline applies this factor after zonal statistics so the SIR contains physically meaningful values. TYPE: `float or None`
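
A minimal sketch of how a `scale_factor` might be applied after zonal statistics. The `zonal_mean` helper is hypothetical and only stands in for the pipeline's real zonal-statistics step; the raster is assumed to store values as `value × 100`.

```python
import numpy as np


def zonal_mean(values: np.ndarray, zones: np.ndarray) -> dict[int, float]:
    """Mean of `values` per zone id (toy stand-in for real zonal statistics)."""
    return {int(z): float(values[zones == z].mean()) for z in np.unique(zones)}


# Integer-encoded raster: each cell holds the physical value multiplied by 100.
raw = np.array([[1250, 1300], [400, 600]], dtype=np.int32)
zones = np.array([[1, 1], [2, 2]])
scale_factor = 0.01

# Aggregate first, then scale the per-zone result back to physical units.
raw_means = zonal_mean(raw, zones)
scaled = {zone: mean * scale_factor for zone, mean in raw_means.items()}
```

Because the mean is linear, scaling the aggregated value gives the same result as scaling every pixel first, while touching far fewer numbers.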

DerivedVariableSpec ¶

Bases: BaseModel

Describe a variable derived from another variable in the same dataset.

Derived variables are computed from a source variable using a named method (e.g., slope and aspect from elevation via terrain analysis). They are resolved alongside direct variables in stage 2 and processed in stage 4.

ATTRIBUTE	DESCRIPTION
`name`	Logical name for the derived variable (e.g., `"slope"`, `"aspect"`). TYPE: `str`
`source`	Name of the source `VariableSpec` this is derived from (e.g., `"elevation"`). TYPE: `str`
`method`	Derivation method passed to the derivation function (e.g., `"horn"` for Horn 1981 finite-difference terrain analysis). The derivation function is selected by variable `name` via `hydro_param.data_access.DERIVATION_FUNCTIONS`. TYPE: `str`
`units`	Units of the derived variable (e.g., `"degrees"`). TYPE: `str`
`long_name`	Human-readable description for metadata. TYPE: `str`
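
To illustrate what a `"horn"` terrain derivation computes, here is a sketch of Horn's 3x3 weighted finite-difference slope in NumPy. The function name `horn_slope_degrees` is hypothetical, and this is not the implementation registered in `hydro_param.data_access.DERIVATION_FUNCTIONS`; it assumes a square cell size and returns values for interior cells only.

```python
import numpy as np


def horn_slope_degrees(dem: np.ndarray, cellsize: float) -> np.ndarray:
    """Slope in degrees for interior cells using Horn's 3x3 weighted differences."""
    z = dem.astype(float)
    # Eight neighbours of every interior cell (row 0 is treated as north).
    nw, n, ne = z[:-2, :-2], z[:-2, 1:-1], z[:-2, 2:]
    w, e = z[1:-1, :-2], z[1:-1, 2:]
    sw, s, se = z[2:, :-2], z[2:, 1:-1], z[2:, 2:]
    # Weighted differences: edge-adjacent neighbours count twice.
    dz_dx = ((ne + 2 * e + se) - (nw + 2 * w + sw)) / (8 * cellsize)
    dz_dy = ((sw + 2 * s + se) - (nw + 2 * n + ne)) / (8 * cellsize)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))


# A plane rising 1 m per 1 m cell in the x direction.
dem = np.tile(np.arange(5, dtype=float), (5, 1))
slope = horn_slope_degrees(dem, cellsize=1.0)
```

On this synthetic plane the gradient is 1 in x and 0 in y, so every interior cell has a 45 degree slope.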

DerivedCategoricalSpec ¶

Bases: BaseModel

Describe a categorical variable derived from multiple source variables.

Multi-source categorical derivations classify pixels by combining two or more source bands (e.g., USDA texture triangle from sand/silt/clay percentages). The result is a single-band categorical raster processed with categorical zonal statistics to produce per-class fraction columns.

Unlike DerivedVariableSpec (single source, continuous output), this always produces categorical output with per-class fractions.

ATTRIBUTE	DESCRIPTION
`name`	Logical name for the derived variable (e.g., `"soil_texture"`). TYPE: `str`
`sources`	Names of the source `VariableSpec` entries this is derived from (e.g., `["sand", "silt", "clay"]`). Must contain at least 2 entries. TYPE: `list[str]`
`method`	Classification method key used to look up the derivation function via `hydro_param.data_access.CATEGORICAL_DERIVATION_FUNCTIONS`. TYPE: `str`
`units`	Units of the derived variable (typically `"class"`). TYPE: `str`
`long_name`	Human-readable description for metadata. TYPE: `str`
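
The flow described above (combine source bands into one class raster, then compute per-class fractions per zone) can be sketched as follows. The classifier is a deliberately simplified toy rule, not the USDA texture triangle, and both function names are hypothetical.

```python
import numpy as np


def classify(sand: np.ndarray, clay: np.ndarray) -> np.ndarray:
    """Toy two-band classifier: 1 = sandy, 2 = clayey, 3 = other."""
    classes = np.full(sand.shape, 3, dtype=np.uint8)
    classes[sand >= 70] = 1
    classes[clay >= 40] = 2
    return classes


def class_fractions(classes: np.ndarray, zones: np.ndarray, zone_id: int) -> dict[int, float]:
    """Fraction of pixels in each class within one zone."""
    in_zone = classes[zones == zone_id]
    ids, counts = np.unique(in_zone, return_counts=True)
    return {int(c): float(n) / in_zone.size for c, n in zip(ids, counts)}


sand = np.array([[80, 75], [20, 30]])
clay = np.array([[5, 10], [50, 20]])
zones = np.ones((2, 2), dtype=int)  # a single zone covering all four pixels

fractions = class_fractions(classify(sand, clay), zones, zone_id=1)
```

Here two pixels are sandy, one is clayey, and one is other, so the zone gets fractions of 0.5, 0.25, and 0.25.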

DerivedContinuousSpec ¶

Bases: BaseModel

Describe a continuous variable derived from pixel-level arithmetic on multiple sources.

Multi-source continuous derivations apply an arithmetic operation (multiply, divide, add, subtract) to two or more aligned source rasters before zonal statistics. This preserves within-HRU spatial correlation that would be lost by aggregating each raster independently and combining the results.

Unlike DerivedCategoricalSpec (multi-source, categorical output), this always produces continuous output processed with standard zonal statistics (mean, median, etc.).

ATTRIBUTE	DESCRIPTION
`name`	Logical name for the derived variable (e.g., `"soil_moist_product"`). TYPE: `str`
`sources`	Names of the source `VariableSpec` entries to combine. Must contain at least 2 entries. All sources must belong to the same dataset. TYPE: `list[str]`
`operation`	Arithmetic operation applied left-to-right across sources via `functools.reduce`. TYPE: `{'multiply', 'divide', 'add', 'subtract'}`
`align_to`	Name of the source whose grid (resolution, extent, CRS) is used as the resampling template. Must be one of `sources`. TYPE: `str`
`units`	Units of the derived variable after the operation. TYPE: `str`
`long_name`	Human-readable description for metadata. TYPE: `str`
`scale_factor`	Multiplicative factor applied to zonal statistics output (e.g., 0.01 to convert from percent to fraction). TYPE: `float or None`
`resampling_method`	Rasterio resampling method name for aligning non-template sources (default `"nearest"`). TYPE: `str`
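
A small sketch of the left-to-right reduction, and of why pixel-level arithmetic before aggregation differs from combining per-zone aggregates. The `OPERATIONS` mapping and `combine` helper are assumptions for illustration; the rasters are taken to be already aligned to a common grid.

```python
import operator
from functools import reduce

import numpy as np

# Hypothetical mapping from operation keys to element-wise operators.
OPERATIONS = {
    "multiply": operator.mul,
    "divide": operator.truediv,
    "add": operator.add,
    "subtract": operator.sub,
}


def combine(rasters: list[np.ndarray], operation: str) -> np.ndarray:
    """Apply `operation` left-to-right across aligned rasters."""
    return reduce(OPERATIONS[operation], rasters)


a = np.array([1.0, 3.0])
b = np.array([10.0, 20.0])

# Combine per pixel, then aggregate: mean of [10, 60].
pixel_product_mean = combine([a, b], "multiply").mean()
# Aggregate each raster, then combine: 2 * 15.
product_of_means = a.mean() * b.mean()

# Order matters for non-commutative operations: (10 - 3) - 2.
chained = combine([np.array([10.0]), np.array([3.0]), np.array([2.0])], "subtract")
```

The two aggregation orders give 35 and 30 respectively; the difference is the within-zone covariance that independent aggregation discards.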

DownloadFile ¶

Bases: BaseModel

Describe a single downloadable file in a multi-file dataset.

Used by the hydro-param datasets download CLI command to stage local data files required by the local_tiff access strategy.

ATTRIBUTE	DESCRIPTION
`year`	Calendar year this file covers. TYPE: `int`
`variable`	Variable name this file provides (e.g., `"ksat"`, `"clay"`). TYPE: `str`
`url`	Direct download URL for the file. TYPE: `str`
`size_gb`	Approximate file size in gigabytes for progress reporting. `None` if unknown. TYPE: `float or None`

DownloadInfo ¶

Bases: BaseModel

Describe download provenance for datasets requiring local staging.

Some datasets (e.g., POLARIS soil data, GFv1.1 rasters) cannot be accessed through STAC or OPeNDAP and must be downloaded to local disk before processing. DownloadInfo records where to get the data, how large it is, and whether requester-pays access is needed.

Supports two modes: explicit files (a fixed list of DownloadFile entries) and template mode (a URL template expanded over year_range x variables_available).

ATTRIBUTE	DESCRIPTION
`url`	Single-file download URL (mutually exclusive with `files` and `url_template`). TYPE: `str`
`size_gb`	Approximate total download size in gigabytes. TYPE: `float or None`
`format`	File format description (e.g., `"GeoTIFF"`, `"VRT"`). TYPE: `str`
`notes`	Human-readable notes about access requirements. TYPE: `str`
`files`	Explicit list of downloadable files (multi-file datasets). TYPE: `list[DownloadFile]`
`url_template`	Python format string with `{variable}` and `{year}` placeholders (e.g., `"https://example.com/{variable}_{year}.tif"`). TYPE: `str`
`year_range`	Two-element `[start, end]` list for template expansion. Required when `url_template` is set. TYPE: `list[int]`
`variables_available`	Variable names available for template expansion. Required when `url_template` is set. TYPE: `list[str]`
`requester_pays`	`True` if the data source requires requester-pays access (e.g., `s3://usgs-landcover`). TYPE: `bool`

RAISES	DESCRIPTION
`ValueError`	If none of `url`, `files`, or `url_template` is provided, or if `url_template` is set without valid `year_range` and `variables_available`.

expand_files ¶

expand_files(
    *,
    years: set[int] | None = None,
    variables: set[str] | None = None,
) -> list[DownloadFile]

Expand download sources into a concrete list of files.

For template mode, iterate year_range x variables_available and format the url_template with {variable} and {year} placeholders. For explicit files mode, return the files list. In both modes, optional years and variables filters restrict the output.

PARAMETER	DESCRIPTION
`years`	If given, only include files matching these calendar years. TYPE: `set[int] or None` DEFAULT: `None`
`variables`	If given, only include files matching these variable names. TYPE: `set[str] or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[DownloadFile]`	Expanded and filtered list of downloadable files.

Source code in src/hydro_param/dataset_registry.py

def expand_files(
    self,
    *,
    years: set[int] | None = None,
    variables: set[str] | None = None,
) -> list[DownloadFile]:
    """Expand download sources into a concrete list of files.

    For **template mode**, iterate ``year_range x variables_available``
    and format the ``url_template`` with ``{variable}`` and ``{year}``
    placeholders.  For **explicit files mode**, return the ``files``
    list.  In both modes, optional ``years`` and ``variables`` filters
    restrict the output.

    Parameters
    ----------
    years : set[int] or None
        If given, only include files matching these calendar years.
    variables : set[str] or None
        If given, only include files matching these variable names.

    Returns
    -------
    list[DownloadFile]
        Expanded and filtered list of downloadable files.
    """
    if self.url_template:
        start, end = self.year_range
        result = []
        for yr in range(start, end + 1):
            if years is not None and yr not in years:
                continue
            for var in self.variables_available:
                if variables is not None and var not in variables:
                    continue
                url = self.url_template.format(variable=var, year=yr)
                result.append(DownloadFile(year=yr, variable=var, url=url))
        return result

    result = list(self.files)
    if years is not None:
        result = [f for f in result if f.year in years]
    if variables is not None:
        result = [f for f in result if f.variable in variables]
    return result
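
As a usage illustration, the template branch can be mirrored in a standalone sketch. `File` and `expand` are hypothetical stand-ins for `DownloadFile` and `expand_files`, written so the example runs without the package installed; the URL template is the one used as an example in the attribute table.

```python
from dataclasses import dataclass


@dataclass
class File:
    """Stand-in for DownloadFile with only the fields used here."""
    year: int
    variable: str
    url: str


def expand(url_template, year_range, variables_available, *, years=None, variables=None):
    """Expand a URL template over an inclusive year range and a variable list."""
    start, end = year_range
    return [
        File(yr, var, url_template.format(variable=var, year=yr))
        for yr in range(start, end + 1)  # the end year is included
        if years is None or yr in years
        for var in variables_available
        if variables is None or var in variables
    ]


template = "https://example.com/{variable}_{year}.tif"
all_files = expand(template, [2019, 2021], ["ksat", "clay"])
only_2020 = expand(template, [2019, 2021], ["ksat", "clay"], years={2020})
```

Without filters the expansion yields 3 years x 2 variables = 6 files; restricting to 2020 leaves the two files for that year, in `variables_available` order.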

DatasetEntry ¶

Bases: BaseModel

Describe a single dataset in the registry.

Each entry captures everything needed to access, process, and normalize a source dataset: the access strategy, connection parameters (STAC catalog URL, collection, asset key, etc.), coordinate system, and the list of available variables.

The strategy field determines which data access pathway is used:

"stac_cog" -- STAC COG via Planetary Computer (3DEP, gNATSGO).
"local_tiff" -- local GeoTIFF files (POLARIS, GFv1.1).
"nhgf_stac" -- NHGF STAC catalog (NLCD Annual on OSN).
"climr_cat" -- ClimateR-Catalog via OPeNDAP (gridMET).
"native_zarr" / "converted_zarr" -- Zarr stores (planned).

ATTRIBUTE	DESCRIPTION
`description`	Human-readable dataset description. TYPE: `str`
`strategy`	Data access strategy identifier. TYPE: `str`
`catalog_url`	STAC catalog URL (required for `stac_cog`). TYPE: `str or None`
`collection`	STAC collection name (required for `stac_cog` and `nhgf_stac`). TYPE: `str or None`
`asset_key`	Default STAC asset key. Default `"data"`. TYPE: `str`
`gsd`	Ground sample distance in metres (STAC COG spatial resolution). TYPE: `int or None`
`sign`	STAC signing method (e.g., `"planetary-computer"`). TYPE: `str or None`
`source`	Local file path or remote URL for Zarr/local_tiff datasets. TYPE: `str or None`
`download`	Download provenance for datasets requiring local staging. TYPE: `DownloadInfo or None`
`catalog_id`	ClimateR-Catalog identifier (required for `climr_cat`). TYPE: `str or None`
`crs`	Coordinate reference system as an EPSG string. Default `"EPSG:4326"`. TYPE: `str`
`x_coord`	Name of the x/longitude coordinate. Default `"x"`. TYPE: `str`
`y_coord`	Name of the y/latitude coordinate. Default `"y"`. TYPE: `str`
`t_coord`	Name of the time coordinate (required for temporal datasets). TYPE: `str or None`
`variables`	Variables directly available in this dataset. TYPE: `list[VariableSpec]`
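`derived_variables`	Variables computed from other variables in this dataset. TYPE: `list[DerivedVariableSpec]`
`derived_categorical_variables`	Multi-source categorical variables derived from other variables in this dataset. Searched by `resolve_variable` after `derived_variables`. TYPE: `list[DerivedCategoricalSpec]`
`derived_continuous_variables`	Multi-source continuous variables derived from other variables in this dataset. Searched by `resolve_variable` last. TYPE: `list[DerivedContinuousSpec]`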
`category`	Dataset category for grouping (e.g., `"topography"`, `"soils"`, `"land_cover"`). TYPE: `str`
`temporal`	`True` for time-indexed datasets (e.g., gridMET, SNODAS). TYPE: `bool`
`time_step`	Temporal resolution of the dataset. Required when `temporal` is `True`. `None` for static datasets. TYPE: `{'daily', 'monthly'} or None`
`year_range`	Two-element `[start, end]` list of available calendar years. Must satisfy `start <= end`. TYPE: `list[int] or None`

RAISES	DESCRIPTION
`ValueError`	If required strategy-specific fields are missing, or if constraints are violated (e.g., temporal without `t_coord`).
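
The constraints listed in the attribute table can be sketched as a plain-Python check. This is an assumption-laden illustration, not the model's actual validator: `REQUIRED_BY_STRATEGY` and `check_entry` are hypothetical names, and the real validation may enforce more (or different) rules.

```python
# Fields the attribute table marks as required for each strategy.
REQUIRED_BY_STRATEGY = {
    "stac_cog": ("catalog_url", "collection"),
    "nhgf_stac": ("collection",),
    "climr_cat": ("catalog_id",),
}


def check_entry(entry: dict) -> None:
    """Raise ValueError if a dataset entry violates the documented constraints."""
    required = REQUIRED_BY_STRATEGY.get(entry["strategy"], ())
    missing = [field for field in required if not entry.get(field)]
    if missing:
        raise ValueError(f"{entry['strategy']} requires: {', '.join(missing)}")
    if entry.get("temporal"):
        # Temporal datasets need a time coordinate and a time step.
        for field in ("t_coord", "time_step"):
            if not entry.get(field):
                raise ValueError(f"temporal datasets require {field}")
    year_range = entry.get("year_range")
    if year_range is not None and (len(year_range) != 2 or year_range[0] > year_range[1]):
        raise ValueError("year_range must be [start, end] with start <= end")


# A valid temporal entry passes silently.
check_entry({
    "strategy": "climr_cat",
    "catalog_id": "gridmet",
    "temporal": True,
    "t_coord": "time",
    "time_step": "daily",
    "year_range": [1980, 2020],
})
```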

DatasetRegistry ¶

Bases: BaseModel

Contain and query all registered datasets.

Provides lookup by name and variable resolution across the full set of loaded datasets. Typically created by load_registry() from one or more YAML files.

ATTRIBUTE	DESCRIPTION
`datasets`	Mapping of dataset name to entry. Names must be unique across all registry files. TYPE: `dict[str, DatasetEntry]`

get ¶

get(name: str) -> DatasetEntry

Look up a dataset by name.

PARAMETER	DESCRIPTION
`name`	Dataset name as it appears in the registry YAML (e.g., `"3dep"`, `"gnatsgo"`, `"gridmet"`). TYPE: `str`

RETURNS	DESCRIPTION
`DatasetEntry`	The matching dataset entry.

RAISES	DESCRIPTION
`KeyError`	If `name` is not found. The error message lists all available dataset names for debugging.

Source code in src/hydro_param/dataset_registry.py

def get(self, name: str) -> DatasetEntry:
    """Look up a dataset by name.

    Parameters
    ----------
    name : str
        Dataset name as it appears in the registry YAML (e.g.,
        ``"3dep"``, ``"gnatsgo"``, ``"gridmet"``).

    Returns
    -------
    DatasetEntry
        The matching dataset entry.

    Raises
    ------
    KeyError
        If ``name`` is not found.  The error message lists all
        available dataset names for debugging.
    """
    if name not in self.datasets:
        available = ", ".join(sorted(self.datasets.keys()))
        raise KeyError(f"Dataset '{name}' not found in registry. Available: {available}")
    return self.datasets[name]

resolve_variable ¶

resolve_variable(
    dataset_name: str, variable_name: str
) -> (
    VariableSpec
    | DerivedVariableSpec
    | DerivedCategoricalSpec
    | DerivedContinuousSpec
)

Resolve a variable name to its specification within a dataset.

Search direct variables, derived variables, derived categorical variables, and derived continuous variables in the named dataset. Direct variables are checked first.

PARAMETER	DESCRIPTION
`dataset_name`	Dataset name in the registry (e.g., `"3dep"`). TYPE: `str`
`variable_name`	Variable name to look up (e.g., `"elevation"`, `"slope"`). TYPE: `str`

RETURNS	DESCRIPTION
`VariableSpec or DerivedVariableSpec or DerivedCategoricalSpec or DerivedContinuousSpec`	The matching variable specification. Direct variables are checked first, then derived, then derived categorical, then derived continuous.

RAISES	DESCRIPTION
`KeyError`	If the dataset is not found in the registry, or the variable is not found in the dataset. The error message lists all available variable names for debugging.

Source code in src/hydro_param/dataset_registry.py

def resolve_variable(
    self, dataset_name: str, variable_name: str
) -> VariableSpec | DerivedVariableSpec | DerivedCategoricalSpec | DerivedContinuousSpec:
    """Resolve a variable name to its specification within a dataset.

    Search direct variables, derived variables, derived categorical
    variables, and derived continuous variables in the named dataset.
    Direct variables are checked first.

    Parameters
    ----------
    dataset_name : str
        Dataset name in the registry (e.g., ``"3dep"``).
    variable_name : str
        Variable name to look up (e.g., ``"elevation"``, ``"slope"``).

    Returns
    -------
    VariableSpec or DerivedVariableSpec or DerivedCategoricalSpec or DerivedContinuousSpec
        The matching variable specification.  Direct variables are
        checked first, then derived, then derived categorical, then
        derived continuous.

    Raises
    ------
    KeyError
        If the dataset is not found in the registry, or the variable
        is not found in the dataset.  The error message lists all
        available variable names for debugging.
    """
    entry = self.get(dataset_name)
    for v in entry.variables:
        if v.name == variable_name:
            return v
    for dv in entry.derived_variables:
        if dv.name == variable_name:
            return dv
    for dcv in entry.derived_categorical_variables:
        if dcv.name == variable_name:
            return dcv
    for dcont in entry.derived_continuous_variables:
        if dcont.name == variable_name:
            return dcont
    available = (
        [v.name for v in entry.variables]
        + [dv.name for dv in entry.derived_variables]
        + [dcv.name for dcv in entry.derived_categorical_variables]
        + [dcont.name for dcont in entry.derived_continuous_variables]
    )
    raise KeyError(
        f"Variable '{variable_name}' not found in dataset '{dataset_name}'. "
        f"Available: {', '.join(available)}"
    )
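
One consequence of the search order is that a direct variable shadows any derived variable with the same name. The sketch below reproduces the order with `itertools.chain`; `resolve` and the `SimpleNamespace` objects are hypothetical stand-ins so the example runs without the package.

```python
from itertools import chain
from types import SimpleNamespace


def resolve(entry, variable_name):
    """Return the first spec named `variable_name`, searching direct variables first."""
    groups = (
        entry.variables,
        entry.derived_variables,
        entry.derived_categorical_variables,
        entry.derived_continuous_variables,
    )
    for spec in chain(*groups):
        if spec.name == variable_name:
            return spec
    available = ", ".join(spec.name for spec in chain(*groups))
    raise KeyError(f"Variable '{variable_name}' not found. Available: {available}")


entry = SimpleNamespace(
    variables=[SimpleNamespace(name="elevation", kind="direct")],
    derived_variables=[
        SimpleNamespace(name="slope", kind="derived"),
        SimpleNamespace(name="elevation", kind="derived"),  # shadowed by the direct one
    ],
    derived_categorical_variables=[],
    derived_continuous_variables=[],
)

found = resolve(entry, "elevation")
```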

get_all_dataset_names ¶

get_all_dataset_names(
    registry: DatasetRegistry,
) -> set[str]

Return the set of all dataset names in the registry.

PARAMETER	DESCRIPTION
`registry`	A loaded dataset registry. TYPE: `DatasetRegistry`

RETURNS	DESCRIPTION
`set[str]`	All dataset names (e.g., `{"dem_3dep_10m", "gridmet", ...}`).

Source code in src/hydro_param/dataset_registry.py

def get_all_dataset_names(registry: DatasetRegistry) -> set[str]:
    """Return the set of all dataset names in the registry.

    Parameters
    ----------
    registry : DatasetRegistry
        A loaded dataset registry.

    Returns
    -------
    set[str]
        All dataset names (e.g., ``{"dem_3dep_10m", "gridmet", ...}``).
    """
    return set(registry.datasets.keys())

load_registry ¶

load_registry(
    path: str | Path,
    *,
    overlay_dirs: list[Path] | None = None,
) -> DatasetRegistry

Load a dataset registry from YAML file(s), with optional overlays.

When path is a directory, all *.yml and *.yaml files are loaded and merged into a single registry. Dataset names must be unique across all files -- duplicates raise ValueError.

Overlay directories (e.g., ~/.hydro-param/datasets/) are scanned after the primary registry. Overlay entries are merged into the result; on name collision, the overlay entry replaces the primary entry (no partial merge). Non-existent or empty overlay directories are silently skipped.

PARAMETER	DESCRIPTION
`path`	Path to a single registry YAML file, or a directory containing per-category YAML files (e.g., the bundled `hydro_param.data.datasets`). Each file must have a top-level `datasets:` key mapping dataset names to entries. TYPE: `str or Path`
`overlay_dirs`	Optional list of directories containing user-local registry overlays. Each directory is scanned for `.yml`/`.yaml` files. Later directories take precedence over earlier ones. TYPE: `list[Path] or None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DatasetRegistry`	Merged registry containing all datasets found.

RAISES	DESCRIPTION
`FileNotFoundError`	If `path` does not exist, is neither file nor directory, or the directory contains no YAML files with datasets.
`ValueError`	If a dataset name appears in more than one YAML file within the primary registry directory. Overlay collisions with the primary registry are resolved silently (overlay wins).

Examples:

>>> from hydro_param.dataset_registry import load_registry
>>> from hydro_param.pipeline import DEFAULT_REGISTRY
>>> registry = load_registry(DEFAULT_REGISTRY)
>>> entry = registry.get("dem_3dep_10m")

Source code in src/hydro_param/dataset_registry.py

def load_registry(
    path: str | Path,
    *,
    overlay_dirs: list[Path] | None = None,
) -> DatasetRegistry:
    """Load a dataset registry from YAML file(s), with optional overlays.

    When ``path`` is a directory, all ``*.yml`` and ``*.yaml`` files are
    loaded and merged into a single registry.  Dataset names must be
    unique across all files -- duplicates raise ``ValueError``.

    Overlay directories (e.g., ``~/.hydro-param/datasets/``) are scanned
    after the primary registry.  Overlay entries are merged into the
    result; on name collision, the overlay entry replaces the primary
    entry (no partial merge).  Non-existent or empty overlay directories
    are silently skipped.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to a single registry YAML file, or a directory containing
        per-category YAML files (e.g., the bundled
        ``hydro_param.data.datasets``).  Each file must have a top-level
        ``datasets:`` key mapping dataset names to entries.
    overlay_dirs : list[Path] or None
        Optional list of directories containing user-local registry
        overlays.  Each directory is scanned for ``*.yml``/``*.yaml``
        files.  Later directories take precedence over earlier ones.

    Returns
    -------
    DatasetRegistry
        Merged registry containing all datasets found.

    Raises
    ------
    FileNotFoundError
        If ``path`` does not exist, is neither file nor directory, or
        the directory contains no YAML files with datasets.
    ValueError
        If a dataset name appears in more than one YAML file within
        the *primary* registry directory.  Overlay collisions with the
        primary registry are resolved silently (overlay wins).

    Examples
    --------
    >>> from hydro_param.pipeline import DEFAULT_REGISTRY
    >>> registry = load_registry(DEFAULT_REGISTRY)
    >>> entry = registry.get("dem_3dep_10m")
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Registry path does not exist: {path}")
    if path.is_file():
        registry = _load_registry_file(path)
    elif path.is_dir():
        registry = _load_registry_dir(path)
    else:
        raise FileNotFoundError(f"Registry path is neither a file nor directory: {path}")

    if overlay_dirs:
        registry = _merge_overlays(registry, overlay_dirs)

    return registry
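
The merge semantics described in the docstring can be sketched with plain dictionaries. The helpers `_load_registry_dir` and `_merge_overlays` are not shown on this page, so the functions below are assumptions that only mirror the documented behaviour: duplicates across primary files are an error, while overlays replace whole entries and later overlays win.

```python
def merge_primary_files(files: dict[str, dict]) -> dict:
    """Merge per-file dataset mappings, rejecting duplicate dataset names."""
    merged: dict = {}
    for filename, datasets in files.items():
        duplicates = merged.keys() & datasets.keys()
        if duplicates:
            raise ValueError(f"Duplicate dataset(s) in {filename}: {sorted(duplicates)}")
        merged.update(datasets)
    return merged


def apply_overlays(primary: dict, overlays: list[dict]) -> dict:
    """Replace whole entries on name collision; later overlays take precedence."""
    result = dict(primary)
    for overlay in overlays:
        result.update(overlay)  # no partial merge of an entry's fields
    return result


primary = merge_primary_files({
    "topography.yml": {"dem": {"gsd": 10, "crs": "EPSG:4326"}},
    "soils.yml": {"polaris": {"gsd": 30}},
})
merged = apply_overlays(primary, [{"dem": {"gsd": 30}}, {"dem": {"gsd": 1}}])
```

Note that the overlaid `dem` entry no longer carries `crs`: the overlay replaces the entry rather than updating individual fields.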
