dataset_registry¶
Dataset registry: load and resolve dataset definitions from YAML.
Map human-readable dataset names to access strategies, variable
specifications, and derivation rules. The registry is the single source
of truth for "what datasets exist and how to access them." Pipeline
stage 2 (stage2_resolve_datasets) consults the registry to resolve
user-requested datasets and variables into concrete access instructions.
The registry supports five access strategies (stac_cog, local_tiff,
nhgf_stac, climr_cat, native_zarr/converted_zarr) and four
variable types (direct VariableSpec, terrain-derived
DerivedVariableSpec, multi-source categorical
DerivedCategoricalSpec, and multi-source continuous
DerivedContinuousSpec).
References
.. [1] docs/design.md, section 6.6 -- Dataset registry schema design.
.. [2] docs/design.md, section 11.3 -- Registry YAML conventions.
See Also
hydro_param.config : Pipeline configuration schema (DatasetRequest).
hydro_param.data_access : Functions that use registry entries to fetch data.
VALID_CATEGORIES
module-attribute
¶
VALID_CATEGORIES: frozenset[str] = frozenset(
{
"climate",
"geology",
"hydrography",
"land_cover",
"snow",
"soils",
"topography",
"water_bodies",
}
)
Valid dataset registry categories.
These correspond to the per-category YAML files bundled in
hydro_param/data/datasets/. Used by :class:~hydro_param.config.PipelineConfig
to validate category keys in the datasets: config section.
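A minimal sketch of how category keys could be checked against this set. The helper `unknown_categories` is hypothetical and not part of hydro_param; the set is copied from the definition above so the example runs on its own.

```python
# Copied from the module attribute above so the sketch is self-contained.
VALID_CATEGORIES: frozenset[str] = frozenset(
    {
        "climate",
        "geology",
        "hydrography",
        "land_cover",
        "snow",
        "soils",
        "topography",
        "water_bodies",
    }
)


def unknown_categories(keys):
    """Return the keys that are not registry categories (hypothetical helper)."""
    return sorted(set(keys) - VALID_CATEGORIES)


print(unknown_categories(["soils", "topography"]))  # []
print(unknown_categories(["soils", "landcover"]))   # ['landcover']
```

A set difference reports every unknown key at once, which is usually more helpful in a config error than failing on the first one.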
VariableSpec
¶
Bases: BaseModel
Describe a variable available directly in a source dataset.
Each VariableSpec maps a logical variable name to its location
within a source dataset (band number, STAC asset key, or file path
override) and carries metadata for SIR normalization (units, long
name, categorical flag).
| ATTRIBUTE | DESCRIPTION |
|---|---|
name |
Logical variable name used throughout the pipeline (e.g.,
TYPE:
|
band |
Raster band number for multi-band GeoTIFFs. Default
TYPE:
|
units |
Source data units (e.g.,
TYPE:
|
long_name |
Human-readable description for NetCDF attributes and documentation.
TYPE:
|
native_name |
Variable name in the source data (e.g., OPeNDAP/CF name like
TYPE:
|
categorical |
TYPE:
|
asset_key |
Per-variable STAC asset key override (e.g.,
TYPE:
|
source_override |
Per-variable source path or URL override (e.g., individual POLARIS
VRT files). When
TYPE:
|
scale_factor |
Multiplicative scale factor for integer-encoded rasters (e.g.,
TYPE:
|
DerivedVariableSpec
¶
Bases: BaseModel
Describe a variable derived from another variable in the same dataset.
Derived variables are computed from a source variable using a named method (e.g., slope and aspect from elevation via terrain analysis). They are resolved alongside direct variables in stage 2 and processed in stage 4.
| ATTRIBUTE | DESCRIPTION |
|---|---|
name |
Logical name for the derived variable (e.g.,
TYPE:
|
source |
Name of the source
TYPE:
|
method |
Derivation method passed to the derivation function (e.g.,
TYPE:
|
units |
Units of the derived variable (e.g.,
TYPE:
|
long_name |
Human-readable description for metadata.
TYPE:
|
DerivedCategoricalSpec
¶
Bases: BaseModel
Describe a categorical variable derived from multiple source variables.
Multi-source categorical derivations classify pixels by combining two or more source bands (e.g., USDA texture triangle from sand/silt/clay percentages). The result is a single-band categorical raster processed with categorical zonal statistics to produce per-class fraction columns.
Unlike DerivedVariableSpec (single source, continuous output),
this always produces categorical output with per-class fractions.
| ATTRIBUTE | DESCRIPTION |
|---|---|
name |
Logical name for the derived variable (e.g.,
TYPE:
|
sources |
Names of the source
TYPE:
|
method |
Classification method key used to look up the derivation
function via
TYPE:
|
units |
Units of the derived variable (typically
TYPE:
|
long_name |
Human-readable description for metadata.
TYPE:
|
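To illustrate the "classify pixels, then compute per-class fractions" idea described above, here is a small standard-library sketch. The rule in `classify_pixel` is a toy three-class threshold, not the USDA texture triangle, and all names and pixel values are hypothetical.

```python
from collections import Counter


def classify_pixel(sand, silt, clay):
    """Toy three-class rule (NOT the USDA texture triangle).

    silt is accepted for symmetry with the three source bands but is
    not used by this simplified rule.
    """
    if clay >= 40:
        return "clayey"
    if sand >= 50:
        return "sandy"
    return "loamy"


def class_fractions(classes):
    """Return the fraction of pixels in each class within one zone."""
    counts = Counter(classes)
    total = len(classes)
    return {name: n / total for name, n in counts.items()}


# Four aligned pixels from three hypothetical source bands (percent).
sand = [70, 20, 30, 60]
silt = [20, 30, 40, 30]
clay = [10, 50, 30, 10]

classes = [classify_pixel(s, si, c) for s, si, c in zip(sand, silt, clay)]
fractions = class_fractions(classes)

print(classes)    # ['sandy', 'clayey', 'loamy', 'sandy']
print(fractions)  # {'sandy': 0.5, 'clayey': 0.25, 'loamy': 0.25}
```

Each key of `fractions` corresponds to one per-class fraction column for the zone.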
DerivedContinuousSpec
¶
Bases: BaseModel
Describe a continuous variable derived from pixel-level arithmetic on multiple sources.
Multi-source continuous derivations apply an arithmetic operation (multiply, divide, add, subtract) to two or more aligned source rasters before zonal statistics. This preserves within-HRU spatial correlation that would be lost by aggregating each raster independently and combining the results.
Unlike DerivedCategoricalSpec (multi-source, categorical output),
this always produces continuous output processed with standard
zonal statistics (mean, median, etc.).
| ATTRIBUTE | DESCRIPTION |
|---|---|
name |
Logical name for the derived variable (e.g.,
TYPE:
|
sources |
Names of the source
TYPE:
|
operation |
Arithmetic operation applied left-to-right across sources
via
TYPE:
|
align_to |
Name of the source whose grid (resolution, extent, CRS)
is used as the resampling template. Must be one of
TYPE:
|
units |
Units of the derived variable after the operation.
TYPE:
|
long_name |
Human-readable description for metadata.
TYPE:
|
scale_factor |
Multiplicative factor applied to zonal statistics output (e.g., 0.01 to convert from percent to fraction).
TYPE:
|
resampling_method |
Rasterio resampling method name for aligning non-template
sources (default
TYPE:
|
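The claim that pixel-level arithmetic preserves within-HRU spatial correlation can be checked with a small worked example. The sketch below uses `functools.reduce` as one way to apply an operation left to right; whether hydro_param does the same internally is an assumption, and the pixel values are made up.

```python
from functools import reduce
from operator import mul
from statistics import mean

# Two aligned source rasters, flattened to the pixels inside one HRU.
# The values are negatively correlated: where a is high, b is low.
a = [1.0, 2.0, 3.0, 4.0]
b = [4.0, 3.0, 2.0, 1.0]

# Pixel-level: combine the sources left to right per pixel, then aggregate.
pixel_product = [reduce(mul, values) for values in zip(a, b)]
pixel_level = mean(pixel_product)

# Aggregate-first: take each zonal mean, then combine the two means.
aggregate_first = mean(a) * mean(b)

print(pixel_level)      # 5.0
print(aggregate_first)  # 6.25
```

The two results differ because the mean of a product equals the product of the means only when the two rasters are uncorrelated within the zone.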
DownloadFile
¶
Bases: BaseModel
Describe a single downloadable file in a multi-file dataset.
Used by the hydro-param datasets download CLI command to stage
local data files required by the local_tiff access strategy.
| ATTRIBUTE | DESCRIPTION |
|---|---|
year |
Calendar year this file covers.
TYPE:
|
variable |
Variable name this file provides (e.g.,
TYPE:
|
url |
Direct download URL for the file.
TYPE:
|
size_gb |
Approximate file size in gigabytes for progress reporting.
TYPE:
|
DownloadInfo
¶
Bases: BaseModel
Describe download provenance for datasets requiring local staging.
Some datasets (e.g., POLARIS soil data, GFv1.1 rasters) cannot be
accessed through STAC or OPeNDAP and must be downloaded to local disk
before processing. DownloadInfo records where to get the data,
how large it is, and whether requester-pays access is needed.
Supports two modes: explicit files (a fixed list of
DownloadFile entries) and template mode (a URL template
expanded over year_range x variables_available).
| ATTRIBUTE | DESCRIPTION |
|---|---|
url |
Single-file download URL (mutually exclusive with
TYPE:
|
size_gb |
Approximate total download size in gigabytes.
TYPE:
|
format |
File format description (e.g.,
TYPE:
|
notes |
Human-readable notes about access requirements.
TYPE:
|
files |
Explicit list of downloadable files (multi-file datasets).
TYPE:
|
url_template |
Python format string with
TYPE:
|
year_range |
Two-element
TYPE:
|
variables_available |
Variable names available for template expansion. Required when
TYPE:
|
requester_pays |
TYPE:
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If none of |
expand_files
¶
expand_files(
*,
years: set[int] | None = None,
variables: set[str] | None = None,
) -> list[DownloadFile]
Expand download sources into a concrete list of files.
For template mode, iterate year_range x variables_available
and format the url_template with {variable} and {year}
placeholders. For explicit files mode, return the files
list. In both modes, optional years and variables filters
restrict the output.
| PARAMETER | DESCRIPTION |
|---|---|
years
|
If given, only include files matching these calendar years.
TYPE:
|
variables
|
If given, only include files matching these variable names.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[DownloadFile]
|
Expanded and filtered list of downloadable files. |
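The template mode described above can be sketched with `itertools.product` and `str.format`. This is an illustration, not the library's implementation: `expand_template` is a hypothetical stand-in, it returns plain dicts rather than DownloadFile objects, it assumes year_range is an inclusive (start, end) pair, and the URL and variable names are placeholders.

```python
from itertools import product


def expand_template(
    url_template, year_range, variables_available, *, years=None, variables=None
):
    """Expand a URL template over years x variables, with optional filters.

    Assumes year_range is an inclusive (start, end) pair.
    """
    start, end = year_range
    files = []
    for year, variable in product(range(start, end + 1), variables_available):
        if years is not None and year not in years:
            continue
        if variables is not None and variable not in variables:
            continue
        url = url_template.format(variable=variable, year=year)
        files.append({"year": year, "variable": variable, "url": url})
    return files


files = expand_template(
    "https://example.com/{variable}/{variable}_{year}.tif",  # placeholder URL
    (2000, 2002),
    ["pr", "tmax"],  # hypothetical variable names
    years={2001},
)
for f in files:
    print(f["url"])
# https://example.com/pr/pr_2001.tif
# https://example.com/tmax/tmax_2001.tif
```

Without filters, the same call would yield 3 years x 2 variables = 6 entries.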
Source code in src/hydro_param/dataset_registry.py
DatasetEntry
¶
Bases: BaseModel
Describe a single dataset in the registry.
Each entry captures everything needed to access, process, and normalize a source dataset: the access strategy, connection parameters (STAC catalog URL, collection, asset key, etc.), coordinate system, and the list of available variables.
The strategy field determines which data access pathway is used:
"stac_cog" -- STAC COG via Planetary Computer (3DEP, gNATSGO).
"local_tiff" -- local GeoTIFF files (POLARIS, GFv1.1).
"nhgf_stac" -- NHGF STAC catalog (NLCD Annual on OSN).
"climr_cat" -- ClimateR-Catalog via OPeNDAP (gridMET).
"native_zarr" / "converted_zarr" -- Zarr stores (planned).
| ATTRIBUTE | DESCRIPTION |
|---|---|
description |
Human-readable dataset description.
TYPE:
|
strategy |
Data access strategy identifier.
TYPE:
|
catalog_url |
STAC catalog URL (required for
TYPE:
|
collection |
STAC collection name (required for
TYPE:
|
asset_key |
Default STAC asset key. Default
TYPE:
|
gsd |
Ground sample distance in metres (STAC COG spatial resolution).
TYPE:
|
sign |
STAC signing method (e.g.,
TYPE:
|
source |
Local file path or remote URL for Zarr/local_tiff datasets.
TYPE:
|
download |
Download provenance for datasets requiring local staging.
TYPE:
|
catalog_id |
ClimateR-Catalog identifier (required for
TYPE:
|
crs |
Coordinate reference system as an EPSG string. Default
TYPE:
|
x_coord |
Name of the x/longitude coordinate. Default
TYPE:
|
y_coord |
Name of the y/latitude coordinate. Default
TYPE:
|
t_coord |
Name of the time coordinate (required for temporal datasets).
TYPE:
|
variables |
Variables directly available in this dataset.
TYPE:
|
derived_variables |
Variables computed from other variables in this dataset.
TYPE:
|
category |
Dataset category for grouping (e.g.,
TYPE:
|
temporal |
TYPE:
|
time_step |
Temporal resolution of the dataset. Required when
TYPE:
|
year_range |
Two-element
TYPE:
|
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If required strategy-specific fields are missing, or if
constraints are violated (e.g., temporal without |
DatasetRegistry
¶
Bases: BaseModel
Contain and query all registered datasets.
Provides lookup by name and variable resolution across the full set
of loaded datasets. Typically created by load_registry() from
one or more YAML files.
| ATTRIBUTE | DESCRIPTION |
|---|---|
datasets |
Mapping of dataset name to entry. Names must be unique across all registry files.
TYPE:
|
See Also
load_registry : Load a registry from YAML file(s).
get
¶
Look up a dataset by name.
| PARAMETER | DESCRIPTION |
|---|---|
name
|
Dataset name as it appears in the registry YAML (e.g.,
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DatasetEntry
|
The matching dataset entry. |
| RAISES | DESCRIPTION |
|---|---|
KeyError
|
If name is not found in the registry. |
resolve_variable
¶
resolve_variable(
dataset_name: str, variable_name: str
) -> (
VariableSpec
| DerivedVariableSpec
| DerivedCategoricalSpec
| DerivedContinuousSpec
)
Resolve a variable name to its specification within a dataset.
Search direct variables, derived variables, derived categorical variables, and derived continuous variables in the named dataset. Direct variables are checked first.
| PARAMETER | DESCRIPTION |
|---|---|
dataset_name
|
Dataset name in the registry (e.g.,
TYPE:
|
variable_name
|
Variable name to look up (e.g.,
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
VariableSpec or DerivedVariableSpec or DerivedCategoricalSpec or DerivedContinuousSpec
|
The matching variable specification. Direct variables are checked first, then derived, then derived categorical, then derived continuous. |
| RAISES | DESCRIPTION |
|---|---|
KeyError
|
If the dataset is not found in the registry, or the variable is not found in the dataset. The error message lists all available variable names for debugging. |
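The lookup order described above (direct, then derived, then derived categorical, then derived continuous) can be sketched with plain dicts. `resolve_in_dataset` is a hypothetical stand-in for the real method, and the group keys `derived_categorical` and `derived_continuous` are assumed names, since the corresponding fields are not listed on this page.

```python
def resolve_in_dataset(dataset, variable_name):
    """Look up a variable in a dataset dict, checking groups in a fixed order."""
    # Group keys after "derived_variables" are assumed names for illustration.
    groups = (
        "variables",
        "derived_variables",
        "derived_categorical",
        "derived_continuous",
    )
    for group in groups:
        for spec in dataset.get(group, []):
            if spec["name"] == variable_name:
                return spec
    available = sorted(
        spec["name"] for group in groups for spec in dataset.get(group, [])
    )
    raise KeyError(f"Variable {variable_name!r} not found; available: {available}")


dataset = {
    "variables": [{"name": "elevation", "kind": "direct"}],
    "derived_variables": [
        {"name": "slope", "kind": "derived", "source": "elevation"}
    ],
}

print(resolve_in_dataset(dataset, "slope")["kind"])  # derived

try:
    resolve_in_dataset(dataset, "aspect")
except KeyError as err:
    print(err)  # the message lists 'elevation' and 'slope'
```

Listing the available names in the error message mirrors the debugging aid described in the Raises table.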
get_all_dataset_names
¶
Return the set of all dataset names in the registry.
| PARAMETER | DESCRIPTION |
|---|---|
registry
|
A loaded dataset registry.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
set[str]
|
All dataset names (e.g., |
load_registry
¶
Load a dataset registry from YAML file(s), with optional overlays.
When path is a directory, all *.yml and *.yaml files are
loaded and merged into a single registry. Dataset names must be
unique across all files -- duplicates raise ValueError.
Overlay directories (e.g., ~/.hydro-param/datasets/) are scanned
after the primary registry. Overlay entries are merged into the
result; on name collision, the overlay entry replaces the primary
entry (no partial merge). Non-existent or empty overlay directories
are silently skipped.
| PARAMETER | DESCRIPTION |
|---|---|
path
|
Path to a single registry YAML file, or a directory containing
per-category YAML files (e.g., the bundled
TYPE:
|
overlay_dirs
|
Optional list of directories containing user-local registry
overlays. Each directory is scanned for
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DatasetRegistry
|
Merged registry containing all datasets found. |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If |
ValueError
|
If a dataset name appears in more than one YAML file within the primary registry directory. Overlay collisions with the primary registry are resolved silently (overlay wins). |
Examples:
>>> from hydro_param.dataset_registry import load_registry
>>> from hydro_param.pipeline import DEFAULT_REGISTRY
>>> registry = load_registry(DEFAULT_REGISTRY)
>>> entry = registry.get("dem_3dep_10m")
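The merge rules described above (duplicates within the primary registry raise ValueError, while overlay entries replace primary entries wholesale) can be sketched on already-parsed mappings. `merge_registries` is a hypothetical helper, not the library's code; the dataset name `polaris_30m` and the field values are illustrative, and letting a later overlay win over an earlier one is an assumption.

```python
def merge_registries(primary_files, overlay_files=()):
    """Merge per-file dataset mappings.

    Each argument is an iterable of {dataset_name: entry} dicts, one per
    YAML file that has already been parsed.
    """
    merged = {}
    for datasets in primary_files:
        duplicates = merged.keys() & datasets.keys()
        if duplicates:
            raise ValueError(f"Duplicate dataset names: {sorted(duplicates)}")
        merged.update(datasets)
    for datasets in overlay_files:
        # Whole-entry replacement: no field-level merge with the primary entry.
        merged.update(datasets)
    return merged


primary = [
    {"dem_3dep_10m": {"strategy": "stac_cog", "gsd": 10}},
    {"polaris_30m": {"strategy": "local_tiff"}},
]
overlay = [{"polaris_30m": {"strategy": "local_tiff", "source": "/data/polaris"}}]

merged = merge_registries(primary, overlay)
print(sorted(merged))         # ['dem_3dep_10m', 'polaris_30m']
print(merged["polaris_30m"])  # {'strategy': 'local_tiff', 'source': '/data/polaris'}
```

Because the overlay entry replaces the primary entry as a whole, any field the overlay omits is dropped rather than inherited.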