config

Pipeline configuration: Pydantic models and YAML loader.

Define the declarative configuration schema for the hydro-param pipeline, matching design.md section 11.6. Configs express what to compute (target fabric, datasets, statistics, output format) but never how -- all processing logic lives in Python code, not in YAML.

The schema is validated at load time by Pydantic v2 so that invalid configs fail fast with clear error messages before any data is fetched.

See Also

hydro_param.pipeline : Orchestrator that consumes these config objects.
hydro_param.dataset_registry : Registry that resolves dataset names referenced in :class:DatasetRequest.

TargetFabricConfig

Bases: BaseModel

Specify the target polygon fabric to parameterize.

The target fabric is the spatial mesh (catchments, HRUs, grid cells) whose features receive zonal statistics from source datasets. The fabric must be a pre-existing geospatial file -- hydro-param does not fetch or subset fabrics (use pynhd/pygeohydro upstream).

ATTRIBUTE DESCRIPTION
path

Path to the fabric file (GeoPackage, GeoParquet, or Shapefile).

TYPE: Path

id_field

Column name containing unique feature identifiers. This becomes the index/dimension name in all output files and the SIR xarray Dataset.

TYPE: str

crs

Coordinate reference system of the fabric file as an EPSG string. Defaults to "EPSG:4326" (WGS 84).

TYPE: str

Notes

The id_field propagates through the entire pipeline: it controls the xarray dimension name in the SIR, the CSV index column, and the feature matching in the pywatershed derivation plugin. Typical values are "nhm_id" (pywatershed/NHM), "featureid" (NHDPlus), or "hru_id" (custom fabrics).
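A minimal sketch of the documented fields, using a plain dataclass as a stand-in for the actual Pydantic v2 model (the file path and id value below are illustrative, not from the source):

```python
from dataclasses import dataclass
from pathlib import Path


# Stand-in for TargetFabricConfig; field names and the crs default
# follow the documentation above.
@dataclass
class TargetFabricSketch:
    path: Path
    id_field: str
    crs: str = "EPSG:4326"  # documented default (WGS 84)


# Hypothetical fabric file and id column, typical for a pywatershed/NHM setup.
fabric = TargetFabricSketch(path=Path("fabrics/nhm_hrus.gpkg"), id_field="nhm_id")
print(fabric.crs)  # default applies when crs is omitted
```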

DomainConfig

Bases: BaseModel

Define the spatial domain that restricts which fabric features are processed.

When a domain is configured, stage 1 clips the target fabric to the specified extent before any data fetching or zonal statistics. When omitted, the full fabric extent is used.

Only type="bbox" is currently implemented; HUC and gage-based subsetting are planned.

ATTRIBUTE DESCRIPTION
type

Domain specification method.

TYPE: {'bbox', 'huc2', 'huc4', 'gage'}

bbox

Bounding box as [west, south, east, north] in EPSG:4326 (degrees). Required when type="bbox".

TYPE: list[float] or None

id

Identifier for HUC or gage-based domains (e.g., HUC-2 code or USGS gage ID). Required when type is "huc2", "huc4", or "gage".

TYPE: str or None

RAISES DESCRIPTION
ValueError

If the required field for the chosen type is missing.

DatasetRequest

Bases: BaseModel

Request a dataset and its variables for pipeline processing.

Each entry in a category list under the datasets: mapping of a pipeline YAML config becomes one DatasetRequest. The name is resolved against the dataset registry to obtain the fetch strategy, STAC collection, CRS, and variable metadata.

ATTRIBUTE DESCRIPTION
name

Dataset name as it appears in the registry (e.g., "dem_3dep_10m").

TYPE: str

source

Local file path override for local_tiff datasets. When set, this takes precedence over the registry-level source field.

TYPE: Path or None

variables

Variable names to extract (e.g., ["elevation", "slope"]). Empty list means no variables requested (unusual but valid).

TYPE: list[str]

statistics

Zonal statistics to compute for each variable. Defaults to ["mean"]. Common values: "mean", "majority", "minority", "sum", "min", "max", "median".

TYPE: list[str]

year

Year(s) for multi-year static datasets (e.g., NLCD on OSN). When a list is provided, the pipeline iterates over each year and produces year-suffixed output keys (e.g., "land_cover_2019"). Valid range: 1900--2100.

TYPE: int or list[int] or None

time_period

[start, end] ISO date strings ("YYYY-MM-DD") for temporal datasets (e.g., gridMET, SNODAS). Required when the registry marks the dataset as temporal: true.

TYPE: list[str] or None

RAISES DESCRIPTION
ValueError

If year list is empty, a year is outside 1900--2100, or time_period dates are invalid or out of order.

See Also

hydro_param.dataset_registry.DatasetEntry : Registry metadata resolved from name.

OutputConfig

Bases: BaseModel

Configure pipeline output location and format.

ATTRIBUTE DESCRIPTION
path

Directory for output files. Created automatically if it does not exist. Subdirectories are created per dataset category (e.g., topography/, soils/). Defaults to "./output".

TYPE: Path

format

File format for temporal output. Static per-variable files are always written as CSV. Defaults to "netcdf".

TYPE: {'netcdf', 'parquet'}

sir_name

Human-readable name for the output, used in CF-1.8 metadata attributes and log messages. Defaults to "result".

TYPE: str

ProcessingConfig

Bases: BaseModel

Control batching, fault tolerance, and networking.

ATTRIBUTE DESCRIPTION
batch_size

Maximum number of features per spatial batch. KD-tree recursive bisection groups nearby features to minimize data fetch extent. Must be > 0. Defaults to 500.

TYPE: int

resume

When True, skip datasets whose outputs are already current (checked via the pipeline manifest fingerprint). Defaults to False.

TYPE: bool

sir_validation

SIR validation mode for stage 5. "strict" raises on any validation warning; "tolerant" logs warnings and continues. Defaults to "tolerant".

TYPE: {'tolerant', 'strict'}

network_timeout

Timeout in seconds for GDAL HTTP operations (COG/vsicurl access). Applied to both GDAL_HTTP_TIMEOUT and GDAL_HTTP_CONNECTTIMEOUT environment variables. Must be > 0. Defaults to 120.

TYPE: int
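The batching idea behind batch_size can be illustrated with a minimal recursive-bisection sketch over feature centroids; the pipeline's actual KD-tree implementation may differ in details such as the split rule:

```python
Point = tuple[float, float]


# Recursively split centroids along the wider axis until every batch
# holds at most batch_size features, keeping nearby features together
# so each batch covers a compact fetch extent.
def bisect_batches(points: list[Point], batch_size: int) -> list[list[Point]]:
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    if len(points) <= batch_size:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return bisect_batches(ordered[:mid], batch_size) + bisect_batches(ordered[mid:], batch_size)


# Two spatial clusters split cleanly into two batches of two.
batches = bisect_batches([(0, 0), (0, 1), (10, 0), (10, 1)], batch_size=2)
print([len(b) for b in batches])  # [2, 2]
```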

PipelineConfig

Bases: BaseModel

Top-level pipeline configuration loaded from a YAML file.

This is the root model that :func:load_config deserializes. It composes all sub-configs and is consumed by every pipeline stage.

ATTRIBUTE DESCRIPTION
target_fabric

Polygon mesh to parameterize.

TYPE: TargetFabricConfig

domain

Optional spatial subsetting. When None, the full fabric extent is used.

TYPE: DomainConfig or None

datasets

Datasets organized by category (e.g., "topography", "soils"). Category keys must be members of :data:~hydro_param.dataset_registry.VALID_CATEGORIES.

TYPE: dict[str, list[DatasetRequest]]

output

Output location and format.

TYPE: OutputConfig

processing

Batching, fault-tolerance, and networking settings.

TYPE: ProcessingConfig

See Also

load_config : Load and validate a YAML file into this model.
hydro_param.pipeline.run_pipeline : Execute the pipeline from a config path.
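Putting the sub-configs together, a minimal raw payload in the shape PipelineConfig expects (i.e., what yaml.safe_load would produce from the config file) might look like this; the dataset name and paths are illustrative:

```python
# Hypothetical raw config payload mirroring the attribute layout above.
raw = {
    "target_fabric": {"path": "fabrics/nhm_hrus.gpkg", "id_field": "nhm_id"},
    "domain": {"type": "bbox", "bbox": [-105.5, 39.5, -104.5, 40.5]},
    "datasets": {
        "topography": [
            {"name": "dem_3dep_10m", "variables": ["elevation"], "statistics": ["mean"]},
        ],
    },
    "output": {"path": "./output", "format": "netcdf"},
    "processing": {"batch_size": 500},
}
print(sorted(raw))  # top-level keys match the attributes documented above
```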

flatten_datasets

flatten_datasets() -> list[DatasetRequest]

Flatten themed dataset dict into a single list for pipeline stages.

Bridge the category-keyed config format to pipeline stages that expect a flat iterable of dataset requests. This allows pipeline internals to remain agnostic to the themed grouping while the config YAML stays organized by domain category.

RETURNS DESCRIPTION
list[DatasetRequest]

All dataset requests from all categories, preserving order within each category.

Notes

Dict insertion order (guaranteed since Python 3.7) preserves intra-category order. Cross-category order follows YAML key order but is not semantically meaningful -- pipeline stages process each dataset independently.

Source code in src/hydro_param/config.py
def flatten_datasets(self) -> list[DatasetRequest]:
    """Flatten themed dataset dict into a single list for pipeline stages.

    Bridge the category-keyed config format to pipeline stages that expect
    a flat iterable of dataset requests.  This allows pipeline internals to
    remain agnostic to the themed grouping while the config YAML stays
    organized by domain category.

    Returns
    -------
    list[DatasetRequest]
        All dataset requests from all categories, preserving order
        within each category.

    Notes
    -----
    Dict insertion order (guaranteed since Python 3.7) preserves
    intra-category order.  Cross-category order follows YAML key order
    but is not semantically meaningful -- pipeline stages process each
    dataset independently.
    """
    return [ds for ds_list in self.datasets.values() for ds in ds_list]
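The order-preservation guarantee can be seen with plain lists standing in for DatasetRequest objects (dict insertion order drives cross-category order):

```python
# Plain strings stand in for DatasetRequest objects; the flattening
# comprehension is the same one used in flatten_datasets above.
datasets = {
    "topography": ["dem_3dep_10m"],
    "soils": ["soils_a", "soils_b"],
}
flat = [ds for ds_list in datasets.values() for ds in ds_list]
print(flat)  # ['dem_3dep_10m', 'soils_a', 'soils_b']
```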

load_config

load_config(path: str | Path) -> PipelineConfig

Load and validate a pipeline YAML config file.

Parse the YAML file at path and return a fully validated :class:PipelineConfig. Pydantic model validators run during construction, so any schema violations raise immediately with descriptive error messages.

After validation, all relative paths (target_fabric.path, output.path, per-dataset source) are resolved to absolute paths using the current working directory. This ensures that downstream operations (manifest save/load, file existence checks) work consistently regardless of internal path manipulation.

PARAMETER DESCRIPTION
path

Path to a YAML pipeline configuration file.

TYPE: str or Path

RETURNS DESCRIPTION
PipelineConfig

Validated pipeline configuration with all paths resolved to absolute paths, ready for :func:~hydro_param.pipeline.run_pipeline_from_config.

RAISES DESCRIPTION
FileNotFoundError

If path does not exist.

yaml.YAMLError

If the file is not valid YAML.

pydantic.ValidationError

If the YAML content does not match the config schema.

Notes

Relative paths in the YAML are interpreted relative to the current working directory (the standard convention when running hydro-param run configs/pipeline.yml from the project root). Absolute paths are left unchanged.
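The path-resolution rule reduces to the following sketch (the pipeline's internal _resolve_paths helper applies this to target_fabric.path, output.path, and per-dataset source; this function is only an illustration):

```python
from pathlib import Path


# Documented rule: relative paths are resolved against the current
# working directory; absolute paths pass through unchanged.
def resolve(p: str) -> Path:
    path = Path(p)
    return path if path.is_absolute() else Path.cwd() / path


print(resolve("output").is_absolute())  # True
```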

Source code in src/hydro_param/config.py
def load_config(path: str | Path) -> PipelineConfig:
    """Load and validate a pipeline YAML config file.

    Parse the YAML file at *path* and return a fully validated
    :class:`PipelineConfig`.  Pydantic model validators run during
    construction, so any schema violations raise immediately with
    descriptive error messages.

    After validation, all relative paths (``target_fabric.path``,
    ``output.path``, per-dataset ``source``) are resolved to absolute
    paths using the current working directory.  This ensures that
    downstream operations (manifest save/load, file existence checks)
    work consistently regardless of internal path manipulation.

    Parameters
    ----------
    path : str or Path
        Path to a YAML pipeline configuration file.

    Returns
    -------
    PipelineConfig
        Validated pipeline configuration with all paths resolved to
        absolute paths, ready for
        :func:`~hydro_param.pipeline.run_pipeline_from_config`.

    Raises
    ------
    FileNotFoundError
        If *path* does not exist.
    yaml.YAMLError
        If the file is not valid YAML.
    pydantic.ValidationError
        If the YAML content does not match the config schema.

    Notes
    -----
    Relative paths in the YAML are interpreted relative to the current
    working directory (the standard convention when running
    ``hydro-param run configs/pipeline.yml`` from the project root).
    Absolute paths are left unchanged.
    """
    with open(path) as f:
        raw = yaml.safe_load(f)
    config = PipelineConfig(**raw)
    return _resolve_paths(config)