config

Pipeline configuration: Pydantic models and YAML loader.

Define the declarative configuration schema for the hydro-param pipeline, matching design.md section 11.6. Configs express what to compute (target fabric, datasets, statistics, output format) but never how -- all processing logic lives in Python code, not in YAML.

The schema is validated at load time by Pydantic v2 so that invalid configs fail fast with clear error messages before any data is fetched.

See Also

hydro_param.pipeline : Orchestrator that consumes these config objects.
hydro_param.dataset_registry : Registry that resolves dataset names referenced in :class:DatasetRequest.

TargetFabricConfig

Bases: BaseModel

Specify the target polygon fabric to parameterize.

The target fabric is the spatial mesh (catchments, HRUs, grid cells) whose features receive zonal statistics from source datasets. The fabric must be a pre-existing geospatial file -- hydro-param does not fetch or subset fabrics (use pynhd/pygeohydro upstream).

ATTRIBUTE DESCRIPTION
path

Path to the fabric file (GeoPackage, GeoParquet, or Shapefile).

TYPE: Path

id_field

Column name containing unique feature identifiers. This becomes the index/dimension name in all output files and the SIR xarray Dataset.

TYPE: str

crs

Coordinate reference system of the fabric file as an EPSG string. Defaults to "EPSG:4326" (WGS 84).

TYPE: str

Notes

The id_field propagates through the entire pipeline: it controls the xarray dimension name in the SIR, the CSV index column, and the feature matching in the pywatershed derivation plugin. Typical values are "nhm_id" (pywatershed/NHM), "featureid" (NHDPlus), or "hru_id" (custom fabrics).
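A minimal sketch of the documented fields, using a plain dataclass as a stand-in for the actual Pydantic v2 model (the file path and id value below are illustrative, not from the source):

```python
from dataclasses import dataclass
from pathlib import Path


# Stand-in for TargetFabricConfig; field names and the crs default
# follow the documentation above.
@dataclass
class TargetFabricSketch:
    path: Path
    id_field: str
    crs: str = "EPSG:4326"  # documented default (WGS 84)


# Hypothetical fabric file and id column, typical for a pywatershed/NHM setup.
fabric = TargetFabricSketch(path=Path("fabrics/nhm_hrus.gpkg"), id_field="nhm_id")
print(fabric.crs)  # default applies when crs is omitted
```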

DomainConfig

Bases: BaseModel

Define the spatial domain that restricts which fabric features are processed.

When a domain is configured, stage 1 clips the target fabric to the specified extent before any data fetching or zonal statistics. When omitted, the full fabric extent is used.

Only type="bbox" is currently implemented; HUC and gage-based subsetting are planned.

ATTRIBUTE DESCRIPTION
type

Domain specification method.

TYPE: {'bbox', 'huc2', 'huc4', 'gage'}

bbox

Bounding box as [west, south, east, north] in EPSG:4326 (degrees). Required when type="bbox".

TYPE: list[float] or None

id

Identifier for HUC or gage-based domains (e.g., HUC-2 code or USGS gage ID). Required when type is "huc2", "huc4", or "gage".

TYPE: str or None

RAISES DESCRIPTION
ValueError

If the required field for the chosen type is missing.

DatasetRequest

Bases: BaseModel

Request a dataset and its variables for pipeline processing.

Each entry in a category list under the datasets: mapping of a pipeline YAML config becomes one DatasetRequest. The name is resolved against the dataset registry to obtain the fetch strategy, STAC collection, CRS, and variable metadata.

ATTRIBUTE DESCRIPTION
name

Dataset name as it appears in the registry (e.g., "dem_3dep_10m").

TYPE: str

source

Local file path override for local_tiff datasets. When set, this takes precedence over the registry-level source field.

TYPE: Path or None

variables

Variable names to extract (e.g., ["elevation", "slope"]). Empty list means no variables requested (unusual but valid).

TYPE: list[str]

statistics

Zonal statistics to compute for each variable. Defaults to ["mean"]. Common values: "mean", "majority", "minority", "sum", "min", "max", "median".

TYPE: list[str]

year

Year(s) for multi-year static datasets (e.g., NLCD on OSN). When a list is provided, the pipeline iterates over each year and produces year-suffixed output keys (e.g., "land_cover_2019"). Valid range: 1900--2100.

TYPE: int or list[int] or None

time_period

[start, end] ISO date strings ("YYYY-MM-DD") for temporal datasets (e.g., gridMET, SNODAS). Required when the registry marks the dataset as temporal: true.

TYPE: list[str] or None

RAISES DESCRIPTION
ValueError

If year list is empty, a year is outside 1900--2100, or time_period dates are invalid or out of order.

See Also

hydro_param.dataset_registry.DatasetEntry : Registry metadata resolved from name.

OutputConfig

Bases: BaseModel

Configure pipeline output location and format.

ATTRIBUTE DESCRIPTION
path

Directory for output files. Created automatically if it does not exist. Subdirectories are created per dataset category (e.g., topography/, soils/). Defaults to "./output".

TYPE: Path

format

File format for temporal output. Static per-variable files are always written as CSV. Defaults to "netcdf".

TYPE: {'netcdf', 'parquet'}

sir_name

Human-readable name for the output, used in CF-1.8 metadata attributes and log messages. Defaults to "result".

TYPE: str

ProcessingConfig

Bases: BaseModel

Control batching, fault tolerance, and networking.

ATTRIBUTE DESCRIPTION
batch_size

Maximum number of features per spatial batch. KD-tree recursive bisection groups nearby features to minimize data fetch extent. Must be > 0. Defaults to 500.

TYPE: int

resume

When True, skip datasets whose outputs are already current (checked via the pipeline manifest fingerprint). Defaults to False.

TYPE: bool

sir_validation

SIR validation mode for stage 5. "strict" raises on any validation warning; "tolerant" logs warnings and continues. Defaults to "tolerant".

TYPE: {'tolerant', 'strict'}

network_timeout

Timeout in seconds for GDAL HTTP operations (COG/vsicurl access). Applied to both GDAL_HTTP_TIMEOUT and GDAL_HTTP_CONNECTTIMEOUT environment variables. Must be > 0. Defaults to 120.

TYPE: int
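The batching idea behind batch_size can be illustrated with a minimal recursive-bisection sketch over feature centroids; the pipeline's actual KD-tree implementation may differ in details such as the split rule:

```python
Point = tuple[float, float]


# Recursively split centroids along the wider axis until every batch
# holds at most batch_size features, keeping nearby features together
# so each batch covers a compact fetch extent.
def bisect_batches(points: list[Point], batch_size: int) -> list[list[Point]]:
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    if len(points) <= batch_size:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return bisect_batches(ordered[:mid], batch_size) + bisect_batches(ordered[mid:], batch_size)


# Two spatial clusters split cleanly into two batches of two.
batches = bisect_batches([(0, 0), (0, 1), (10, 0), (10, 1)], batch_size=2)
print([len(b) for b in batches])  # [2, 2]
```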

PipelineConfig

Bases: BaseModel

Top-level pipeline configuration loaded from a YAML file.

This is the root model that :func:load_config deserializes. It composes all sub-configs and is consumed by every pipeline stage.

ATTRIBUTE DESCRIPTION
target_fabric

Polygon mesh to parameterize.

TYPE: TargetFabricConfig

domain

Optional spatial subsetting. When None, the full fabric extent is used.

TYPE: DomainConfig or None

datasets

Datasets organized by category (e.g., "topography", "soils"). Category keys must be members of :data:~hydro_param.dataset_registry.VALID_CATEGORIES.

TYPE: dict[str, list[DatasetRequest]]

output

Output location and format.

TYPE: OutputConfig

processing

Batching, fault-tolerance, and networking settings.

TYPE: ProcessingConfig

See Also

load_config : Load and validate a YAML file into this model.
hydro_param.pipeline.run_pipeline : Execute the pipeline from a config path.
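Putting the sub-configs together, a minimal raw payload in the shape PipelineConfig expects (i.e., what yaml.safe_load would produce from the config file) might look like this; the dataset name and paths are illustrative:

```python
# Hypothetical raw config payload mirroring the attribute layout above.
raw = {
    "target_fabric": {"path": "fabrics/nhm_hrus.gpkg", "id_field": "nhm_id"},
    "domain": {"type": "bbox", "bbox": [-105.5, 39.5, -104.5, 40.5]},
    "datasets": {
        "topography": [
            {"name": "dem_3dep_10m", "variables": ["elevation"], "statistics": ["mean"]},
        ],
    },
    "output": {"path": "./output", "format": "netcdf"},
    "processing": {"batch_size": 500},
}
print(sorted(raw))  # top-level keys match the attributes documented above
```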

flatten_datasets

flatten_datasets() -> list[DatasetRequest]

Flatten themed dataset dict into a single list for pipeline stages.

Bridge the category-keyed config format to pipeline stages that expect a flat iterable of dataset requests. This allows pipeline internals to remain agnostic to the themed grouping while the config YAML stays organized by domain category.

RETURNS DESCRIPTION
list[DatasetRequest]

All dataset requests from all categories, preserving order within each category.

Notes

Dict insertion order (guaranteed since Python 3.7) preserves intra-category order. Cross-category order follows YAML key order but is not semantically meaningful -- pipeline stages process each dataset independently.

Source code in src/hydro_param/config.py
def flatten_datasets(self) -> list[DatasetRequest]:
    """Flatten themed dataset dict into a single list for pipeline stages.

    Bridge the category-keyed config format to pipeline stages that expect
    a flat iterable of dataset requests.  This allows pipeline internals to
    remain agnostic to the themed grouping while the config YAML stays
    organized by domain category.

    Returns
    -------
    list[DatasetRequest]
        All dataset requests from all categories, preserving order
        within each category.

    Notes
    -----
    Dict insertion order (guaranteed since Python 3.7) preserves
    intra-category order.  Cross-category order follows YAML key order
    but is not semantically meaningful -- pipeline stages process each
    dataset independently.
    """
    return [ds for ds_list in self.datasets.values() for ds in ds_list]
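The order-preservation guarantee can be seen with plain lists standing in for DatasetRequest objects (dict insertion order drives cross-category order):

```python
# Plain strings stand in for DatasetRequest objects; the flattening
# comprehension is the same one used in flatten_datasets above.
datasets = {
    "topography": ["dem_3dep_10m"],
    "soils": ["soils_a", "soils_b"],
}
flat = [ds for ds_list in datasets.values() for ds in ds_list]
print(flat)  # ['dem_3dep_10m', 'soils_a', 'soils_b']
```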

load_config

load_config(path: str | Path) -> PipelineConfig

Load and validate a pipeline YAML config file.

Parse the YAML file at path and return a fully validated :class:PipelineConfig. Pydantic model validators run during construction, so any schema violations raise immediately with descriptive error messages.

After validation, all relative paths (target_fabric.path, output.path, per-dataset source) are resolved to absolute paths using the current working directory. This ensures that downstream operations (manifest save/load, file existence checks) work consistently regardless of internal path manipulation.

PARAMETER DESCRIPTION
path

Path to a YAML pipeline configuration file.

TYPE: str or Path

RETURNS DESCRIPTION
PipelineConfig

Validated pipeline configuration with all paths resolved to absolute paths, ready for :func:~hydro_param.pipeline.run_pipeline_from_config.

RAISES DESCRIPTION
FileNotFoundError

If path does not exist.

yaml.YAMLError

If the file is not valid YAML.

pydantic.ValidationError

If the YAML content does not match the config schema.

Notes

Relative paths in the YAML are interpreted relative to the current working directory (the standard convention when running hydro-param run configs/pipeline.yml from the project root). Absolute paths are left unchanged.
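The path-resolution rule reduces to the following sketch (the pipeline's internal _resolve_paths helper applies this to target_fabric.path, output.path, and per-dataset source; this function is only an illustration):

```python
from pathlib import Path


# Documented rule: relative paths are resolved against the current
# working directory; absolute paths pass through unchanged.
def resolve(p: str) -> Path:
    path = Path(p)
    return path if path.is_absolute() else Path.cwd() / path


print(resolve("output").is_absolute())  # True
```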

Source code in src/hydro_param/config.py
def load_config(path: str | Path) -> PipelineConfig:
    """Load and validate a pipeline YAML config file.

    Parse the YAML file at *path* and return a fully validated
    :class:`PipelineConfig`.  Pydantic model validators run during
    construction, so any schema violations raise immediately with
    descriptive error messages.

    After validation, all relative paths (``target_fabric.path``,
    ``output.path``, per-dataset ``source``) are resolved to absolute
    paths using the current working directory.  This ensures that
    downstream operations (manifest save/load, file existence checks)
    work consistently regardless of internal path manipulation.

    Parameters
    ----------
    path : str or Path
        Path to a YAML pipeline configuration file.

    Returns
    -------
    PipelineConfig
        Validated pipeline configuration with all paths resolved to
        absolute paths, ready for
        :func:`~hydro_param.pipeline.run_pipeline_from_config`.

    Raises
    ------
    FileNotFoundError
        If *path* does not exist.
    yaml.YAMLError
        If the file is not valid YAML.
    pydantic.ValidationError
        If the YAML content does not match the config schema.

    Notes
    -----
    Relative paths in the YAML are interpreted relative to the current
    working directory (the standard convention when running
    ``hydro-param run configs/pipeline.yml`` from the project root).
    Absolute paths are left unchanged.
    """
    with open(path) as f:
        raw = yaml.safe_load(f)
    config = PipelineConfig(**raw)
    return _resolve_paths(config)