Configuration¶

hydro-param uses two YAML configuration files that correspond to its two-phase architecture:

Pipeline config (Phase 1) --- Declares what datasets to fetch and what zonal statistics to compute. Produces a Standardized Internal Representation (SIR) on disk.
pywatershed run config (Phase 2) --- Maps SIR output to model-specific parameters, forcing files, and control settings for pywatershed.

Both configs are declarative only: they say what to compute, not how. No variables, conditionals, or templating. All logic lives in Python.

Config validation

Both configs are validated at load time by Pydantic v2. Invalid configs fail fast with clear error messages before any data is fetched or parameters are derived.

Pipeline config (Phase 1)¶

The pipeline config drives the hydro-param run command. It has five top-level sections:

Section	Required	Purpose
`target_fabric`	Yes	Polygon mesh to parameterize
`datasets`	Yes	Datasets, variables, and statistics to compute
`output`	No	Output directory, format, and name
`processing`	No	Batch size, resume, timeout
`domain`	No	Spatial subsetting (bbox, HUC, gage)

`target_fabric`¶

The spatial mesh whose features receive zonal statistics. Must be a pre-existing geospatial file --- hydro-param does not fetch or subset fabrics (use pynhd upstream).

Field	Type	Default	Description
`path`	string	required	Path to fabric file (GeoPackage or GeoParquet)
`id_field`	string	required	Column with unique feature IDs. Becomes the xarray dimension name in all output.
`crs`	string	`"EPSG:4326"`	Coordinate reference system as an EPSG string

Choosing id_field

The id_field propagates through the entire pipeline: it controls the xarray dimension name in the SIR, the CSV index column, and feature matching in the pywatershed derivation plugin. Common values:

nhm_id --- pywatershed / National Hydrologic Model
featureid --- NHDPlus
hru_id --- custom fabrics

`datasets`¶

Datasets are organized by category. Each category key must be one of the valid registry categories:

climate, geology, hydrography, land_cover, snow, soils, topography, water_bodies

Each category contains a list of dataset requests:

Field	Type	Default	Description
`name`	string	required	Dataset name from the registry (e.g., `dem_3dep_10m`)
`source`	string	`null`	Local file path override for `local_tiff` datasets
`variables`	list	`[]`	Variable names to extract
`statistics`	list	`["mean"]`	Zonal statistics: `mean`, `categorical`, `majority`, `sum`, `min`, `max`, `median`
`year`	int or list	`null`	Year(s) for multi-year static datasets (e.g., NLCD). Range: 1900--2100.
`time_period`	list	`null`	`[start, end]` ISO dates for temporal datasets

Static vs. temporal datasets

Static datasets (topography, soils, land cover) use year for multi-epoch data. The pipeline produces year-suffixed output keys (e.g., lndcov_frac_2021).
Temporal datasets (climate, snow) use time_period with ISO date strings. The pipeline splits processing by calendar year automatically.

`output`¶

Field	Type	Default	Description
`path`	string	`"./output"`	Output directory (created automatically)
`format`	string	`"netcdf"`	Format for temporal output: `netcdf` or `parquet`
`sir_name`	string	`"result"`	Name used in CF-1.8 metadata and log messages

`processing`¶

Field	Type	Default	Description
`batch_size`	int	`500`	Max features per spatial batch. KD-tree bisection groups nearby features.
`resume`	bool	`false`	Skip datasets whose outputs already exist (checked via manifest)
`sir_validation`	string	`"tolerant"`	`"tolerant"` logs warnings; `"strict"` raises on any issue
`network_timeout`	int	`120`	GDAL HTTP timeout in seconds for COG/vsicurl access

`domain` (optional)¶

When present, clips the target fabric to the specified extent before processing. When omitted, the full fabric extent is used.

Field	Type	Default	Description
`type`	string	required	`bbox`, `huc2`, `huc4`, or `gage`
`bbox`	list	`null`	`[west, south, east, north]` in EPSG:4326. Required when `type: bbox`.
`id`	string	`null`	HUC code or gage ID. Required when type is `huc2`, `huc4`, or `gage`.

Complete example¶

This is a production pipeline config for the Delaware River Basin, fetching topography, soils, land cover, snow, and climate data over a 2-year period:

configs/examples/drb_2yr_pipeline.yml

# DRB 2-year pipeline config: pywatershed data requirements
# Produces a SIR with topographic, soils, land cover, snow, and climate data.
#
# Usage:
#   hydro-param run configs/examples/drb_2yr_pipeline.yml

target_fabric:
  path: data/pywatershed_gis/drb_2yr/nhru.gpkg  # (1)!
  id_field: nhm_id  # (2)!

datasets:
  topography:
    # 3DEP 10m DEM — elevation, slope, sin/cos aspect for circular mean
    - name: dem_3dep_10m
      variables: [elevation, slope, sin_aspect, cos_aspect]
      statistics: [mean]

  soils:
    # gNATSGO pre-summarized soil properties (Planetary Computer)
    - name: gnatsgo_rasters
      variables: [aws0_100, rootznemc, rootznaws]
      statistics: [mean]

    # POLARIS soil texture properties, 30m (remote VRT, no download needed)
    - name: polaris_30m
      variables: [sand, silt, clay, theta_s, ksat, soil_texture]
      statistics: [mean]

  land_cover:
    # NLCD Land Cover via NHGF STAC OSN — categorical fractions (multi-year)
    - name: nlcd_osn_lndcov
      variables: [LndCov]
      statistics: [categorical]  # (3)!
      year: [2020, 2021]  # (4)!

    # NLCD Fractional Impervious via NHGF STAC OSN
    - name: nlcd_osn_fctimp
      variables: [FctImp]
      statistics: [mean]
      year: [2020, 2021]

  snow:
    # SNODAS daily snow — historical SWE for snarea_thresh calibration seed
    - name: snodas
      variables: [SWE]
      statistics: [mean]
      time_period: ["2020-01-01", "2021-12-31"]  # (5)!

  climate:
    # gridMET daily climate (OPeNDAP via ClimateR catalog)
    - name: gridmet
      variables: [pr, tmmx, tmmn, srad, pet, vs]
      statistics: [mean]
      time_period: ["2020-01-01", "2021-12-31"]

output:
  path: output
  format: netcdf
  sir_name: drb_2yr_sir

processing:
  batch_size: 240  # (6)!
  resume: true  # (7)!

Path to the HRU fabric GeoPackage. Must exist before running.
The nhm_id column becomes the xarray dimension in all SIR output files.
categorical produces per-class fraction columns (e.g., lndcov_frac_11, lndcov_frac_21, ...) instead of a single summary statistic.
Multi-year requests produce year-suffixed output (e.g., lndcov_frac_2021). The most recent year is used during Phase 2 derivation.
Temporal datasets require time_period with ISO date strings. The pipeline splits processing by calendar year automatically.
Smaller batch sizes reduce peak memory. 240 works well for DRB's 765 HRUs.
resume: true skips datasets whose manifest fingerprints are current, allowing interrupted runs to continue where they left off.

pywatershed run config (Phase 2)¶

The pywatershed run config drives the hydro-param pywatershed run command. It consumes SIR output from Phase 1 and produces model-ready parameter files, forcing time series, solar tables, and a control file.

This config creates a consumer-oriented contract between the generic pipeline and the model-specific derivation plugin. Each parameter entry declares exactly which SIR dataset and variable provides its source data.

Section	Required	Purpose
`target_model`	No	Fixed to `"pywatershed"`
`version`	No	Schema version (currently `"4.0"`)
`sir_path`	No	Path to Phase 1 output directory
`domain`	Yes	Fabric and segment file paths
`time`	Yes	Simulation time period
`static_datasets`	No	Maps SIR data to static parameters
`forcing`	No	Maps SIR data to forcing time series
`climate_normals`	No	Maps SIR data to climate-derived parameters
`parameter_overrides`	No	Manual parameter value overrides
`calibration`	No	Calibration seed generation options
`output`	No	Output directory and file layout

Extra fields are rejected

All models in the pywatershed config use extra="forbid". Typos or unknown fields produce immediate validation errors rather than being silently ignored.

`sir_path`¶

Path to the Phase 1 pipeline output directory containing the .manifest.yml and SIR data files. Relative paths are resolved against the config file's parent directory.

`domain`¶

Points to pre-existing fabric and segment files. hydro-param does not fetch or subset fabrics.

Field	Type	Default	Description
`fabric_path`	string	required	Path to HRU fabric (GeoPackage or GeoParquet)
`segment_path`	string	`null`	Path to segment/flowline file for routing topology
`waterbody_path`	string	`null`	Path to NHDPlus waterbody polygons for depression storage overlay
`id_field`	string	`"nhm_id"`	Feature ID column in the fabric
`segment_id_field`	string	`"nhm_seg"`	Segment ID column in the segment file

`time`¶

Field	Type	Default	Description
`start`	string	required	Start date in ISO format (e.g., `"2020-01-01"`)
`end`	string	required	End date in ISO format (e.g., `"2021-12-31"`)
`timestep`	string	`"daily"`	Temporal resolution. Only `daily` is supported.

`static_datasets`¶

Groups static parameter declarations by category. Each parameter entry maps a pywatershed parameter name to the SIR dataset that provides its source data.

The five categories are: topography, soils, landcover, snow, and waterbodies. Each category has an available list declaring which registry datasets were used, plus named parameter fields.

Parameter entry fields:

Field	Type	Default	Description
`source`	string	required	Registry dataset name (e.g., `dem_3dep_10m`)
`variable`	string	`null`	Single source variable name
`variables`	list	`null`	Multiple source variables (use one of `variable` or `variables`, not both)
`statistic`	string	`null`	Zonal statistic applied (`mean`, `categorical`)
`year`	int or list	`null`	NLCD year(s) for multi-epoch land cover
`time_period`	list	`null`	Temporal range for temporal datasets
`description`	string	required	Human-readable description

Available parameters by category

Topography: hru_elev, hru_slope, hru_aspect

Soils: soil_type, sat_threshold, soil_moist_max, soil_rechr_max_frac

Landcover: cov_type, hru_percent_imperv, covden_sum, covden_win, srain_intcp, wrain_intcp, snow_intcp

Snow: hru_deplcrv, snarea_thresh

Waterbodies: hru_type, dprst_frac

`forcing`¶

Temporal forcing time series. The derivation plugin converts forcing from SIR units (metric: mm, degC) to PRMS units (inches, degF) during output formatting.

Field	Type	Description
`available`	list	Temporal-capable datasets in the registry
`prcp`	ParameterEntry	Daily precipitation
`tmax`	ParameterEntry	Daily maximum temperature
`tmin`	ParameterEntry	Daily minimum temperature

`climate_normals`¶

Long-term climate statistics for derived parameters. Can use the same source as forcing or a different one (e.g., forcing from CONUS404-BA but normals from gridMET).

Field	Type	Description
`available`	list	Temporal-capable datasets in the registry
`jh_coef`	ParameterEntry	Jensen-Haise PET coefficient (monthly, from tmax/tmin normals)
`transp_beg`	ParameterEntry	Month transpiration begins (from tmin threshold)
`transp_end`	ParameterEntry	Month transpiration ends (from tmin threshold)

`calibration`¶

Field	Type	Default	Description
`generate_seeds`	bool	`true`	Whether to generate calibration seed values
`seed_method`	string	`"physically_based"`	`"physically_based"` or `"all_defaults"`
`preserve_from_existing`	list	`[]`	Parameter names to keep from an existing file

`parameter_overrides`¶

Field	Type	Default	Description
`values`	dict	`{}`	Parameter name to scalar or per-HRU value. Scalars broadcast to all HRUs.
`from_file`	string	`null`	Path to NetCDF/CSV with override values (not yet implemented)

`output`¶

Field	Type	Default	Description
`path`	string	`"./output"`	Root output directory
`format`	string	`"netcdf"`	`"netcdf"` (CF-1.8) or `"prms_text"` (not yet implemented)
`parameter_file`	string	`"parameters.nc"`	Static parameter filename
`forcing_dir`	string	`"forcing"`	Subdirectory for forcing files
`control_file`	string	`"control.yml"`	Simulation control filename
`soltab_file`	string	`"soltab.nc"`	Solar radiation table filename

Complete example¶

configs/examples/drb_2yr_pywatershed.yml

# DRB 2-year pywatershed model setup config (v4.0)
# Generates parameters.nc, forcing/, and control.yml for pywatershed.
#
# Usage:
#   hydro-param pywatershed run configs/examples/drb_2yr_pywatershed.yml

target_model: pywatershed
version: "4.0"

sir_path: "../../output"  # (1)!

domain:
  fabric_path: data/pywatershed_gis/drb_2yr/nhru.gpkg
  segment_path: data/pywatershed_gis/drb_2yr/nsegment.gpkg
  id_field: nhm_id
  segment_id_field: nhm_seg

time:
  start: "2020-01-01"
  end: "2021-12-31"
  timestep: daily

static_datasets:

  topography:
    available: [dem_3dep_10m]
    hru_elev:
      source: dem_3dep_10m
      variable: elevation
      statistic: mean
      description: "Mean HRU elevation"
    hru_slope:
      source: dem_3dep_10m
      variable: slope
      statistic: mean
      description: "Mean land surface slope"

  soils:
    available: [polaris_30m, gnatsgo_rasters]
    soil_type:  # (2)!
      source: polaris_30m
      variables: [sand, silt, clay]
      statistic: mean
      description: "Soil type classification (1=sand, 2=loam, 3=clay)"
    sat_threshold:
      source: polaris_30m
      variable: theta_s
      statistic: mean
      description: "Gravity reservoir storage capacity (from porosity)"
    soil_moist_max:
      source: gnatsgo_rasters
      variable: aws0_100
      statistic: mean
      description: "Maximum available water-holding capacity"
    soil_rechr_max_frac:
      source: gnatsgo_rasters
      variables: [rootznemc, rootznaws]
      statistic: mean
      description: "Recharge zone storage as fraction of soil_moist_max"

  landcover:
    available: [nlcd_osn_lndcov, nlcd_osn_fctimp]
    cov_type:  # (3)!
      source: nlcd_osn_lndcov
      variable: LndCov
      statistic: categorical
      year: [2021]
      description: "Vegetation cover type (0=bare, 1=grasses, 2=shrubs, 3=trees, 4=coniferous)"
    hru_percent_imperv:
      source: nlcd_osn_fctimp
      variable: FctImp
      statistic: mean
      year: [2021]
      description: "Impervious surface fraction"

  snow:
    available: [snodas]
    snarea_thresh:
      source: snodas
      variable: SWE
      statistic: mean
      time_period: ["2020-01-01", "2021-12-31"]
      description: "Snow depletion threshold (calibration seed from historical max SWE)"

  waterbodies:
    available: []  # (4)!

forcing:
  available: [gridmet]
  prcp:  # (5)!
    source: gridmet
    variable: pr
    statistic: mean
    description: "Daily precipitation"
  tmax:
    source: gridmet
    variable: tmmx
    statistic: mean
    description: "Daily maximum temperature"
  tmin:
    source: gridmet
    variable: tmmn
    statistic: mean
    description: "Daily minimum temperature"

climate_normals:  # (6)!
  available: [gridmet]
  jh_coef:
    source: gridmet
    variables: [tmmx, tmmn]
    description: "Jensen-Haise PET coefficient (monthly, from tmax/tmin normals)"
  transp_beg:
    source: gridmet
    variable: tmmn
    description: "Month transpiration begins (from monthly mean tmin threshold)"
  transp_end:
    source: gridmet
    variable: tmmn
    description: "Month transpiration ends (from monthly mean tmin threshold)"

calibration:
  generate_seeds: true
  seed_method: physically_based
  preserve_from_existing: []

parameter_overrides:
  values: {}

output:
  path: models/pywatershed
  format: netcdf
  parameter_file: parameters.nc
  forcing_dir: forcing
  control_file: control.yml
  soltab_file: soltab.nc

Relative to the config file's parent directory. Points to the Phase 1 pipeline output containing .manifest.yml and SIR data files.
soil_type uses variables (plural) because classification requires sand, silt, and clay percentages together.
cov_type uses statistic: categorical because vegetation type is derived from NLCD class fractions via grouped majority, not a continuous mean.
No waterbody dataset is used here. Depression storage and HRU type default to zero when waterbody_path is not provided.
Forcing entries map SIR variable names (metric: pr in mm, tmmx/tmmn in degC) to PRMS parameters (imperial: inches, degF). Unit conversion happens automatically during derivation.
climate_normals can use a different source than forcing. For example, forcing from CONUS404-BA but normals from gridMET for better long-term temperature statistics.

Dataset registry¶

hydro-param ships a bundled dataset registry containing metadata for all supported datasets. The registry defines fetch strategies, STAC endpoints, variable names, CRS, and resolution for each dataset.

Bundled datasets¶

The bundled registry is organized by category in src/hydro_param/data/datasets/:

datasets/
  climate.yml         # gridMET, Daymet, PRISM, CONUS404-BA
  geology.yml         # (reserved)
  hydrography.yml     # GFv1.1 rasters
  land_cover.yml      # NLCD (6 OSN collections)
  snow.yml            # SNODAS
  soils.yml           # gNATSGO, POLARIS
  topography.yml      # 3DEP DEM
  water_bodies.yml    # (reserved)

List all available datasets with:

$ hydro-param datasets list

Get detailed info for a specific dataset:

$ hydro-param datasets info dem_3dep_10m

User-local overlay¶

You can extend the registry without modifying the package by placing YAML files in ~/.hydro-param/datasets/. Files in this directory are merged on top of the bundled registry at load time, allowing you to:

Add custom local datasets
Override fetch strategies for existing datasets
Register new STAC collections

The overlay follows the same schema as the bundled registry files. For example, the hydro-param datasets download gfv11 command automatically registers GFv1.1 rasters via a user-local overlay file.

Overlay precedence

User-local entries override bundled entries with the same dataset name. This lets you point a dataset at a local file path instead of a remote service for faster processing or offline use.

Config reference¶

For complete field-level documentation, see the API reference:

Pipeline config (Phase 1):

PipelineConfig --- Top-level pipeline configuration
TargetFabricConfig --- Target polygon mesh
DomainConfig --- Spatial extent
DatasetRequest --- Dataset selection
OutputConfig --- Output format and location
ProcessingConfig --- Batching and fault tolerance

pywatershed run config (Phase 2):

PywatershedRunConfig --- Top-level pywatershed configuration
PwsDomainConfig --- Domain fabric paths
PwsTimeConfig --- Simulation time period
StaticDatasetsConfig --- Static parameter declarations
ForcingConfig --- Forcing time series
ClimateNormalsConfig --- Climate-derived parameters
ParameterEntry --- Individual parameter declaration
PwsOutputConfig --- Output file layout

Configuration¶

Pipeline config (Phase 1)¶

target_fabric¶

datasets¶

output¶

processing¶

domain (optional)¶

Complete example¶

pywatershed run config (Phase 2)¶

sir_path¶

domain¶

time¶

static_datasets¶

forcing¶

climate_normals¶

calibration¶

parameter_overrides¶

output¶

Complete example¶

Dataset registry¶

Bundled datasets¶

User-local overlay¶

Config reference¶

`target_fabric`¶

`datasets`¶

`output`¶

`processing`¶

`domain` (optional)¶

`sir_path`¶

`domain`¶

`time`¶

`static_datasets`¶

`forcing`¶

`climate_normals`¶

`calibration`¶

`parameter_overrides`¶

`output`¶