manifest¶
Manage the pipeline manifest for incremental resume support.
Record which configuration produced each output file so that re-runs
with resume: true can skip datasets whose outputs are already
complete and whose inputs (config, registry entry, processing options)
have not changed.
The manifest is stored as .manifest.yml in the output directory
and tracks per-dataset SHA-256 fingerprints, output file paths, and
completion timestamps. A separate fabric fingerprint detects when
the target fabric file has changed, invalidating all cached results.
Notes
The manifest uses cheap file-metadata proxies (filename, mtime, size) for fabric identity and content-based SHA-256 hashing for dataset configuration identity. This avoids hashing large GeoPackage files while still detecting config changes reliably.
See Also
hydro_param.pipeline : Pipeline orchestrator that reads/writes manifests.
hydro_param.cli.run_cmd : CLI --resume flag that enables manifest use.
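The resume decision ultimately reduces to a fingerprint comparison. A minimal, self-contained sketch of that flow, using plain dicts and a hypothetical `fingerprint` helper in place of the real Pydantic models and library functions:

```python
import hashlib
import json

def fingerprint(payload: dict) -> str:
    # Deterministic SHA-256 over canonical JSON (sorted keys, compact separators).
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

# Illustrative manifest fragment; the real manifest stores Pydantic models.
manifest = {
    "entries": {"soils": {"fingerprint": fingerprint({"vars": ["clay"]})}},
}

def can_skip(ds_name: str, current_fp: str) -> bool:
    # Skip only when the dataset is recorded with an identical fingerprint.
    entry = manifest["entries"].get(ds_name)
    return entry is not None and entry["fingerprint"] == current_fp

assert can_skip("soils", fingerprint({"vars": ["clay"]}))              # unchanged -> skip
assert not can_skip("soils", fingerprint({"vars": ["clay", "sand"]}))  # changed -> rerun
```

The dataset name `soils` and the payload keys are invented for illustration; the actual checks also verify that all output files exist (see `is_dataset_current` below... the sketch omits the filesystem check).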
ManifestEntry¶
Bases: BaseModel
Represent a single dataset's record in the pipeline manifest.
Track the configuration fingerprint, output file paths, and completion timestamp for one dataset. Used to determine whether a dataset can be skipped during resume runs.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `fingerprint` | SHA-256 fingerprint of the dataset request, registry entry, variable specs, and processing config. |
| `static_files` | Mapping of variable/result names to output file paths relative to the output directory. |
| `temporal_files` | Mapping of temporal variable names to output file paths relative to the output directory. |
| `completed_at` | UTC timestamp when processing completed. Defaults to the current UTC time. |
SIRSchemaEntry¶
Bases: TypedDict
Schema metadata for a single SIR variable.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `name` | Canonical SIR variable name. |
| `units` | Physical units of the variable. |
| `statistic` | Zonal statistic used. |
| `source_dataset` | Name of the pipeline dataset that produced this variable. |
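Since the entry is a TypedDict, it can be sketched directly from the attribute table above; the `str` annotations and the example values are assumptions, not taken from the source:

```python
from typing import TypedDict

class SIRSchemaEntry(TypedDict):
    name: str            # canonical SIR variable name
    units: str           # physical units of the variable
    statistic: str       # zonal statistic used
    source_dataset: str  # pipeline dataset that produced this variable

# Hypothetical values for illustration only.
entry: SIRSchemaEntry = {
    "name": "clay_pct",
    "units": "percent",
    "statistic": "mean",
    "source_dataset": "soils",
}
```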
SIRManifestEntry¶
Bases: BaseModel
Track normalized SIR output from stage 5.
Record the file paths, schema metadata, and completion time for the SIR normalization step. Used by Phase 2 (model plugins) to discover what the pipeline produced without re-running it.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `static_files` | Mapping of SIR variable names to file paths relative to the output directory. |
| `temporal_files` | Mapping of temporal dataset keys to file paths relative to the output directory. |
| `sir_schema` | SIR variable schema entries. |
| `completed_at` | UTC timestamp when SIR normalization completed. |
See Also
SIRSchemaEntry : TypedDict defining the schema entry structure.
Notes
This entry is the contract between Phase 1 (pipeline) and Phase 2
(model plugins). SIRAccessor reads these file paths to discover
available SIR variables without re-running the pipeline.
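A Phase 2 consumer only needs to resolve the stored relative paths against the run's output directory. A minimal sketch using a plain dict in place of the model (the keys mirror SIRManifestEntry; the values are made up):

```python
from pathlib import Path

# Illustrative manifest fragment; real entries come from the parsed manifest.
manifest = {
    "sir": {
        "static_files": {"clay_pct": "sir/clay_pct.nc"},
        "temporal_files": {},
    }
}

def sir_paths(manifest: dict, output_dir: Path) -> dict[str, Path]:
    # Resolve the relative SIR file paths against the run's output directory.
    sir = manifest.get("sir") or {}
    files = {**sir.get("static_files", {}), **sir.get("temporal_files", {})}
    return {name: output_dir / rel for name, rel in files.items()}

paths = sir_paths(manifest, Path("/data/run1"))
```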
PipelineManifest¶
Bases: BaseModel
Record what configuration produced each output file.
The manifest is the top-level structure persisted as
.manifest.yml in the output directory. It contains a fabric
fingerprint (to detect fabric changes) and per-dataset entries
(to detect config changes and verify output completeness).
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `version` | Manifest schema version. Must be one of the supported versions. |
| `fabric_fingerprint` | Fingerprint of the target fabric file, as produced by fabric_fingerprint. |
| `entries` | Per-dataset manifest entries, keyed by dataset name. |
| `sir` | SIR output tracking for Phase 2 consumers. |
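An illustrative `.manifest.yml` layout; the field names follow the attributes above, while the version number, fingerprint strings, file paths, and timestamp are hypothetical:

```yaml
version: 1                      # hypothetical schema version
fabric_fingerprint: "fabric.gpkg|1714000000.0|2048576|nhm_id"
entries:
  soils:                        # keyed by dataset name
    fingerprint: "9f86d081..."  # truncated SHA-256, illustrative
    static_files:
      clay_pct: soils/clay_pct.nc
    temporal_files: {}
    completed_at: "2024-05-01T12:00:00+00:00"
sir: null
```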
save¶
Write the manifest atomically to {output_dir}/.manifest.yml.
Write to a temporary file first, then atomically rename. This prevents corrupt manifests from partial writes (e.g., disk-full or interrupted process).
| PARAMETER | DESCRIPTION |
|---|---|
| `output_dir` | Output directory. Created if it does not exist. |
Source code in src/hydro_param/manifest.py
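The write-then-rename pattern described above can be sketched as follows; `save_atomic` is a stand-in for the method, not the library's implementation:

```python
import os
import tempfile
from pathlib import Path

def save_atomic(text: str, output_dir: Path) -> Path:
    # Write to a sibling temp file, then rename over the target in one step,
    # so readers never observe a partially written manifest.
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / ".manifest.yml"
    fd, tmp = tempfile.mkstemp(dir=output_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp, target)  # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)           # clean up the temp file on any failure
        raise
    return target
```

Creating the temp file in the output directory itself (rather than the system temp dir) keeps the rename on one filesystem, which is what makes `os.replace` atomic.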
is_fabric_current¶
Check whether the stored fabric fingerprint matches the expected value.
| PARAMETER | DESCRIPTION |
|---|---|
| `expected_fingerprint` | Fingerprint computed from the current fabric file via fabric_fingerprint. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the stored fabric fingerprint matches the expected value, False otherwise. |
is_dataset_current¶
Check whether a dataset's outputs are still valid for reuse.
A dataset is considered current when all three conditions hold:
- The dataset name exists in the manifest.
- The stored fingerprint matches the computed fingerprint (no config changes).
- All listed output files (static and temporal) exist on disk.
| PARAMETER | DESCRIPTION |
|---|---|
| `ds_name` | Dataset name as it appears in the pipeline config. |
| `fingerprint` | SHA-256 fingerprint computed from the current dataset request, registry entry, variable specs, and processing config via dataset_fingerprint. |
| `output_dir` | Output directory used to resolve relative file paths. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if all three conditions hold and the outputs can be reused, False otherwise. |
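The three-condition check can be sketched as a standalone function over plain dicts (a stand-in for the real method, which operates on ManifestEntry models):

```python
from pathlib import Path

def is_dataset_current_sketch(
    entries: dict, ds_name: str, fingerprint: str, output_dir: Path
) -> bool:
    entry = entries.get(ds_name)
    if entry is None:                        # 1. dataset recorded at all?
        return False
    if entry["fingerprint"] != fingerprint:  # 2. config unchanged?
        return False
    rel_paths = [*entry.get("static_files", {}).values(),
                 *entry.get("temporal_files", {}).values()]
    return all((output_dir / p).exists() for p in rel_paths)  # 3. outputs on disk
```

Checking file existence last means a deleted or partially written output forces reprocessing even when the configuration has not changed.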
load_manifest¶
Load a manifest from disk, returning None if absent or corrupt.
Attempt to read and parse .manifest.yml from the output
directory. Filesystem errors (permissions, I/O) propagate to the
caller for actionable diagnostics. YAML parse errors and Pydantic
validation failures are caught, logged as warnings, and result in
a None return (triggering full reprocessing).
| PARAMETER | DESCRIPTION |
|---|---|
| `output_dir` | Directory containing .manifest.yml. |

| RETURNS | DESCRIPTION |
|---|---|
| `PipelineManifest` or `None` | Loaded manifest, or None if the file is absent or could not be parsed. |

| RAISES | DESCRIPTION |
|---|---|
| `OSError` | If the file exists but cannot be read (permissions, disk errors). |
fabric_fingerprint¶
Compute a fingerprint for the target fabric file.
Return "{filename}|{mtime}|{size}|{id_field}" as a cheap proxy
for content identity without hashing large GeoPackage files.
Including id_field ensures that changing which column is used
as the feature ID (e.g., nhm_id → nhru_v1_1) invalidates
all cached results, since it affects the dimension name in every
output file.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | Pipeline configuration containing the fabric file path and ID field. |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Fingerprint string in the format `{filename}\|{mtime}\|{size}\|{id_field}`. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If the fabric file does not exist at the configured path. |
Notes
The mtime-based fingerprint changes when a file is copied or restored from backup, causing unnecessary reprocessing even if content is identical. This is acceptable for MVP but may warrant content-based hashing for large production workflows.
This function deliberately avoids hashing geometry coordinates, consistent with the project's cache-by-stable-ID principle (see CLAUDE.md architectural decision 7).
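The metadata-proxy idea can be sketched in a few lines; `fabric_fingerprint_sketch` is an illustrative stand-in that takes the path and ID field directly rather than a config object:

```python
from pathlib import Path

def fabric_fingerprint_sketch(fabric_path: Path, id_field: str) -> str:
    # Cheap identity proxy: file metadata plus the feature-ID column name.
    # stat() raises FileNotFoundError if the fabric file is missing.
    st = fabric_path.stat()
    return f"{fabric_path.name}|{st.st_mtime}|{st.st_size}|{id_field}"
```

Because only `stat()` metadata is read, the call is O(1) regardless of how large the GeoPackage is.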
dataset_fingerprint¶

```python
dataset_fingerprint(
    ds_req: DatasetRequest,
    entry: DatasetEntry,
    var_specs: list[AnyVariableSpec],
    processing: ProcessingConfig,
) -> str
```
Compute a SHA-256 fingerprint for a dataset processing request.
Serialize all fields that affect processing output into a canonical JSON representation and hash it with SHA-256. Any change to the dataset request, registry entry metadata, variable specifications, or processing options will produce a different fingerprint, triggering reprocessing on resume.
| PARAMETER | DESCRIPTION |
|---|---|
| `ds_req` | Dataset request from the pipeline config (name, variables, statistics, year, time_period, source override). TYPE: `DatasetRequest` |
| `entry` | Registry entry for the dataset (strategy, source paths, CRS, STAC collection, etc.). TYPE: `DatasetEntry` |
| `var_specs` | Resolved variable specifications (band numbers, categorical flags) and derived variable specifications (source, method). TYPE: `list[AnyVariableSpec]` |
| `processing` | Processing config (batch size). TYPE: `ProcessingConfig` |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | SHA-256 fingerprint of the dataset configuration. |
Notes
Deliberately excluded fields (they do not affect output content): resume, description, download, and year_range (informational only).
The JSON serialization uses sorted keys and compact separators to ensure deterministic output across Python versions.
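The canonical-serialize-then-hash step can be sketched with plain dicts standing in for the Pydantic models; the payload key names here are assumptions, not the library's:

```python
import hashlib
import json

def dataset_fingerprint_sketch(
    ds_req: dict, entry: dict, var_specs: list, processing: dict
) -> str:
    # Everything that affects output content goes into one payload...
    payload = {
        "request": ds_req,
        "registry_entry": entry,
        "variables": var_specs,
        "processing": processing,
    }
    # ...serialized canonically (sorted keys, compact separators) so the
    # hash is identical across runs and Python versions.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Sorting keys makes the hash insensitive to dict insertion order, so only genuine configuration changes produce a new fingerprint.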
make_manifest_entry¶

```python
make_manifest_entry(
    fingerprint: str,
    static_files: dict[str, Path],
    temporal_files: dict[str, Path],
    output_dir: Path,
) -> ManifestEntry
```
Create a ManifestEntry with paths stored relative to the output directory.
Convert absolute file paths to relative paths for portability (the output directory can be moved without invalidating the manifest) and stamp the entry with the current UTC time.
| PARAMETER | DESCRIPTION |
|---|---|
| `fingerprint` | SHA-256 fingerprint for the dataset configuration (from dataset_fingerprint). TYPE: `str` |
| `static_files` | Mapping of variable names to absolute paths for static output files. TYPE: `dict[str, Path]` |
| `temporal_files` | Mapping of variable names to absolute paths for temporal output files. TYPE: `dict[str, Path]` |
| `output_dir` | Root output directory used to compute relative paths. TYPE: `Path` |

| RETURNS | DESCRIPTION |
|---|---|
| `ManifestEntry` | Entry ready for insertion into PipelineManifest.entries. |