sir
SIR normalization: canonical naming, unit conversion, and schema validation.
Normalize raw gdptools output into the Standardized Internal Representation (SIR) -- self-documenting variable names with canonical SI-like units. The SIR is the boundary between the generic pipeline (stages 1-5) and model plugins: plugins consume only SIR data and never see raw gdptools output.
This module handles three concerns:
- Naming: Convert raw source variable names (e.g., "elevation") into canonical SIR names (e.g., "elevation_m_mean").
- Unit conversion: Transform source units to canonical units at the SIR boundary (e.g., log10(cm/hr) to cm/hr, Kelvin to degrees Celsius).
- Validation: Check completeness, NaN coverage, and value ranges.
References
.. [1] docs/plans/2026-02-23-sir-normalization-design.md -- Full design document.
See Also
hydro_param.units : Model-specific unit conversions (SI to imperial).
hydro_param.plugins : Plugin protocols that consume SIR output.
SIRVariableSchema
dataclass
SIRVariableSchema(
    canonical_name: str,
    source_name: str,
    source_units: str,
    canonical_units: str,
    long_name: str,
    categorical: bool,
    valid_range: tuple[float, float] | None,
    conversion: str | None,
    temporal: bool = False,
    dataset_name: str = "",
)
Describe the expected schema for a single SIR output variable.
Each schema entry records the mapping from a raw source variable to its
canonical SIR name, the unit conversion to apply, and validation
constraints. Used by normalize_sir() and validate_sir() to
transform and check pipeline output.
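As an illustration, a schema entry for a mean-elevation column might look like the sketch below. The dataclass is a local stand-in that mirrors the documented signature (real code would import the class from hydro_param.sir), and the long_name text is an assumed value.

```python
from __future__ import annotations

from dataclasses import dataclass


# Local stand-in mirroring the documented SIRVariableSchema signature;
# in real code, import the class from hydro_param.sir instead.
@dataclass
class SIRVariableSchema:
    canonical_name: str
    source_name: str
    source_units: str
    canonical_units: str
    long_name: str
    categorical: bool
    valid_range: tuple[float, float] | None
    conversion: str | None
    temporal: bool = False
    dataset_name: str = ""


# Source and canonical units are both metres, so no conversion is requested.
elevation = SIRVariableSchema(
    canonical_name="elevation_m_mean",
    source_name="elevation",
    source_units="m",
    canonical_units="m",
    long_name="Mean elevation",  # assumed wording, for illustration only
    categorical=False,
    valid_range=None,
    conversion=None,
)
```

The last two fields are left at their documented defaults (temporal=False, dataset_name="").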
| ATTRIBUTE | DESCRIPTION |
|---|---|
| canonical_name | Canonical SIR variable name (e.g., …). TYPE: str |
| source_name | Original variable name from the dataset registry. TYPE: str |
| source_units | Units of the raw source data (e.g., …). TYPE: str |
| canonical_units | Units after SIR normalization (e.g., …). TYPE: str |
| long_name | Human-readable description for metadata / NetCDF attributes. TYPE: str |
| categorical | TYPE: bool |
| valid_range | Expected … TYPE: tuple[float, float] \| None |
| conversion | Conversion type to apply: … TYPE: str \| None |
| temporal | TYPE: bool |
| dataset_name | Pipeline dataset registry key (e.g., …). TYPE: str |
SIRValidationWarning
dataclass
Represent a single SIR validation warning.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| variable | Canonical variable name or column name that triggered the warning. |
| check_type | Category of the check: … |
| message | Human-readable description of the issue. |
SIRValidationError
Bases: Exception
Raise when SIR validation fails in strict mode.
Wraps one or more SIRValidationWarning instances into a single
exception with a formatted multi-line message listing each warning.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| warnings | All validation warnings that triggered the error. |
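The wrapping behaviour described above can be sketched roughly as follows. Both classes are simplified stand-ins: the attribute types, the "nan" check category, and the message layout are assumptions, not the library's actual format.

```python
from dataclasses import dataclass


@dataclass
class SIRValidationWarning:
    # Stand-in with the three documented attributes (types assumed to be str).
    variable: str
    check_type: str
    message: str


class SIRValidationError(Exception):
    """Sketch: bundle warnings into one multi-line error message."""

    def __init__(self, warnings):
        self.warnings = list(warnings)
        # The message layout below is an assumption for illustration.
        lines = [f"SIR validation failed with {len(self.warnings)} warning(s):"]
        lines += [
            f"  [{w.check_type}] {w.variable}: {w.message}" for w in self.warnings
        ]
        super().__init__("\n".join(lines))


err = SIRValidationError(
    [SIRValidationWarning("elevation_m_mean", "nan", "all values are NaN")]
)
```

Keeping the original warning objects on the exception lets a caller inspect them individually instead of parsing the message text.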
Source code in src/hydro_param/sir.py
unit_abbreviation
Return the canonical abbreviation for a unit string.
Map a source unit string to a short, filesystem-safe abbreviation for
use in SIR variable names. Known units are looked up in _UNIT_TABLE;
unknown units are slugified by replacing non-alphanumeric characters with
underscores.
| PARAMETER | DESCRIPTION |
|---|---|
| units | Source unit string from the dataset registry (e.g., …). |

| RETURNS | DESCRIPTION |
|---|---|
| str | Short canonical abbreviation (e.g., …). |
Examples:
>>> unit_abbreviation("log10(cm/hr)")
'cm_hr'
>>> unit_abbreviation("%")
'pct'
>>> unit_abbreviation("")
''
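The lookup-then-slugify behaviour could be sketched as below. The table contents are hypothetical (only the entries implied by the examples above plus "m"), and the function name carries a _sketch suffix to make clear it is not the library's implementation. Whether the real function collapses runs of separators is not stated, so this version replaces each character individually.

```python
import re

# Hypothetical lookup table; the real _UNIT_TABLE contents are not shown here.
_UNIT_TABLE = {
    "": "",
    "%": "pct",
    "log10(cm/hr)": "cm_hr",
    "m": "m",
}


def unit_abbreviation_sketch(units: str) -> str:
    """Look up a known unit, otherwise slugify the raw string."""
    if units in _UNIT_TABLE:
        return _UNIT_TABLE[units]
    # Replace every non-alphanumeric character with an underscore.
    return re.sub(r"[^0-9A-Za-z]", "_", units)
```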
canonical_name
Generate a canonical SIR variable name.
Assemble a self-documenting variable name following the pattern
<base>_<unit_abbrev>_<stat>. Dimensionless quantities omit the
unit segment: <base>_<stat>. All base names are lowercased.
| PARAMETER | DESCRIPTION |
|---|---|
| name | Base variable name from the dataset registry (e.g., …). |
| units | Source unit string (e.g., …). |
| stat | Aggregation statistic (e.g., …). |

| RETURNS | DESCRIPTION |
|---|---|
| str | Canonical SIR variable name (e.g., …). |
Examples:
>>> canonical_name("elevation", "m", "mean")
'elevation_m_mean'
>>> canonical_name("land_cover", "", "frac")
'land_cover_frac'
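A minimal sketch of the assembly rule follows. To stay self-contained it takes an already-abbreviated unit (a simplification; the documented function accepts the source unit string), and the function name is hypothetical.

```python
def canonical_name_sketch(name: str, unit_abbrev: str, stat: str) -> str:
    """Join base, unit abbreviation, and statistic with underscores."""
    base = name.lower()
    # Dimensionless quantities have an empty abbreviation, so the unit
    # segment is dropped rather than leaving a double underscore.
    parts = [base, unit_abbrev, stat] if unit_abbrev else [base, stat]
    return "_".join(parts)
```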
build_sir_schema
build_sir_schema(
    resolved: Sequence[
        tuple[object, DatasetRequest, list[AnyVariableSpec]]
    ],
) -> list[SIRVariableSchema]
Auto-generate the SIR schema from stage 2 resolved datasets.
Walk the resolved dataset/variable tuples produced by
stage2_resolve_datasets() and create one SIRVariableSchema entry
per expected output column. For continuous variables, one entry is
created per (variable, statistic, year) combination. For categorical
variables, a single _frac entry is created per (variable, year)
-- individual fraction columns (e.g., land_cover_frac_42) are
generated dynamically during normalization.
| PARAMETER | DESCRIPTION |
|---|---|
| resolved | Output of … TYPE: Sequence[tuple[object, DatasetRequest, list[AnyVariableSpec]]] |

| RETURNS | DESCRIPTION |
|---|---|
| list[SIRVariableSchema] | One schema entry per expected SIR output column. Multi-year datasets produce year-suffixed entries (e.g., …). |
Notes
The schema is deterministic for a given set of resolved datasets: same input always produces the same schema entries in the same order.
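The enumeration rule (one entry per variable, statistic, and year for continuous variables; one frac entry per variable and year for categorical ones) can be sketched with plain tuples. The function name, the input shapes, and the iteration order are assumptions made for illustration; the real function works on resolved dataset objects and returns SIRVariableSchema entries.

```python
from itertools import product


def expected_columns(continuous, categorical, years):
    """Enumerate (variable, statistic, year) keys in a stable order.

    continuous: list of (variable, [statistics]) pairs
    categorical: list of variable names
    years: list of years, or [None] for a single-year dataset
    """
    keys = []
    for variable, stats in continuous:
        # One entry per (variable, statistic, year) combination.
        keys.extend((variable, stat, year) for stat, year in product(stats, years))
    for variable in categorical:
        # One "frac" entry per (variable, year); per-class columns come later.
        keys.extend((variable, "frac", year) for year in years)
    return keys


keys = expected_columns(
    continuous=[("elevation", ["mean", "std"])],
    categorical=["land_cover"],
    years=[2019, 2020],
)
```

Because the function only iterates over ordered lists, repeated calls with the same input yield the same keys in the same order, which is the determinism property noted above.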
apply_conversion
Apply a SIR unit conversion to an array of values.
Transform raw source values into canonical SIR units. This handles
conversions that occur at the pipeline/SIR boundary -- distinct from
the model-specific conversions in hydro_param.units which occur
inside derivation plugins.
| PARAMETER | DESCRIPTION |
|---|---|
| values | Input values in source units. |
| conversion | Conversion type: … |

| RETURNS | DESCRIPTION |
|---|---|
| NDArray[floating] | Converted values in canonical SIR units. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If … |
See Also
hydro_param.units.convert : Model-specific unit conversions (e.g., metres to feet).
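The two boundary conversions mentioned in this module (log10 to linear, Kelvin to Celsius) can be sketched in plain Python. This is a simplification: the documented function operates on NumPy arrays, and the conversion identifiers used here ("log10_to_linear", "kelvin_to_celsius") are hypothetical names. Raising ValueError for an unrecognized conversion is also an assumption about when the documented error occurs.

```python
def apply_conversion_sketch(values, conversion):
    """Convert a list of floats; the conversion names are hypothetical."""
    if conversion == "log10_to_linear":
        # log10(cm/hr) -> cm/hr
        return [10.0 ** v for v in values]
    if conversion == "kelvin_to_celsius":
        # K -> degrees Celsius
        return [v - 273.15 for v in values]
    # Assumed behaviour: reject anything that is not a known conversion.
    raise ValueError(f"Unknown SIR conversion: {conversion!r}")
```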
normalize_sir
normalize_sir(
    raw_files: dict[str, Path],
    schema: list[SIRVariableSchema],
    output_dir: Path,
    id_field: str,
) -> dict[str, Path]
Normalize raw per-variable CSV files to canonical SIR format.
Read raw CSVs produced by stage 4 (gdptools zonal statistics), rename
columns to canonical SIR names, apply unit conversions (e.g.,
log10-to-linear, Kelvin-to-Celsius), and write normalized per-variable
CSVs to output_dir/.
Categorical variables (e.g., NLCD land cover) have their fraction
columns renamed from <source>_<class> to <source>_frac_<class>.
Continuous variables are matched by statistic suffix and renamed to the
canonical <base>_<unit>_<stat> pattern.
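The categorical renaming rule can be sketched as a small string helper. The function name and the "hru_id" column in the usage note are hypothetical; the real function renames columns inside the CSV data rather than one string at a time.

```python
def rename_fraction_column(source: str, column: str) -> str:
    """Rename '<source>_<class>' to '<source>_frac_<class>'."""
    prefix = f"{source}_"
    if not column.startswith(prefix):
        # Not a fraction column for this source (e.g., the ID column).
        return column
    class_code = column[len(prefix):]
    return f"{source}_frac_{class_code}"
```

For example, rename_fraction_column("land_cover", "land_cover_42") gives "land_cover_frac_42", while an unrelated column such as "hru_id" is returned unchanged.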
| PARAMETER | DESCRIPTION |
|---|---|
| raw_files | Mapping of source variable key (e.g., …). TYPE: dict[str, Path] |
| schema | SIR variable schema entries from … TYPE: list[SIRVariableSchema] |
| output_dir | Directory to write normalized CSV files. Created if absent. TYPE: Path |
| id_field | Feature ID column name from … TYPE: str |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Path] | Mapping of canonical SIR name to normalized CSV file path. Variables that could not be matched or were missing columns are omitted and logged as warnings. |
Warnings
Logs a warning for each raw variable that has no matching schema entry or whose expected column is missing from the raw CSV. A summary warning is logged at the end if any variables were skipped.
Notes
Year-suffixed keys (e.g., "elevation_2020") are matched to their
base schema entry by stripping the _YYYY suffix. This supports
multi-year static datasets where each year produces a separate raw file.
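Stripping a trailing _YYYY suffix can be sketched with a regular expression. The helper name is hypothetical, and treating any four trailing digits as a year is an assumption about how the real matching works.

```python
import re

# Matches an underscore followed by exactly four digits at the end of the key.
_YEAR_SUFFIX = re.compile(r"_(\d{4})$")


def split_year_suffix(key: str):
    """Return (base_key, year); year is None when there is no _YYYY suffix."""
    match = _YEAR_SUFFIX.search(key)
    if match is None:
        return key, None
    return key[: match.start()], match.group(1)
```

With this helper, "elevation_2020" splits into ("elevation", "2020"), and the base key can then be looked up in the schema.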
normalize_sir_temporal
normalize_sir_temporal(
    temporal_files: dict[str, Path],
    schema: list[SIRVariableSchema],
    resolved: Sequence[
        tuple[object, DatasetRequest, list[AnyVariableSpec]]
    ],
    output_dir: Path,
) -> dict[str, Path]
Normalize temporal NetCDF files to canonical SIR format.
Read raw temporal NetCDFs from stage 4 (gdptools WeightGen/AggGen
output), rename data variables from native source names (OPeNDAP/CF
variable names like "daily_mean_shortwave_radiation_at_surface")
to canonical SIR names, apply unit conversions, and write normalized
per-variable NetCDFs.
A reverse lookup table maps native variable names to their corresponding
VariableSpec and SIRVariableSchema entries, using the
native_name field from the dataset registry.
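A reverse lookup of this kind might be built as in the sketch below. It uses simple (variable_name, native_name) pairs as a stand-in for VariableSpec objects, and the keep-first policy on collisions is an assumption; the documentation only states that a collision is logged as a warning. The "srad" names are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def build_native_lookup(specs):
    """Map native variable name -> registry variable name.

    specs: iterable of (variable_name, native_name) pairs, a simplified
    stand-in for the VariableSpec objects in the resolved datasets.
    """
    lookup = {}
    for variable_name, native_name in specs:
        if native_name in lookup:
            # Collision: keep the first mapping and warn (assumed policy).
            logger.warning(
                "native name %r maps to both %r and %r",
                native_name, lookup[native_name], variable_name,
            )
            continue
        lookup[native_name] = variable_name
    return lookup


lookup = build_native_lookup(
    [
        ("srad", "daily_mean_shortwave_radiation_at_surface"),
        ("srad_dup", "daily_mean_shortwave_radiation_at_surface"),
    ]
)
```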
| PARAMETER | DESCRIPTION |
|---|---|
| temporal_files | Mapping of dataset key (e.g., …). TYPE: dict[str, Path] |
| schema | SIR variable schema entries from … TYPE: list[SIRVariableSchema] |
| resolved | Resolved dataset entries from stage 2. Used to build the native name reverse lookup. TYPE: Sequence[tuple[object, DatasetRequest, list[AnyVariableSpec]]] |
| output_dir | Directory to write normalized NetCDF files. Created if absent. TYPE: Path |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Path] | Mapping of canonical name (with year suffix if applicable, e.g., …) … |
Warnings
Logs a warning if a native variable name appears in multiple
VariableSpec entries (collision), or if a data variable in the
raw NetCDF has no matching schema entry.
Notes
Year suffixes are extracted from the temporal_files keys (e.g.,
"gridmet_2020" yields "_2020") and appended to canonical
names to prevent multi-year collisions in the output mapping.
validate_sir
validate_sir(
    sir_files: dict[str, Path],
    schema: list[SIRVariableSchema],
    *,
    strict: bool = False,
) -> list[SIRValidationWarning]
Validate normalized SIR files against the expected schema.
Perform three categories of checks on the normalized SIR output:
- Completeness -- every schema variable has a corresponding file.
- NaN coverage -- warn if any variable is 100% NaN (likely a processing failure).
- Value range -- warn if values fall outside the schema's valid_range (e.g., fractions outside [0, 1]).
Temporal NetCDF files (.nc) are checked for completeness but
skipped for NaN/range checks (CSV-only validation).
| PARAMETER | DESCRIPTION |
|---|---|
| sir_files | Mapping of dataset-prefixed key to normalized file path (CSV or NetCDF). Keys use the format … TYPE: dict[str, Path] |
| schema | Expected SIR variable schema entries from … TYPE: list[SIRVariableSchema] |
| strict | If … TYPE: bool |

| RETURNS | DESCRIPTION |
|---|---|
| list[SIRValidationWarning] | Validation warnings. Empty list means all checks passed. |

| RAISES | DESCRIPTION |
|---|---|
| SIRValidationError | If … |
Notes
This function aligns with the project's fault-tolerance strategy:
production runs use strict=False (warnings are logged but
processing continues), while development/debugging uses
strict=True to catch issues early.
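The three checks and the strict switch can be sketched on in-memory data. Everything here is a simplified stand-in: the function name, the tuple-based warnings, the check labels ("completeness", "nan", "range"), and the use of ValueError in place of SIRValidationError are assumptions, and the real function reads files rather than Python lists.

```python
import math


def validate_columns(columns, expected, *, strict=False):
    """Check in-memory columns; return a list of (variable, check, message).

    columns: dict of canonical name -> list of floats
    expected: dict of canonical name -> (low, high) range, or None
    """
    warnings = []
    for name, valid_range in expected.items():
        values = columns.get(name)
        if values is None:
            warnings.append((name, "completeness", "no data for schema variable"))
            continue
        finite = [v for v in values if not math.isnan(v)]
        if values and not finite:
            warnings.append((name, "nan", "all values are NaN"))
            continue
        if valid_range is not None and finite:
            low, high = valid_range
            if min(finite) < low or max(finite) > high:
                warnings.append((name, "range", f"values outside [{low}, {high}]"))
    if strict and warnings:
        # The real function raises SIRValidationError; ValueError stands in here.
        raise ValueError("\n".join(f"{n}: {m}" for n, _, m in warnings))
    return warnings


result = validate_columns(
    {"a": [float("nan")], "b": [0.5, 1.5]},
    {"a": None, "b": (0.0, 1.0), "c": None},
)
```

With strict=False the caller receives the warnings and decides what to do with them; with strict=True the same findings stop the run at the first validation call.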