CAMELS Data Processing Pipeline

🚀 Key Features

  • Automated CAMELS downloader:

    • Downloads basin attributes, meteorological forcing data (including optional extended forcing), and observed streamflow data.
  • Data preparation pipeline:

    • Merges static basin attributes.
    • Processes Daymet, Maurer-extended, and NLDAS-extended forcing datasets.
    • Computes Potential Evapotranspiration (PET) using the Hargreaves method.
    • Calculates Runoff (the target variable used in the loading example below).
    • Generates a standardized NetCDF file (CAMELS.nc).
  • Data normalization and denormalization utilities to streamline modeling workflows.
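
The Hargreaves PET step listed above can be illustrated with a small standalone sketch. This is not the bundled PyETo code or the repository's implementation; it only assumes the standard FAO-56 formulation, in which PET depends on daily minimum and maximum temperature and on extraterrestrial radiation derived from latitude and day of year. The function names `extraterrestrial_radiation` and `hargreaves_pet` are hypothetical.

```python
import math

SOLAR_CONSTANT = 0.0820  # MJ m-2 min-1, value used in FAO-56


def extraterrestrial_radiation(lat_deg: float, day_of_year: int) -> float:
    """Daily extraterrestrial radiation Ra in MJ m-2 day-1 (FAO-56 formulation)."""
    lat = math.radians(lat_deg)
    angle = 2.0 * math.pi * day_of_year / 365.0
    inv_rel_dist = 1.0 + 0.033 * math.cos(angle)  # inverse relative Earth-Sun distance
    sol_dec = 0.409 * math.sin(angle - 1.39)      # solar declination in radians
    # Clamp so that polar day/night does not push acos outside its domain.
    cos_sha = max(-1.0, min(1.0, -math.tan(lat) * math.tan(sol_dec)))
    sha = math.acos(cos_sha)                      # sunset hour angle in radians
    return (24.0 * 60.0 / math.pi) * SOLAR_CONSTANT * inv_rel_dist * (
        sha * math.sin(lat) * math.sin(sol_dec)
        + math.cos(lat) * math.cos(sol_dec) * math.sin(sha)
    )


def hargreaves_pet(tmin_c: float, tmax_c: float, lat_deg: float, day_of_year: int) -> float:
    """Hargreaves PET in mm/day from daily min/max air temperature in deg C."""
    tmean = 0.5 * (tmin_c + tmax_c)
    temp_range = max(tmax_c - tmin_c, 0.0)        # guard against tmax < tmin in noisy data
    ra = extraterrestrial_radiation(lat_deg, day_of_year)
    # 0.408 converts radiation in MJ m-2 day-1 to equivalent evaporation in mm/day.
    return 0.0023 * (tmean + 17.8) * math.sqrt(temp_range) * 0.408 * ra


if __name__ == "__main__":
    # Example: latitude 20 deg S on day 246 with a 15 to 25 deg C temperature range.
    print(hargreaves_pet(15.0, 25.0, -20.0, 246))
```

Using only temperature and latitude keeps the method applicable to all three forcing products, since each provides daily minimum and maximum temperature.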


⚙️ Environment Setup

Python Version

Python ≥ 3.9 recommended (3.12 preferred)

Dependencies

Create and activate a conda environment, then install dependencies:

conda create -n rainflow-env python=3.12 -y
conda activate rainflow-env

pip install numpy pandas xarray tqdm netCDF4
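
After installation, a quick check along the following lines can confirm that the interpreter version and the packages listed above are visible in the active environment. It is a convenience sketch rather than part of the repository; the function name `check_environment` is hypothetical, and the check only tests that each package can be located, not that it imports cleanly.

```python
import importlib.util
import sys

# Import names of the packages installed with pip above.
REQUIRED_PACKAGES = ["numpy", "pandas", "xarray", "tqdm", "netCDF4"]


def check_environment(packages=REQUIRED_PACKAGES, min_python=(3, 9)):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    status = {"python": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```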

Note: The pyeto library (required for PET calculation) is not available via PyPI. It has been bundled within this repository under rainflow/data. Special thanks to the original PyETo authors for their implementation.


Quickstart Guide

Step 1: Download CAMELS Data

python 01.download_camels.py --output-dir ./data/CAMELS_raw
# To skip extended forcing datasets, use:
# python 01.download_camels.py --output-dir ./data/CAMELS_raw --no-extended

Step 2: Prepare NetCDF Data

python 02.prepare_camels.py \
  --input-dir ./data/CAMELS_raw \
  --output-dir ./data/CAMELS_processed \
  --start-date 1980-01-01 \
  --end-date 2014-12-31
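
The preparation step derives the Runoff variable, but this README does not state the formula. A common convention with CAMELS is to convert observed streamflow in cubic feet per second to an area-normalized depth in mm/day using the basin area in km² (for example `area_gages2`). The sketch below shows that conversion under those assumptions; the function name `cfs_to_mm_per_day` is hypothetical, and 02.prepare_camels.py may handle units differently.

```python
CFS_TO_M3_PER_S = 0.3048 ** 3  # one cubic foot expressed in cubic metres
SECONDS_PER_DAY = 86_400


def cfs_to_mm_per_day(streamflow_cfs: float, basin_area_km2: float) -> float:
    """Convert discharge in ft3/s to runoff depth in mm/day over the basin area."""
    if basin_area_km2 <= 0:
        raise ValueError("basin_area_km2 must be positive")
    volume_m3_per_day = streamflow_cfs * CFS_TO_M3_PER_S * SECONDS_PER_DAY
    area_m2 = basin_area_km2 * 1.0e6
    # Volume divided by area gives depth in metres; multiply by 1000 for millimetres.
    return volume_m3_per_day / area_m2 * 1000.0


if __name__ == "__main__":
    # Example: 100 ft3/s draining a 100 km2 basin.
    print(cfs_to_mm_per_day(100.0, 100.0))
```

Expressing the target as a depth per day puts it in the same units as precipitation and PET, which makes basins of different sizes comparable.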

📊 Example: Data Loading and Normalization

See 03.load_and_normalize_data.py for an example:

import os
import pandas as pd
import numpy as np
from types import SimpleNamespace
import rainflow
from rainflow.data.nc_reader import Preprocessor

camels_531_basin_file = os.path.join(os.path.dirname(rainflow.__file__), "data/camels/531_basin_list.txt")
camels_station_ids = pd.read_csv(camels_531_basin_file, header=None, dtype=str).values[:, 0].tolist()

config_dict = {
    'input_nc_file': './data/CAMELS_processed/CAMELS.nc',
    'train_date_list': ["1980-10-01", "1995-09-30"],
    'val_date_list': ["1980-10-01", "1985-09-30"],
    'test_date_list': ["1995-10-01", "2010-09-30"],
    'time_series_variables': ['daymet_prcp', 'daymet_srad', 'daymet_tmax', 'daymet_tmin', 'daymet_dayl', 'daymet_vp',
                              'daymet_pet'],
    'target_variables': ['Runoff'],
    'static_variables': ['elev_mean', 'slope_mean', 'area_gages2', 'frac_forest', 'lai_max', 'lai_diff', 'gvf_max',
                         'gvf_diff', 'soil_depth_pelletier', 'soil_depth_statsgo', 'soil_porosity', 'soil_conductivity',
                         'max_water_content', 'sand_frac', 'silt_frac', 'clay_frac', 'carbonate_rocks_frac',
                         'geol_permeability', 'p_mean', 'pet_mean', 'aridity', 'frac_snow', 'high_prec_freq',
                         'high_prec_dur', 'low_prec_freq', 'low_prec_dur'],
    'station_ids': camels_station_ids,
    'add_coords': False,
}

config_dataset = SimpleNamespace(**config_dict)

prep = Preprocessor(config_dataset)
# load training data without warmup days and without scaler (to fit a new scaler)
data_dict_train = prep.load_and_process(split="train", warmup_days=0, scaler=None)
norm_x_train = data_dict_train["norm_x"]  # [basins, time, features]
norm_y_train = data_dict_train["norm_y"]  # [basins, time, targets]
norm_c_train = data_dict_train["norm_c"]  # [basins, static_features]
date_range_train = data_dict_train["date_range"]
scaler = data_dict_train["scaler"]
# raw data without normalization
raw_x = data_dict_train["raw_x"]
raw_c = data_dict_train["raw_c"]
raw_y = data_dict_train["raw_y"]

# load test data with the scaler fitted on training data
data_dict_test = prep.load_and_process(split="test", warmup_days=0, scaler=scaler)
norm_x_test = data_dict_test["norm_x"]
norm_y_test = data_dict_test["norm_y"]
norm_c_test = data_dict_test["norm_c"]

# de-normalize target variable
de_norm_y_test = prep.inverse_transform(norm_y_test)
print("Difference in test target (raw vs de-normalized):", np.nanmax(np.abs(data_dict_test["raw_y"] - de_norm_y_test)))
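
The example above fits a scaler on the training split, reuses it for the test split, and then checks that de-normalization recovers the raw targets. The internals of `Preprocessor` are not shown in this README, so the following is only a minimal sketch of that pattern, assuming a plain per-feature z-score that ignores NaNs. The helpers `fit_scaler`, `transform`, and `inverse_transform` are hypothetical, and rainflow may use a different scheme.

```python
import numpy as np


def fit_scaler(x: np.ndarray) -> dict:
    """Per-feature mean and std over basins and time for x shaped [basins, time, features]."""
    mean = np.nanmean(x, axis=(0, 1))
    std = np.nanstd(x, axis=(0, 1))
    std = np.where(std > 0, std, 1.0)  # avoid division by zero for constant features
    return {"mean": mean, "std": std}


def transform(x: np.ndarray, scaler: dict) -> np.ndarray:
    """Normalize with statistics fitted elsewhere (typically on the training split)."""
    return (x - scaler["mean"]) / scaler["std"]


def inverse_transform(x_norm: np.ndarray, scaler: dict) -> np.ndarray:
    """Undo transform() using the same scaler."""
    return x_norm * scaler["std"] + scaler["mean"]


# Synthetic stand-ins for the train and test arrays.
rng = np.random.default_rng(0)
train = rng.gamma(shape=2.0, scale=3.0, size=(4, 100, 2))
test = rng.gamma(shape=2.0, scale=3.0, size=(4, 50, 2))
test[0, 0, 0] = np.nan  # missing values stay NaN through the round trip

scaler = fit_scaler(train)            # statistics come from training data only
norm_test = transform(test, scaler)   # test data reuses the training scaler
round_trip_error = np.nanmax(np.abs(inverse_transform(norm_test, scaler) - test))
print("max round-trip error:", round_trip_error)
```

Fitting on the training period alone keeps test-period statistics out of the model inputs, which is why the test call receives the scaler instead of fitting a new one.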

📜 Citation

If you use this pipeline or data in your research, please cite:

Liu, J., Bian, Y., Lawson, K., & Shen, C. (2024). Probing the limit of hydrologic predictability with the Transformer network. *Journal of Hydrology*, 637, 131389. https://doi.org/10.1016/j.jhydrol.2024.131389

Liu, J., Shen, C., O'Donncha, F., Song, Y., et al. (2025). From RNNs to Transformers: benchmarking deep learning architectures for hydrologic prediction. *EGUsphere*, 2025, 1-21. https://doi.org/10.5194/egusphere-2025-1706

🙌 Acknowledgments

Thanks to the following resources and tools that provided inspiration and support:


📄 License

This software is licensed under CC BY-NC 4.0. Academic and research use is explicitly permitted.

Written by

Jiangtao Liu

