Best Practices

Best Practices this Project Implements

The following practices provide a solid foundation for building reproducible, maintainable, and scalable data workflows:

Define reproducible pipelines with Snakemake or similar workflow engines
Validate data explicitly using Pandera
Use Cookiecutter or Copier templates to scaffold projects and add pipeline steps
Manage dependencies with Conda (or mamba, micromamba)
Use SLURM or container-based execution for scaling
Store intermediate outputs as Parquet or other file-based formats
Publish data outputs to open repositories such as Zenodo (EU), Open Energy Data Initiative (OEDI, US Dept. of Energy), Kaggle, or Harvard Dataverse
Add type annotations and run mypy to catch type errors early
Enforce code quality with Ruff and Black
Run tests in isolated environments with tox
Use GitHub Actions or a similar service for continuous integration (CI), and practice trunk-based development
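
Two of the practices above, explicit data validation and type annotations, can be illustrated together. The following is a minimal standard-library sketch of the idea, not this project's code: in the project itself, Pandera would express checks like these declaratively as a schema. The `Reading` record, its field names, and the `validate_readings` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One row of a hypothetical intermediate table (illustrative only)."""

    site_id: str
    power_kw: float


def validate_readings(rows: list[Reading]) -> list[Reading]:
    """Return the rows unchanged if every check passes, else raise ValueError.

    All failures are collected first so one run reports every bad row.
    """
    errors: list[str] = []
    for i, row in enumerate(rows):
        if not row.site_id:
            errors.append(f"row {i}: site_id is empty")
        if row.power_kw < 0:
            errors.append(f"row {i}: power_kw is negative ({row.power_kw})")
    if errors:
        raise ValueError("; ".join(errors))
    return rows
```

Because the function is fully annotated, mypy can flag a caller that passes, say, a list of plain dictionaries before the pipeline runs. Validating at the boundary of each pipeline step means a malformed intermediate file stops the workflow at that step instead of propagating into later outputs.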