
Whitespace

TODO-copier-package add intro about why we care about the whitespace. Say how this project was created with able-workflow-copier to address the elements of this whitespace.

Comparative Table: Projects and Their Tooling Practices

| Project & Link | Description | Tooling notes (columns: Snakemake, Pandera, Template, Conda, HPC / Container, Flat Files, Open Data, mypy, Ruff / Black, tox, GH Actions) |
| --- | --- | --- |
| Mutator Mapping (eLife, 2024) | Germline mutation pipeline using Snakemake + Pandera. | ✅ (TSV/CSV) |
| CapCruncher (Simpson Lab, 2023) | HPC-ready Capture-C analysis. Snakemake + Pandera, PyArrow. | ✅ (SLURM / Singularity) |
| VPMBench (BMC Bioinfo, 2021) | Variant prioritization benchmarking suite. | ✅ (Docker); ✅ (CSV/VCF) |
| aPhyloGeo-COVID (SciPy, 2023) | SARS-CoV-2 phylogeography platform w/ Pandera + Snakemake. | ✅ (CSV/FASTA) |
| SeqNado (Milne Lab, 2025) | NGS workflow using Snakemake + Pandera + SLURM. | |
| Cookiecutter Data Snake | Cookiecutter Snakemake pipeline template w/ pre-commit. | ✅ (user defined) |
| Kedro (QuantumBlack) | Python ML pipeline framework with cataloging & CLI. | ❌ (Own engine); ✅ (via plugin); ✅ (Docker, plugin-based SLURM) |
| Simpson Lab Template | Cookiecutter template for bioinformatics workflows. | ◐ (User-added SLURM profile) |
| Bibat (2024) | Bayesian analysis template w/ Pandera + Copier + Make. | ❌ (Make) |
| MLOps Python Template | Python ML template using Pandera, Ruff, Docker, mypy. | ✅ (Docker); ✅ (Parquet) |
| Cookiecutter Data Science (v2) | Popular DS project structure template (Make, no Snakemake). | ◐ (Make); ◐ (Docker optional) |
| Khuyen Tran Template | Modern DS template (Hydra, DVC, Poetry, Pandera optional). | ❌ (DVC) |

Whitespace: Gaps in Adoption and What They Cost

Not every project needs every tool listed above, but too many projects skip most of them. Across open-source and academic workflows, we still see long scripts with unstated assumptions about their inputs, notebooks with no tests, and no easy way to rerun or scale the work. This is more than a matter of preference: it affects trust. If a model can’t be retrained or an analysis can’t be rerun, it’s hard to be confident in the conclusions. If a pipeline fails silently when upstream data changes, teams waste time chasing bugs. And without CI, small changes can break downstream results without warning.

Adoption of data validation (such as Pandera) and static type checking (such as mypy) remains particularly sparse. These tools give early warning of problems that are otherwise easy to miss. Similarly, while Snakemake is growing in popularity, many pipelines still rely on manual scripts or stitched-together notebooks, which makes reproducibility harder. Data sharing practices also lag behind: few projects publish intermediate or final outputs to open repositories, so even when the pipeline is solid, its outputs aren’t always easy to verify.
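The early-warning idea can be sketched without any third-party dependency. The snippet below is a hand-rolled stand-in, not Pandera's API: `ColumnRule`, `SchemaViolation`, `validate_rows`, and the `sample_id` / `read_count` columns are hypothetical names chosen for illustration. Pandera expresses the same kind of rule declaratively, and mypy would check the type annotations statically.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ColumnRule:
    """Expected type and optional lower bound for one column (hypothetical helper)."""

    dtype: type
    minimum: Optional[float] = None


class SchemaViolation(ValueError):
    """Raised when a row does not match the expected schema."""


def validate_rows(
    rows: List[Dict[str, Any]], schema: Dict[str, ColumnRule]
) -> List[Dict[str, Any]]:
    """Return the rows unchanged, or raise on the first violation."""
    for index, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            raise SchemaViolation(f"row {index}: missing columns {sorted(missing)}")
        for name, rule in schema.items():
            value = row[name]
            if not isinstance(value, rule.dtype):
                raise SchemaViolation(
                    f"row {index}: {name!r} should be {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
            if rule.minimum is not None and value < rule.minimum:
                raise SchemaViolation(
                    f"row {index}: {name!r} is below the minimum {rule.minimum}"
                )
    return rows


# Assumed example schema; the column names are illustrative only.
SCHEMA = {
    "sample_id": ColumnRule(str),
    "read_count": ColumnRule(int, minimum=0),
}

good = [{"sample_id": "s1", "read_count": 120}]
validate_rows(good, SCHEMA)  # passes silently

# Simulated upstream change: read_count now arrives as a string.
drifted = [{"sample_id": "s2", "read_count": "120"}]
try:
    validate_rows(drifted, SCHEMA)
except SchemaViolation as error:
    print(f"caught early: {error}")
```

The point of the sketch is where the failure happens: at the pipeline boundary, with a message naming the row and column, rather than several steps downstream.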

This leaves a lot of room for improvement, and a lot of opportunity. Templates that bake in best practices can help new projects get started on the right foot. Pipelines that combine validation, workflows, CI, and environment isolation reduce technical debt and make collaboration easier. Teams that invest in these foundations spend less time debugging and more time delivering value. It’s not about compliance; it’s about working smarter.

And the bar isn’t that high. A pyproject.toml with Ruff and Black, a tox.ini with a basic test suite, and a GitHub Actions file to run them: those three files alone change the trajectory of a project. Add Snakemake, Pandera, and environment pinning, and you’ve got a pipeline that others can run, inspect, and trust. The goal isn’t to make things complicated. The goal is to make the important parts easy to understand and hard to break.
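To make "those three files" concrete, here is a minimal sketch that writes them into a fresh directory. Everything specific in it is an assumption for illustration: the `scaffold` helper, the `ci.yml` file name, the Python version, the line length, and `skipsdist = true` (which assumes there is no installable package yet). A real project would tune each of these.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed minimal contents; adjust versions and options for a real project.
FILES: Dict[str, str] = {
    "pyproject.toml": """\
[tool.ruff]
line-length = 88

[tool.black]
line-length = 88
""",
    "tox.ini": """\
[tox]
envlist = py311
skipsdist = true

[testenv]
deps = pytest
commands = pytest
""",
    ".github/workflows/ci.yml": """\
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install tox ruff black
      - run: ruff check .
      - run: black --check .
      - run: tox
""",
}


def scaffold(root: Path) -> List[Path]:
    """Write the three config files under root and return their paths."""
    written = []
    for relative, content in FILES.items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Write into a throwaway directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp)):
            print(path.relative_to(tmp))
```

A project template does essentially this at generation time, with the values filled in from the answers to its prompts.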

Additional projects

Inspirational Copier templates

GitHub copier-template topic.

| Project | tox | pyproject.toml | pytest | mkdocs | Comments |
| --- | --- | --- | --- | --- | --- |
| python-copier-template | ❌ | ✅ | ✅ | ✅ | |
| python-template | ❌ | ❌ | ✅ | ❌ | copier.yml references yamls in copier/ |
| ss-python | ❌ | ✅ | ✅ | ❌ sphinx | Has its own package for some reason |
| python-project-template | ✅ | ✅ | ✅ | ❌ sphinx | |
| oca-addons-repo-template | ❌ | ✅ | ✅ plumbum | ❌ | |