Forome

Forome Platform

About the Forome Association

We are a team of experts in data management, governance, engineering, and bioinformatics. We founded the Forome Open Genomics community to accelerate Mendelian disease diagnostics and have since expanded our focus to making research data trustworthy and verifiable for the age of AI. Our work spans reproducible data engineering, data governance, population and environmental health, clinical genomics, and semantic models of scientific evidence — and everything we build is open source.

Learn more at the Forome Association home page.

Dorieh — an open-source platform for building reproducible, verifiable data pipelines — the foundation for trustworthy data in the age of AI.
Research Data that Can Be Trusted — our SpringerBriefs book that introduces the provenance framework behind Dorieh and applies it to healthcare claims data.
AnFiSA — a fully open-source platform for variant curation in rare genetic disease, built for contributions from clinicians, researchers, and developers.
Research projects — open research that underpins and extends our platforms, including a semantic model of genetic evidence (GEM) and synthetic healthcare claims datasets for testing provenance and data-quality methods.

Dorieh

Evidence-based data for evidence-based AI.

Dorieh is our open-source platform for building reproducible, verifiable data pipelines — the foundation for trustworthy data in the age of AI.

As AI increasingly writes pipelines and models, the old basis for trust — having a human read and understand the code — no longer scales. Dorieh shifts that basis from explain → understand to formalize → validate: it makes what was actually done to the data something a machine can check, on every run.

It does this through actionable provenance:

Provenance — structured, queryable records of what happened to the data, captured automatically as workflows execute (not static narrative PDFs).
Rules — formal, machine-checkable predicates over those records, written to be read by clinicians, regulators, and governance experts, not only engineers.
Actions — compliance attestations, audit trails, quality assertions, and drift alerts produced as outputs of the pipeline itself. Compliance stops being a document and becomes a query.
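
The three parts above can be illustrated with a minimal Python sketch. This is not Dorieh's API: the `ProvenanceRecord` schema, the `Rule` class, the `evaluate` function, and the sample thresholds are all hypothetical names and values chosen only to show how a rule can be a predicate over provenance records and how an attestation can be a computed output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProvenanceRecord:
    """One structured fact about a pipeline step (hypothetical schema)."""
    step: str
    rows_in: int
    rows_out: int
    tool_version: str


@dataclass(frozen=True)
class Rule:
    """A named, machine-checkable predicate over one provenance record."""
    name: str
    applies_to: str
    predicate: Callable[[ProvenanceRecord], bool]


def evaluate(records, rules):
    """Return one attestation line per (rule, matching record) pair."""
    attestations = []
    for rule in rules:
        for rec in records:
            if rec.step == rule.applies_to:
                status = "PASS" if rule.predicate(rec) else "FAIL"
                attestations.append(f"{status}: {rule.name} [{rec.step}]")
    return attestations


# Illustrative records, as if captured during one pipeline run.
records = [
    ProvenanceRecord("deduplicate", rows_in=1000, rows_out=990, tool_version="1.2.0"),
    ProvenanceRecord("cleanse", rows_in=990, rows_out=400, tool_version="1.2.0"),
]

# Illustrative rules; the 10% threshold is an assumption, not a Dorieh default.
rules = [
    Rule("dedup never adds rows", "deduplicate",
         lambda r: r.rows_out <= r.rows_in),
    Rule("cleansing drops under 10% of rows", "cleanse",
         lambda r: r.rows_out >= 0.9 * r.rows_in),
]

report = evaluate(records, rules)
for line in report:
    print(line)
```

In this sketch the second rule fails because the cleansing step dropped far more than 10% of its input, so the compliance question is answered by evaluating the records rather than by reading a document.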

Under the hood, Dorieh runs portable workflows using the Common Workflow Language (CWL) and Infrastructure as Code, so results can be reproduced on confidential data by sharing infrastructure rather than the data itself. It ships with production pipelines for population and environmental health — CMS Medicare & Medicaid claims (via ResDAC) and climate and air-pollution data — with built-in cleansing, deduplication, and quality control.

Explore: Documentation · Repository

Research Data that Can Be Trusted

The book behind Dorieh, by Michael Bouzinier, Dmitry Etin, Naeem Khoshnevis, Max Shad, and Scott Yockel:

argues for the need for a new approach to data provenance;
introduces the concept of descriptive dataflow operators; and
applies the framework to analyze healthcare claims data quality, revealing insights into inconsistencies and deficiencies.

Read it on Springer · Download the flyer (PDF)

ISBN 978-3-032-21032-6 · DOI 10.1007/978-3-032-21032-6

AnFiSA

Variant curation for rare genetic disease.

AnFiSA is an established, fully open-source computational platform for the analysis of sequencing data for rare genetic disease — a variant curation tool designed to invite and accept contributions from clinicians, researchers, and professional software developers.

Its design rests on three architectural principles:

a multidimensional DBMS for genomic data to support reproducibility;
curated decision trees adaptable to changing clinical rules; and
a crowdsourcing-friendly interface for difficult-to-diagnose cases.
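
To make the second principle concrete, here is a minimal Python sketch of a decision tree kept as data, so that a rule can be changed without touching the evaluation code. It is not AnFiSA's implementation or rule syntax: the `Variant` fields, the `DECISION_TREE` contents, the 1% frequency cutoff, and the `classify` function are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A tiny, hypothetical subset of variant annotations."""
    gene: str
    population_af: float  # allele frequency in a reference population
    consequence: str      # predicted consequence term


# Ordered list of (label, condition, decision). The first matching entry wins.
# The entries and the 0.01 cutoff are illustrative, not clinical guidance.
DECISION_TREE = [
    ("common in population",
     lambda v: v.population_af > 0.01, "reject"),
    ("loss of function",
     lambda v: v.consequence in {"stop_gained", "frameshift_variant"}, "keep"),
    ("missense",
     lambda v: v.consequence == "missense_variant", "review"),
]


def classify(variant, tree=DECISION_TREE, default="reject"):
    """Walk the tree in order and return (decision, label of the rule that fired)."""
    for label, condition, decision in tree:
        if condition(variant):
            return decision, label
    return default, "no rule matched"


rare_lof = Variant("GENE1", population_af=0.0001, consequence="stop_gained")
common_lof = Variant("GENE1", population_af=0.05, consequence="stop_gained")
print(classify(rare_lof))
print(classify(common_lof))
```

Because each decision is returned together with the label of the rule that produced it, the same run can be repeated and audited after the rules are revised.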

Read our research article in the Journal of Biomedical Informatics.

To start using AnFiSA, begin with the deployment repository; the easiest way to run it is with docker-compose.

Key repositories: Backend & REST API · Frontend · Deployment

Documentation: User documentation · Development documentation

Research projects

Open research that underpins and extends our platforms:

genetic-evidence-model — A Semantic Model of Genetic Evidence. A framework — SHACL shapes, a curated annotation corpus, and versioned protocols — for representing genetic evidence from the biomedical literature in a form suitable for variant interpretation, automated reasoning, and AI-ready clinical infrastructure. Paper in submission.

synthetic-resdac-claims (synthmed) — generates synthetic ResDAC / Medicare claims (MEDPAR, MBSF) as FTS-conformant fixed-width files, rolling a synthetic beneficiary cohort forward year by year with realistic data-quality errors injected. This allows Dorieh pipelines to be developed and tested without confidential CMS data. Companion dataset on Zenodo (CC-BY-4.0); methods paper in submission.
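
The general idea can be sketched in a few lines of Python. This is not synthmed's code: the four-field `LAYOUT`, the helper names, and the single error type (a blanked state code) are assumptions made for illustration, and real field positions would come from the FTS layouts rather than from this toy table.

```python
import random

# Hypothetical layout as (field name, width); not a real MEDPAR or MBSF layout.
LAYOUT = [("bene_id", 10), ("year", 4), ("state", 2), ("age", 3)]


def format_record(record):
    """Render a dict as one fixed-width line, left-justified and space-padded."""
    return "".join(str(record[name]).ljust(width)[:width] for name, width in LAYOUT)


def parse_record(line):
    """Split a fixed-width line back into stripped string fields."""
    out, pos = {}, 0
    for name, width in LAYOUT:
        out[name] = line[pos:pos + width].strip()
        pos += width
    return out


def roll_forward(cohort, year):
    """Carry every beneficiary into the given year, one year older."""
    return [{**b, "year": year, "age": b["age"] + 1} for b in cohort]


def inject_errors(cohort, rng, rate):
    """Blank the state code in roughly `rate` of records to mimic missing data."""
    return [{**b, "state": ""} if rng.random() < rate else b for b in cohort]


rng = random.Random(42)  # fixed seed so the synthetic output is repeatable
cohort = [
    {"bene_id": f"B{i:09d}", "year": 2019, "state": "MA", "age": 65 + i}
    for i in range(5)
]
next_year = roll_forward(cohort, 2020)
lines = [format_record(b) for b in inject_errors(next_year, rng, rate=0.2)]
for line in lines:
    print(repr(line))
```

A pipeline under test can then be checked on whether it detects the blanked state codes, without any confidential data being involved.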

Contribute or provide feedback

If you would like to participate in our projects, please reach out. We welcome contributions from clinicians, researchers, and professional software developers — including source code, documentation, Frequently Asked Questions, and proposals for new use cases. We also value feedback on existing functionality: please open a GitHub issue or contact us directly.
