A CLI tool for asserting properties of files using a simple declarative syntax. Designed for validating pipeline outputs in bioinformatics workflows.
Install with cargo (see Installation for Docker):
cargo install bioassert
Write your checks to an assertions file. Each line is <file> <metric> <comparator> <value>; blank lines and lines
beginning with # are ignored, and whitespace between fields is flexible so you can align columns for readability:
# assertions.txt
# Outputs were created and are non-trivial
output.bam file.exists eq true
output.bam file.size gt 1MB
results.vcf file.empty eq false
# The sample sheet has the columns we expect
samples.csv csv.columns.count eq 5
samples.csv csv.lines.count gte 2
# BAM header provenance is intact
output.bam bam.header.rg.count gte 1
output.bam bam.header.hd.so eq coordinate
# The reference genome has the right number of contigs
ref.fasta fasta.seq.count eq 25
Run them:
bioassert run assertions.txt
run defaults to assertions.txt when you give no path, so plain bioassert run works too.
Each assertion prints a PASS./FAIL. line, and the process exits non-zero if any assertion fails or errors, so it
gates a pipeline step cleanly:
PASS. Expected output.bam file.exists == true, got true
PASS. Expected output.bam file.size > 1MB, got 5.00MB
PASS. Expected results.vcf file.empty == false, got false
PASS. Expected samples.csv csv.columns.count == 5, got 5
PASS. Expected samples.csv csv.lines.count >= 2, got 101
PASS. Expected output.bam bam.header.rg.count >= 1, got 2
PASS. Expected output.bam bam.header.hd.so == coordinate, got coordinate
PASS. Expected ref.fasta fasta.seq.count == 25, got 25
The same lines are written to a report file (here assertions.txt.log) so they can be captured as a pipeline artifact.
See Writing assertions for more examples, Metrics for the full reference, and
Results and exit codes for how exit codes map to outcomes.
cargo install bioassert
docker pull ghcr.io/peterkneale/bioassert
Mount your working directory to /data and use that path prefix in your assertions. To run an assertions file:
docker run --rm -v "$PWD":/data ghcr.io/peterkneale/bioassert run /data/assertions.txt
A single assertion can also be passed inline (see the assert subcommand below):
docker run --rm -v "$PWD":/data ghcr.io/peterkneale/bioassert assert "/data/output.bam file.exists eq true"
bioassert [OPTIONS] <COMMAND>
| Command | Description |
|---|---|
| run [file] | Evaluate every assertion in a file (defaults to assertions.txt). Blank lines and # comments are skipped. |
| assert "<...>" | Evaluate a single assertion passed as one quoted string. |
run is the primary workflow: keep your checks in a version-controlled assertions file and run them as a pipeline step.
assert is handy for one-off checks and quick experimentation at the shell.
| Option | Default | Description |
|---|---|---|
| --report-file <FILE> | derived | Write the assertion report to FILE instead of the default location (see Output below). |
| --color <WHEN> | auto | When to use ANSI color in console output: auto, always or never (see Color below). Also spelled --colour. |
| --icons <WHEN> | auto | When to prefix PASS/FAIL/ERROR console lines with a status icon: auto, always or never (see Icons below). |
| -h, --help | | Print help. |
| -V, --version | | Print the version. |
--report-file, --color and --icons are global, so they may appear either before or after the subcommand. For example, bioassert run checks.txt --color=never and bioassert --color=never run checks.txt are equivalent.
Result lines are written to stdout, one per assertion: PASS. when the assertion holds and FAIL. when it does not. Errors (invalid syntax, an unknown metric, an unreadable file) are written to stderr as ERROR. lines. The process exit code reflects the worst outcome across all assertions:
| Exit code | Meaning |
|---|---|
| 0 | Every assertion passed. |
| 1 | At least one assertion failed (but none errored). |
| 2 | At least one assertion could not be evaluated. |
This makes bioassert easy to gate a pipeline step on: a non-zero exit halts the workflow.
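The exit-code contract above can be sketched in Python. This is only an illustration of the documented mapping, not bioassert's own code; the names `worst_exit_code` and `gate` are hypothetical, and treating an empty run as exit code 0 is an assumption.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "every assertion passed",
    1: "at least one assertion failed",
    2: "at least one assertion could not be evaluated",
}


def worst_exit_code(outcomes):
    """Map per-assertion outcomes ("PASS", "FAIL", "ERROR") to the worst exit code."""
    severity = {"PASS": 0, "FAIL": 1, "ERROR": 2}
    # Assumption: a run with no assertions counts as success (0).
    return max((severity[outcome] for outcome in outcomes), default=0)


def gate(assertions_file="assertions.txt"):
    """Hypothetical wrapper: run `bioassert run <file>` and raise on a non-zero exit."""
    result = subprocess.run(["bioassert", "run", assertions_file])
    if result.returncode != 0:
        meaning = EXIT_MEANINGS.get(result.returncode, "unexpected exit code")
        raise RuntimeError(f"bioassert exited {result.returncode}: {meaning}")
```

A caller that wants to distinguish a failed check (1) from a broken check (2) can branch on the return code instead of raising.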
Alongside the console output, bioassert writes an assertion report: a file built from the results of every executed assertion. It is plain text, one PASS./FAIL./ERROR. line per assertion, with no level or module prefixes, no timestamps, no ANSI color and no icons, so it stays easy to read and grep. Its location is resolved as follows:
- --report-file <FILE> if given.
- Otherwise for run <file>, the derived path <file>.log (for example checks.txt reports to checks.txt.log).
- Otherwise assertions.log in the current directory.
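The resolution order above amounts to a three-way fallback. A minimal Python sketch, assuming a hypothetical helper `resolve_report_path` (this is not bioassert's actual code) where `run_file` is the path handed to run and is None for the assert subcommand:

```python
from typing import Optional


def resolve_report_path(report_file: Optional[str], run_file: Optional[str]) -> str:
    """Return the report location following the documented order of precedence."""
    if report_file is not None:
        # 1. An explicit --report-file always wins.
        return report_file
    if run_file is not None:
        # 2. For `run <file>`, derive the path by appending ".log".
        return run_file + ".log"
    # 3. Otherwise fall back to assertions.log in the current directory.
    return "assertions.log"
```

For instance, `resolve_report_path(None, "checks.txt")` gives `checks.txt.log`, matching the example in the list.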
For example, the report for a failing run reads:
PASS. Expected output.bam file.exists == true, got true
FAIL. Expected results.vcf file.lines > 999, got 0
--color controls ANSI color on the console only. The leading PASS keyword is colored green, and FAIL/ERROR are colored red. The assertion report file is never colored, regardless of this setting. The flag is also accepted spelled --colour.
| Value | Behaviour |
|---|---|
| auto | Color only when stdout is a terminal and the NO_COLOR environment variable is unset. This is the default, so output stays plain when piped or redirected. |
| always | Always color, even when piped. Overrides NO_COLOR. |
| never | Never color. |
auto follows the NO_COLOR convention: setting NO_COLOR to any non-empty value disables color. An explicit --color=always takes priority over it.
--icons prefixes each console result line with a status icon: 🟢 for PASS, 🔴 for FAIL and 💥 for ERROR. It takes the same auto/always/never values as --color and follows the same resolution logic, but controls the icons rather than the color:
| Value | Behaviour |
|---|---|
| auto | Icons only when stdout is a terminal and NO_COLOR is unset. This is the default, so output stays plain when piped or redirected. |
| always | Always show icons, even when piped. Overrides NO_COLOR. |
| never | Never show icons. |
Color and icons are resolved independently, so either can be on while the other is off. The assertion report file is never iconified, so it stays easy to grep.
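The shared auto/always/never logic can be written as one small function that is called once for --color and once for --icons. This is a sketch of the rules as described here, with a hypothetical name `resolve_when`; it is not taken from bioassert's source.

```python
import os
import sys


def resolve_when(when: str, stdout_is_tty: bool, env) -> bool:
    """Decide whether a console decoration (color or icons) is enabled."""
    if when == "always":
        # An explicit "always" overrides NO_COLOR and piping.
        return True
    if when == "never":
        return False
    if when == "auto":
        # NO_COLOR convention: any non-empty value disables the decoration.
        return stdout_is_tty and not env.get("NO_COLOR")
    raise ValueError(f"expected auto, always or never, got {when!r}")


if __name__ == "__main__":
    # Color and icons are resolved independently, so each gets its own call.
    use_color = resolve_when("auto", sys.stdout.isatty(), os.environ)
    use_icons = resolve_when("never", sys.stdout.isatty(), os.environ)
    print(use_color, use_icons)
```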
An assertions file has one assertion per line:
<file> <metric> <comparator> <value>
Lines beginning with # are comments and blank lines are ignored. Whitespace between fields is flexible, so columns can
be aligned for readability. Pass the file to bioassert run.
Bare values must be alphanumeric. Values that contain dots, dashes, colons, or spaces (read-group IDs, version strings,
sequence names, descriptions) must be single- or double-quoted: 'H0164.2', '1.6', 'Solexa-272222',
'NC_000001.11', 'Homo sapiens chromosome 1'.
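To make the line format concrete, here is a rough Python sketch that splits an assertions file into its four fields. It leans on `shlex` for quote handling, which is an assumption: bioassert's real grammar may differ (for example, this sketch does not reject unquoted values containing dots or dashes). `Assertion` and `parse_assertions` are hypothetical names.

```python
import shlex
from typing import List, NamedTuple


class Assertion(NamedTuple):
    file: str
    metric: str
    comparator: str
    value: str


def parse_assertions(text: str) -> List[Assertion]:
    """Parse '<file> <metric> <comparator> <value>' lines, skipping blanks and comments."""
    parsed = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # shlex removes the surrounding quotes and treats runs of whitespace as one separator.
        fields = shlex.split(stripped, comments=True)
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 fields, got {len(fields)}")
        parsed.append(Assertion(*fields))
    return parsed
```

Quoted values such as 'Homo sapiens chromosome 1' come back as a single field with the quotes stripped.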
output.bam file.exists eq true # file exists
results.vcf file.empty eq false # not empty
output.bam file.size gte 1MB # at least 1 MB
results.tsv file.lines eq 1000 # exactly 1000 lines
results.tsv file.lines gte 1 # at least one line
Replace csv with tsv (tab-separated) or psv (pipe-separated) for those formats. Lines and columns are 1-indexed.
samples.csv csv.columns.count eq 3 # 3 columns
counts.tsv tsv.lines.count gt 10 # more than 10 lines
report.psv psv.line.2.column.1 eq Alice # cell value
samples.csv csv.line.2.column.3 starts New # cell prefix
results.tsv tsv.line.2.column.2 matches '^[0-9]+$' # cell regex
junctions.tsv tsv.column.6.data.all matches '^[+-]$' # every data row in column 6
junctions.tsv tsv.column.4.all matches '^JUNC[0-9]+$' # every row, header included
column.N.all applies the comparison to every cell in column N (1-indexed) and passes only when they all hold;
column.N.data.all does the same but skips the header (line 1), which is what you usually want for a file with a
header row. A header-only or empty file passes vacuously. On failure the report names the first offending row and its
value (e.g. got line 3 = "NDA"). Any string comparator works, not just matches.
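The difference between column.N.all and column.N.data.all is easiest to see in code. The following Python sketch mimics the behaviour described above (1-indexed columns, optional header skip, vacuous pass, first offending line reported); `column_all` is a hypothetical helper, not bioassert's implementation.

```python
import csv
import io
import re


def column_all(text, column, predicate, skip_header=False, delimiter="\t"):
    """Check predicate against every cell in a 1-indexed column.

    Returns (True, None, None) when all cells hold, otherwise
    (False, line_number, value) for the first offending cell.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    start = 1 if skip_header else 0
    for line_no, row in enumerate(rows[start:], start=start + 1):
        value = row[column - 1]
        if not predicate(value):
            return False, line_no, value
    # No rows checked, or every row held: the assertion passes.
    return True, None, None


def is_strand(value):
    """Equivalent of: matches '^[+-]$'."""
    return re.search(r"^[+-]$", value) is not None


table = "name\tstrand\nj1\t+\nj2\tNDA\n"
print(column_all(table, 2, is_strand, skip_header=True))
```

With the header skipped, the first failure is line 3 with value NDA; without skip_header the header cell itself (line 1) would be the first failure.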
Assertions on a BAM file's SAM header live under bam.header.*. Read groups (@RG) are addressed by 0-based index.
Values containing dots, dashes or colons (read-group IDs, library names, ISO dates, version strings) must be quoted.
output.bam bam.header.rg.count gte 1 # at least one read group
output.bam bam.header.rg.0.sm eq NA12878 # sample name (callers group by this)
output.bam bam.header.rg.0.pl eq ILLUMINA # sequencing platform
output.bam bam.header.rg.0.lb eq 'Solexa-272222' # library name (quoted: dash)
output.bam bam.header.rg.1.id eq 'H0164.2' # second read-group ID (quoted: dot)
output.bam bam.header.rg.0.pu.present eq true # a platform-unit tag is set
output.bam bam.header.hd.so eq coordinate # file is coordinate-sorted
output.bam bam.header.sq.count eq 1 # exactly one reference sequence
Assertions on a FASTA file's records live under fasta.seq.*, with fasta.length as a whole-file total across all
records. Records are addressed by 0-based index, following file order. Sequence names and descriptions that contain
dots, dashes, colons, or spaces must be quoted.
ref.fasta fasta.seq.count eq 25 # 25 records
ref.fasta fasta.length gte 3000000000 # total bases across all records
ref.fasta fasta.seq.0.name eq chr1 # first record's name
ref.fasta fasta.seq.0.length gte 20000000 # first record's length
ref.fasta fasta.seq.0.description eq 'Homo sapiens chromosome 1' # description (quoted: spaces)
ref.fasta fasta.seq.0.name matches '^chr[0-9XYM]+$' # name matches a regex
ref.fasta fasta.seq.24.present eq true # a record exists at this index
For a one-off check you can skip the file and pass a single assertion to the assert subcommand:
bioassert assert "output.bam file.size gte 1MB"
On a terminal, auto shows both color and icons by default. To force them on (for example when piping into a pager that understands ANSI), use --color=always --icons=always:
bioassert --color=always --icons=always run checks.txt
Each result line is prefixed with its status icon, and the leading keyword is colored (green PASS, red FAIL/ERROR):
🟢 PASS. Expected output.bam file.exists == true, got true
🔴 FAIL. Expected results.vcf file.lines > 999, got 0
💥 ERROR. unknown metric: file.explode
(PASS/FAIL lines go to stdout and ERROR lines to stderr; the icons render here, but the green/red keyword coloring only shows on an ANSI-capable terminal.)
bioassert works well as a validation step in a Nextflow pipeline. Write the
assertions to a file, then run them with the run subcommand. The --report-file flag persists the report
so it can be captured as a process output, and a non-zero exit code (1 for a failed assertion, 2 for an error)
fails the task automatically.
process REFERENCE_GENOME_ANNOTATIONS_ASSERTIONS {
    tag "reference_genome_annotations"
    label 'process_medium'
    conda "${moduleDir}/environment.yml"
    container "ghcr.io/peterkneale/bioassert:1.4.0"

    input:
    path(annotation)
    path(annotation_decompressed)

    output:
    path("*.log"), emit: log
    tuple val("${task.process}"), val('bioassert'), eval('bioassert --version | sed "1!d;s/.* //"'), emit: versions_bioassert, topic: versions

    script:
    """
    cat <<EOF > assertions.txt
    ${annotation} file.exists eq true
    ${annotation} file.size gt 10MB
    ${annotation_decompressed} file.exists eq true
    ${annotation_decompressed} file.size gt 50MB
    EOF
    bioassert \\
        --report-file reference_genome_annotations_assertions.log \\
        run assertions.txt
    """
}
Notes:
- Assertions reference the staged file paths directly. The unquoted heredoc (<<EOF) lets Nextflow interpolate ${annotation} into the real filename, and the grammar accepts dotted names like genome.gff.gz unquoted.
- Global flags such as --report-file are placed before the run subcommand here (they are also accepted after it); the assertions file is the positional argument of run.
<file> <metric> <comparator> <value>
| Metric | Description | Comparators | Value |
|---|---|---|---|
| file.exists | Whether the file exists | eq, ne | boolean |
| file.empty | Whether the file is zero bytes | eq, ne | boolean |
| file.size | File size | eq, ne, lt, lte, gt, gte | size |
| file.lines | Line count | eq, ne, lt, lte, gt, gte | count |
| Metric | Description |
|---|---|
| csv.columns.count | Number of columns in the first (header) row |
| csv.lines.count | Number of lines in the file |
| csv.line.N.column.M | Content of cell at line N, column M (1-indexed) |
| csv.column.N.all | Holds for every cell in column N, header included (1-indexed) |
| csv.column.N.data.all | Holds for every cell in column N, skipping the header row |
Replace csv with tsv (tab-separated) or psv (pipe-separated) for those formats. The column.N.all /
column.N.data.all metrics accept any string comparator (eq, ne, starts, ends, contains, matches) and
pass only when it holds for every checked cell; a header-only or empty file passes vacuously.
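As an illustration of how the csv/tsv/psv prefixes and the 1-indexed addressing fit together, here is a small Python sketch. The delimiter table follows the text above; the helper names are hypothetical, returning 0 columns for an empty file is an assumption, and using the row count for lines.count ignores edge cases such as quoted newlines.

```python
import csv
import io

# Prefix to delimiter, as described above.
DELIMITERS = {"csv": ",", "tsv": "\t", "psv": "|"}


def read_rows(text, fmt):
    """Split delimited text into rows of cells for the given format prefix."""
    return list(csv.reader(io.StringIO(text), delimiter=DELIMITERS[fmt]))


def columns_count(rows):
    """<fmt>.columns.count: number of columns in the first (header) row."""
    # Assumption: an empty file has 0 columns.
    return len(rows[0]) if rows else 0


def cell(rows, line, column):
    """<fmt>.line.N.column.M: both indices are 1-based."""
    return rows[line - 1][column - 1]


rows = read_rows("name|city\nAlice|New York\n", "psv")
print(columns_count(rows), len(rows), cell(rows, 2, 1))
```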
Metrics on a BAM file's SAM header, all under the bam.header.* namespace. Read groups (@RG) are addressed by
0-based index N, following header order. <tag> is a lowercased 2-letter SAM tag.
| Metric | Description | Comparators | Value |
|---|---|---|---|
| bam.header.rg.count | Number of @RG read groups | eq, ne, lt, lte, gt, gte | count |
| bam.header.sq.count | Number of @SQ reference sequences | eq, ne, lt, lte, gt, gte | count |
| bam.header.pg.count | Number of @PG program records | eq, ne, lt, lte, gt, gte | count |
| bam.header.rg.N.<tag> | Tag value of read group N (id, sm, lb, pl, pu, ...) | eq, ne, starts, ends, contains, matches | string |
| bam.header.rg.N.present | Whether read group N exists | eq, ne | boolean |
| bam.header.rg.N.<tag>.present | Whether <tag> is set on read group N | eq, ne | boolean |
| bam.header.hd.vn | @HD format version (VN) | eq, ne, starts, ends, contains, matches | string |
| bam.header.hd.so | @HD sort order (SO), e.g. coordinate | eq, ne, starts, ends, contains, matches | string |
id resolves to the read-group identifier (the @RG ID field); other tags resolve to the read group's
remaining fields. Reading a tag value (or @HD field) that is not set, or a read group whose index is out of
range, is an error; use the .present form to test for presence without erroring. Quote values that contain
dots, dashes or colons ('H0164.2', 'Solexa-272222', '1.6').
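The contrast between reading a tag (which can error) and the .present form (which cannot) can be sketched over a plain-text SAM header. This Python example is illustrative only: it parses header text rather than a binary BAM, the helper names are hypothetical, and returning False from the presence check for an out-of-range read group is an assumption.

```python
def read_groups(header_text):
    """Collect @RG lines as dicts keyed by lowercased 2-letter tag, in header order."""
    groups = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = line.split("\t")[1:]
            # Each field is TAG:VALUE; the value itself may contain colons.
            groups.append(
                {f[:2].lower(): f[3:] for f in fields if len(f) >= 3 and f[2] == ":"}
            )
    return groups


def rg_tag(groups, n, tag):
    """bam.header.rg.N.<tag>: an unset tag or out-of-range index is an error."""
    if n >= len(groups):
        raise LookupError(f"read group {n} is out of range")
    if tag not in groups[n]:
        raise LookupError(f"read group {n} has no {tag} tag")
    return groups[n][tag]


def rg_tag_present(groups, n, tag):
    """bam.header.rg.N.<tag>.present: never errors (assumed False when N is out of range)."""
    return n < len(groups) and tag in groups[n]


header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:H0164.1\tSM:NA12878\tPL:ILLUMINA\n"
    "@RG\tID:H0164.2\tSM:NA12878\n"
)
groups = read_groups(header)
print(rg_tag(groups, 1, "id"), rg_tag_present(groups, 0, "pu"))
```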
Metrics on a FASTA file's records, under the fasta.seq.* namespace for per-record metrics plus the fasta.length
whole-file aggregate. Records are addressed by 0-based index N, following file order.
| Metric | Description | Comparators | Value |
|---|---|---|---|
| fasta.seq.count | Number of sequence records | eq, ne, lt, lte, gt, gte | count |
| fasta.length | Total bases summed across all records | eq, ne, lt, lte, gt, gte | count |
| fasta.seq.N.name | Name (ID) of record N: the first whitespace-delimited header token | eq, ne, starts, ends, contains, matches | string |
| fasta.seq.N.description | Description of record N: header text after the name | eq, ne, starts, ends, contains, matches | string |
| fasta.seq.N.length | Length in bases of record N's sequence | eq, ne, lt, lte, gt, gte | count |
| fasta.seq.N.present | Whether a record exists at index N | eq, ne | boolean |
| fasta.seq.N.description.present | Whether record N has a (non-empty) description | eq, ne | boolean |
name is always present for a record; description is optional. Reading the name, description, or length of a record
whose index is out of range is an error; use the .present form to test for presence without erroring. Quote names
and descriptions that contain dots, dashes, colons or spaces ('NC_000001.11', 'Homo sapiens chromosome 1').
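The name/description split and the two length metrics can be illustrated with a short Python sketch. It follows the definitions in the table above (name is the first whitespace-delimited header token, description is the rest); `parse_fasta` is a hypothetical helper, not bioassert's parser.

```python
def parse_fasta(text):
    """Return one dict per record with its name, description and sequence length."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            description = parts[1].strip() if len(parts) > 1 else ""
            records.append({"name": name, "description": description, "length": 0})
        elif records and line.strip():
            # Sequence lines accumulate into the most recent record.
            records[-1]["length"] += len(line.strip())
    return records


fasta = ">chr1 Homo sapiens chromosome 1\nACGT\nAC\n>chrM\nGG\n"
records = parse_fasta(fasta)

seq_count = len(records)                          # fasta.seq.count
total_length = sum(r["length"] for r in records)  # fasta.length
print(seq_count, total_length, records[0]["name"], records[0]["length"])
```

Here the second record has an empty description, which is the case fasta.seq.N.description.present is meant to detect.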
| Comparator | Meaning | Use with |
|---|---|---|
| eq | equal | any |
| ne | not equal | any |
| lt | less than | size, count |
| lte | less or equal | size, count |
| gt | greater than | size, count |
| gte | greater or equal | size, count |
| starts | starts with | string |
| ends | ends with | string |
| contains | contains | string |
| matches | regex match | string |
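For readers who think in code, the comparator table maps onto ordinary Python operations roughly as follows. This is a sketch, not bioassert's implementation; in particular, treating matches as an unanchored search (so patterns need their own ^ and $) is an assumption suggested by the anchored examples in this document.

```python
import operator
import re

COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "starts": lambda actual, expected: actual.startswith(expected),
    "ends": lambda actual, expected: actual.endswith(expected),
    "contains": lambda actual, expected: expected in actual,
    # Assumption: unanchored search, so anchors must be written in the pattern.
    "matches": lambda actual, pattern: re.search(pattern, actual) is not None,
}


def holds(actual, comparator, expected):
    """Evaluate '<actual> <comparator> <expected>' for already-typed values."""
    return COMPARATORS[comparator](actual, expected)


print(holds(101, "gte", 2), holds("New York", "starts", "New"))
```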
| Type | Examples |
|---|---|
| boolean | true, false |
| size | 5B, 1KB, 2MB, 1GB |
| count | 0, 10, 1000 |
| string | Alice, "New York", 'hello' |