Output Format
This page describes the format and meaning of all BactScout output files.
CSV Output Files
final_summary.csv (Batch Summary)
Generated by the summary and qc commands. This file is a consolidated batch summary with one row per sample. The table below lists the canonical columns and their meanings in the order the code produces them (examples in the repo follow the same ordering).
| Column | Type | Description |
|---|---|---|
| sample_id | String | Sample identifier (extracted from FASTQ filename) |
| a_final_status | String | Overall QC result (PASSED / WARNING / FAILED) |
| adapter_detection_status | String | PASSED/WARNING/FAILED for adapter overrepresentation detection |
| contamination_status | String | PASSED/WARNING/FAILED for contamination assessment |
| species_status | String | PASSED/WARNING/FAILED for species purity/assignment |
| coverage_status | String | PASSED/WARNING/FAILED for primary coverage metric |
| coverage_alt_status | String | PASSED/WARNING/FAILED for any alternate coverage check (if used) |
| duplication_status | String | PASSED/WARNING/FAILED for duplication metric |
| gc_content_status | String | PASSED/WARNING/FAILED for GC content check |
| mlst_status | String | PASSED/WARNING/FAILED for MLST calling and validity |
| n_content_status | String | PASSED/WARNING/FAILED for N-content |
| read_length_status | String | PASSED/WARNING/FAILED for read length checks |
| read_q30_status | String | PASSED/WARNING/FAILED for Q30 metric |
| species | String | Reported species (top or semicolon-separated list) |
| species_abundance | String | Semicolon-separated abundances (%) for reported species |
| species_coverage | String | Semicolon-separated Sylph coverage estimates for reported species |
| species_message | String | Human-readable species summary message (e.g. "Single species detected.") |
| contamination_message | String | Human-readable contamination explanation |
| coverage_estimate_sylph | Float | Sylph-derived coverage estimate for top species (x) |
| coverage_estimate_sylph_message | String | Human-readable coverage decision explanation (Sylph) |
| coverage_estimate_qualibact | Float | Alternate coverage estimate (qualibact: total_bases / expected_genome_size) |
| coverage_estimate_qualibact_message | String | Human-readable message for alternate coverage check |
| duplication_rate | Float | Fraction of duplicate reads (0-1) |
| duplication_message | String | Human-readable duplication explanation |
| gc_content | Float | Sample GC percentage (e.g. 56.93) |
| gc_content_lower | Float | Lower expected GC bound for reference/species (if available) |
| gc_content_upper | Float | Upper expected GC bound for reference/species (if available) |
| gc_content_message | String | GC content diagnostic message |
| n_content_rate | Float | Percentage of bases with ambiguous calls (N) |
| n_content_message | String | Human-readable N-content message |
| mlst_st | String/Integer | MLST sequence type (if available) |
| mlst_message | String | Human-readable MLST message (e.g. "Valid ST found: 530") |
| read1_mean_length | Integer | Mean read length for R1 (bp) |
| read2_mean_length | Integer | Mean read length for R2 (bp) |
| read_length_message | String | Diagnostic message for read length checks |
| read_q20_bases | Integer | Number of bases with Q≥20 |
| read_q20_rate | Float | Fraction (0-1) of bases with Q≥20 |
| read_q30_bases | Integer | Number of bases with Q≥30 |
| read_q30_rate | Float | Fraction (0-1) of bases with Q≥30 |
| read_q30_message | String | Human-readable Q30 message |
| read_total_bases | Integer | Total number of bases processed |
| read_total_reads | Integer | Total number of reads processed |
| adapter_detection_message | String | Human-readable adapter detection message |
| ref_genome | String | Reference genome accession extracted from genome_file_path (e.g. GCF_000742135.1). Prefer this field for programmatic reference. |
| genome_file_path | String | Path to the reference genome file used (if any) |
| genome_size_expected | Float | Expected genome size (bp) from metrics database (may be float) |
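
The alternate coverage estimate in the table is described as total bases divided by expected genome size. A minimal sketch of that arithmetic follows; the function name `estimate_coverage` and the input values are illustrative assumptions, not taken from the BactScout source.

```python
def estimate_coverage(total_bases: int, expected_genome_size: float) -> float:
    """Return fold coverage as total sequenced bases over expected genome size."""
    if expected_genome_size <= 0:
        raise ValueError("expected_genome_size must be positive")
    return total_bases / expected_genome_size


# Hypothetical values: 500 Mbp of reads against a 5 Mbp genome.
coverage = estimate_coverage(500_000_000, 5_000_000.0)
print(f"{coverage:.1f}x")  # prints "100.0x"
```
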
Note: the summary also includes many per-sample message fields (*_message) and alternate metrics (coverage_alt_*, GC bounds), which provide human-readable diagnostics alongside the numeric values. If you need a programmatic contract, use the canonical header produced by the qc/collect code path; see thread.blank_sample_results() and write_summary_file() in the source for the authoritative ordering.
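
As an illustration of consuming the summary programmatically, the sketch below reads rows by column name with Python's standard csv module, lists samples whose a_final_status is not PASSED, and splits the semicolon-separated species fields. The helper names and the two-sample excerpt are made up for this example; a real file would contain all the columns listed above.

```python
import csv
import io


def load_summary(handle):
    """Parse a summary CSV into a list of dict rows keyed by column name."""
    return list(csv.DictReader(handle))


def split_multi(value):
    """Split a semicolon-separated field into a list of stripped items."""
    return [item.strip() for item in value.split(";") if item.strip()]


def samples_needing_review(rows):
    """Return sample IDs whose overall status is not PASSED."""
    return [row["sample_id"] for row in rows if row["a_final_status"] != "PASSED"]


# Made-up excerpt using only a handful of the documented columns.
example = io.StringIO(
    "sample_id,a_final_status,species,species_abundance\n"
    "S1,PASSED,Klebsiella pneumoniae,100.0\n"
    "S2,FAILED,Escherichia coli;Klebsiella pneumoniae,60.5;39.5\n"
)
rows = load_summary(example)
flagged = samples_needing_review(rows)

# Pair each reported species with its abundance for the second sample.
species_s2 = dict(
    zip(
        split_multi(rows[1]["species"]),
        map(float, split_multi(rows[1]["species_abundance"])),
    )
)
print(flagged)
print(species_s2)
```

Looking columns up by header name rather than by position keeps a script working even if the column order differs between versions.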
Resource Monitoring (Optional)
If the --report-resources flag is enabled during qc or collect, the following additional columns are included in the output CSV files to report resource usage statistics:
| Column | Type | Description |
|---|---|---|
| resource_threads_peak | Integer | Peak number of threads used |
| resource_memory_peak_mb | Float | Peak memory usage in MB |
| resource_memory_avg_mb | Float | Average memory usage in MB |
| resource_duration_sec | Float | Total analysis duration in seconds |
See Quality Control Guide for interpretation guidelines.
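
Because the resource columns are optional, a script that aggregates them should tolerate their absence. The following sketch, with a hypothetical helper `summarize_resources` and invented values, computes the highest peak memory and the summed duration across a batch, and returns None when the columns are missing.

```python
import csv
import io


def summarize_resources(rows):
    """Return batch-level resource statistics, or None if the columns are absent."""
    rows = [r for r in rows if r.get("resource_memory_peak_mb")]
    if not rows:
        return None
    return {
        "max_memory_peak_mb": max(float(r["resource_memory_peak_mb"]) for r in rows),
        "total_duration_sec": sum(float(r["resource_duration_sec"]) for r in rows),
    }


# Invented two-sample excerpt containing only the columns used here.
with_resources = io.StringIO(
    "sample_id,resource_memory_peak_mb,resource_duration_sec\n"
    "S1,1024.5,120.0\n"
    "S2,2048.0,90.5\n"
)
without_resources = io.StringIO("sample_id,a_final_status\nS1,PASSED\n")

stats = summarize_resources(list(csv.DictReader(with_resources)))
no_stats = summarize_resources(list(csv.DictReader(without_resources)))
print(stats)
print(no_stats)
```
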
Per-sample artifacts
When the pipeline runs on a single sample, it writes a set of per-sample artifacts to the sample output folder (e.g. output/<sample_id>/). The common files and their purposes are:
- <sample_id>.fastp.json: Machine-readable fastp JSON report parsed by the pipeline to populate summary fields.
- <sample_id>_summary.csv: Single-row CSV summary for the sample. Contains the canonical fields (see table above) and resource metrics when --report-resources is used.
- mlst.tsv: Tab-separated MLST results (stringMLST) for the sample; contains ST and any per-locus notes.
- sylph_report.txt: Raw Sylph species-detection output used to harvest top species, abundances and Sylph coverage estimates.
- sylph_errors.log: Any Sylph warnings or errors captured while processing the sample.
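
To check that a sample folder is complete, the file names listed above can be turned into expected paths. This is a minimal sketch that assumes the output/<sample_id>/ layout given as an example; the helper names and the dictionary keys are hypothetical and not part of BactScout.

```python
import tempfile
from pathlib import Path


def sample_artifacts(output_dir, sample_id):
    """Map a short label to the expected path of each per-sample artifact."""
    sample_dir = Path(output_dir) / sample_id
    return {
        "fastp_json": sample_dir / f"{sample_id}.fastp.json",
        "summary_csv": sample_dir / f"{sample_id}_summary.csv",
        "mlst_tsv": sample_dir / "mlst.tsv",
        "sylph_report": sample_dir / "sylph_report.txt",
        "sylph_errors": sample_dir / "sylph_errors.log",
    }


def missing_artifacts(output_dir, sample_id):
    """Return the sorted labels of expected artifacts that do not exist on disk."""
    artifacts = sample_artifacts(output_dir, sample_id)
    return sorted(label for label, path in artifacts.items() if not path.exists())


# Demonstration in a throwaway directory where only mlst.tsv has been written.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / "S1"
    sample_dir.mkdir()
    (sample_dir / "mlst.tsv").write_text("")
    missing = missing_artifacts(tmp, "S1")

print(missing)
```
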