Output Format

This page describes the format and meaning of all BactScout output files.

CSV Output Files

`final_summary.csv` (Batch Summary)

Generated by the summary command and the qc command. This is a one-row-per-sample consolidated batch summary. The table below lists the canonical columns and their meanings in the order produced by the code (examples in the repo follow this ordering).

Column	Type	Description
`sample_id`	String	Sample identifier (extracted from FASTQ filename)
`a_final_status`	String	Overall QC result (PASSED / WARNING / FAILED)
`adapter_detection_status`	String	PASSED/WARNING/FAILED for adapter overrepresentation detection
`contamination_status`	String	PASSED/WARNING/FAILED for contamination assessment
`species_status`	String	PASSED/WARNING/FAILED for species purity/assignment
`coverage_status`	String	PASSED/WARNING/FAILED for primary coverage metric
`coverage_alt_status`	String	PASSED/WARNING/FAILED for any alternate coverage check (if used)
`duplication_status`	String	PASSED/WARNING/FAILED for duplication metric
`gc_content_status`	String	PASSED/WARNING/FAILED for GC content check
`mlst_status`	String	PASSED/WARNING/FAILED for MLST calling and validity
`n_content_status`	String	PASSED/WARNING/FAILED for N-content
`read_length_status`	String	PASSED/WARNING/FAILED for read length checks
`read_q30_status`	String	PASSED/WARNING/FAILED for Q30 metric
`species`	String	Reported species (top or semicolon-separated list)
`species_abundance`	String	Semicolon-separated abundances (%) for reported species
`species_coverage`	String	Semicolon-separated Sylph coverage estimates for reported species
`species_message`	String	Human-readable species summary message (e.g. "Single species detected.")
`contamination_message`	String	Human-readable contamination explanation
`coverage_estimate_sylph`	Float	Sylph-derived coverage estimate for top species (x)
`coverage_estimate_sylph_message`	String	Human-readable coverage decision explanation (Sylph)
`coverage_estimate_qualibact`	Float	Alternate (QualiBact) coverage estimate, computed as total bases / expected genome size (x)
`coverage_estimate_qualibact_message`	String	Human-readable message for alternate coverage check
`duplication_rate`	Float	Fraction of duplicate reads (0-1)
`duplication_message`	String	Human-readable duplication explanation
`gc_content`	Float	Sample GC percentage (e.g. 56.93)
`gc_content_lower`	Float	Lower expected GC bound for reference/species (if available)
`gc_content_upper`	Float	Upper expected GC bound for reference/species (if available)
`gc_content_message`	String	GC content diagnostic message
`n_content_rate`	Float	Percentage of bases with ambiguous calls (N)
`n_content_message`	String	Human-readable N-content message
`mlst_st`	String/Integer	MLST sequence type (if available)
`mlst_message`	String	Human-readable MLST message (e.g. "Valid ST found: 530")
`read1_mean_length`	Integer	Mean read length for R1 (bp)
`read2_mean_length`	Integer	Mean read length for R2 (bp)
`read_length_message`	String	Diagnostic message for read length checks
`read_q20_bases`	Integer	Number of bases with Q≥20
`read_q20_rate`	Float	Fraction (0-1) of bases with Q≥20
`read_q30_bases`	Integer	Number of bases with Q≥30
`read_q30_rate`	Float	Fraction (0-1) of bases with Q≥30
`read_q30_message`	String	Human-readable Q30 message
`read_total_bases`	Integer	Total number of bases processed
`read_total_reads`	Integer	Total number of reads processed
`adapter_detection_message`	String	Human-readable adapter detection message
`ref_genome`	String	Reference genome accession extracted from `genome_file_path` (e.g. `GCF_000742135.1`). Prefer this field for programmatic reference.
`genome_file_path`	String	Path to the reference genome file used (if any)
`genome_size_expected`	Float	Expected genome size (bp) from metrics database (may be float)
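
The `species`, `species_abundance` and `species_coverage` columns may each hold a semicolon-separated list. The sketch below pairs them up per species. It assumes the three lists are in the same order and have the same length, which the table above implies but does not state; `SpeciesHit`, `parse_species_fields` and the example row are hypothetical names and values used only for illustration.

```python
from typing import NamedTuple


class SpeciesHit(NamedTuple):
    name: str
    abundance_pct: float
    sylph_coverage: float


def parse_species_fields(row: dict) -> list[SpeciesHit]:
    """Pair up the semicolon-separated species, abundance and coverage fields."""

    def split(value: str) -> list[str]:
        # Empty cells yield an empty list rather than [""].
        return [part.strip() for part in value.split(";") if part.strip()]

    names = split(row.get("species", ""))
    abundances = split(row.get("species_abundance", ""))
    coverages = split(row.get("species_coverage", ""))

    # Assumption: the three columns are parallel lists of equal length.
    if not (len(names) == len(abundances) == len(coverages)):
        raise ValueError("species fields have mismatched lengths")

    return [
        SpeciesHit(name, float(abundance), float(coverage))
        for name, abundance, coverage in zip(names, abundances, coverages)
    ]


# Hypothetical row, invented for illustration only.
example_row = {
    "species": "Klebsiella pneumoniae;Escherichia coli",
    "species_abundance": "97.5;2.5",
    "species_coverage": "80.2;2.1",
}
hits = parse_species_fields(example_row)
print(hits[0].name, hits[0].abundance_pct)
```

Raising on mismatched lengths is a deliberate choice in this sketch: silently truncating with `zip` would hide a malformed row.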

Note: the summary also contains many per-sample message fields (`*_message`) and alternate metrics (`coverage_alt_*`, GC bounds), which give human-readable diagnostics alongside the numeric values. If you need a programmatic contract, rely on the header written by the qc/collect code path; `thread.blank_sample_results()` and `write_summary_file()` in the source define the authoritative column ordering.
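
For downstream scripts, reading columns by name rather than by position keeps the code independent of column order. The following is a minimal sketch using Python's standard `csv` module, assuming the file is comma-delimited with a header row; `load_summary`, `non_passing_checks` and the two-sample excerpt are hypothetical and only a few of the documented columns are shown.

```python
import csv
import io


def load_summary(handle):
    """Read a summary CSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def non_passing_checks(row):
    """Return the per-check *_status columns that are not PASSED.

    Empty cells are skipped, on the assumption that a blank status means
    the check was not run (for example coverage_alt_status, "if used").
    """
    return {
        column: value
        for column, value in row.items()
        if column
        and column.endswith("_status")
        and column != "a_final_status"
        and value
        and value != "PASSED"
    }


# Invented two-sample excerpt, for illustration only.
example_csv = io.StringIO(
    "sample_id,a_final_status,coverage_status,coverage_alt_status,gc_content_status\n"
    "sampleA,PASSED,PASSED,,PASSED\n"
    "sampleB,FAILED,FAILED,,WARNING\n"
)

for row in load_summary(example_csv):
    if row["a_final_status"] != "PASSED":
        print(row["sample_id"], row["a_final_status"], non_passing_checks(row))
```

With a real file, replace the `io.StringIO` object with `open("final_summary.csv", newline="")`.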

Resource Monitoring (Optional)

If the --report-resources flag is enabled during qc or collect, the following additional columns are included in the output CSV files to report resource usage statistics:

Column	Type	Description
`resource_threads_peak`	Integer	Peak number of threads used
`resource_memory_peak_mb`	Float	Peak memory usage in MB
`resource_memory_avg_mb`	Float	Average memory usage in MB
`resource_duration_sec`	Float	Total analysis duration in seconds
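
Because these columns appear only when `--report-resources` is used, code that reads the CSV should treat them as optional. A minimal sketch, assuming each row has already been loaded as a dict of strings (for example with `csv.DictReader`); `resource_usage` and the example values are hypothetical.

```python
RESOURCE_COLUMNS = (
    "resource_threads_peak",
    "resource_memory_peak_mb",
    "resource_memory_avg_mb",
    "resource_duration_sec",
)


def resource_usage(row):
    """Return the resource metrics as numbers, or None if they are absent.

    A row counts as "without resource metrics" when any of the four
    columns is missing or empty.
    """
    if not all(row.get(column) for column in RESOURCE_COLUMNS):
        return None
    usage = {column: float(row[column]) for column in RESOURCE_COLUMNS}
    # The thread count is documented as an Integer.
    usage["resource_threads_peak"] = int(usage["resource_threads_peak"])
    return usage


# Invented values, for illustration only.
with_metrics = {
    "sample_id": "sampleA",
    "resource_threads_peak": "4",
    "resource_memory_peak_mb": "1024.5",
    "resource_memory_avg_mb": "512.25",
    "resource_duration_sec": "93.7",
}
without_metrics = {"sample_id": "sampleB"}

print(resource_usage(with_metrics))
print(resource_usage(without_metrics))
```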

See Quality Control Guide for interpretation guidelines.

Per-sample artifacts

When the pipeline runs for a single sample it produces a set of per-sample artifacts in the sample output folder (e.g. output/<sample_id>/). Below are the common files and their purpose:

<sample_id>.fastp.json — Machine-readable fastp JSON report parsed by the pipeline to populate summary fields.
<sample_id>_summary.csv — Single-row CSV summary for the sample. Contains the canonical fields (see table above) and resource metrics when --report-resources is used.
mlst.tsv — Tab-separated MLST results (stringMLST) for the sample; contains ST and any per-locus notes.
sylph_report.txt — Raw Sylph species-detection output used to harvest top species, abundances and Sylph coverage estimates.
sylph_errors.log — Any Sylph warnings or errors captured while processing the sample.
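
A quick way to sanity-check a finished run is to confirm that the files listed above exist in the sample folder. The sketch below assumes the `output/<sample_id>/` layout described in this section and that all five files are expected for every sample; some of them (for instance `sylph_errors.log`) may legitimately be absent in practice. `expected_artifacts` and `missing_artifacts` are hypothetical helper names.

```python
from pathlib import Path


def expected_artifacts(sample_id):
    """File names listed in this section for one sample."""
    return [
        f"{sample_id}.fastp.json",
        f"{sample_id}_summary.csv",
        "mlst.tsv",
        "sylph_report.txt",
        "sylph_errors.log",
    ]


def missing_artifacts(output_dir, sample_id):
    """Return the expected file names not found in <output_dir>/<sample_id>/."""
    sample_dir = Path(output_dir) / sample_id
    return [
        name
        for name in expected_artifacts(sample_id)
        if not (sample_dir / name).is_file()
    ]


if __name__ == "__main__":
    # "output" and "sampleA" are placeholder values.
    print(missing_artifacts("output", "sampleA"))
```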
