Output Format

This page describes the format and meaning of all BactScout output files.

CSV Output Files

final_summary.csv (Batch Summary)

Generated by the summary and qc commands, this file is a consolidated batch summary with one row per sample. The table below lists the canonical columns and their meanings in the order the code produces them (the example files in the repo follow the same ordering).

Column Type Description
sample_id String Sample identifier (extracted from FASTQ filename)
a_final_status String Overall QC result (PASSED / WARNING / FAILED)
adapter_detection_status String PASSED/WARNING/FAILED for adapter overrepresentation detection
contamination_status String PASSED/WARNING/FAILED for contamination assessment
species_status String PASSED/WARNING/FAILED for species purity/assignment
coverage_status String PASSED/WARNING/FAILED for primary coverage metric
coverage_alt_status String PASSED/WARNING/FAILED for any alternate coverage check (if used)
duplication_status String PASSED/WARNING/FAILED for duplication metric
gc_content_status String PASSED/WARNING/FAILED for GC content check
mlst_status String PASSED/WARNING/FAILED for MLST calling and validity
n_content_status String PASSED/WARNING/FAILED for N-content
read_length_status String PASSED/WARNING/FAILED for read length checks
read_q30_status String PASSED/WARNING/FAILED for Q30 metric
species String Reported species (top or semicolon-separated list)
species_abundance String Semicolon-separated abundances (%) for reported species
species_coverage String Semicolon-separated Sylph coverage estimates for reported species
species_message String Human-readable species summary message (e.g. "Single species detected.")
contamination_message String Human-readable contamination explanation
coverage_estimate_sylph Float Sylph-derived coverage estimate for top species (x)
coverage_estimate_sylph_message String Human-readable coverage decision explanation (Sylph)
coverage_estimate_qualibact Float Alternate (Qualibact) coverage estimate, based on total bases and expected genome size (total_bases / expected_genome_size)
coverage_estimate_qualibact_message String Human-readable message for alternate coverage check
duplication_rate Float Fraction of duplicate reads (0-1)
duplication_message String Human-readable duplication explanation
gc_content Float Sample GC percentage (e.g. 56.93)
gc_content_lower Float Lower expected GC bound for reference/species (if available)
gc_content_upper Float Upper expected GC bound for reference/species (if available)
gc_content_message String GC content diagnostic message
n_content_rate Float Percentage of bases with ambiguous calls (N)
n_content_message String Human-readable N-content message
mlst_st String/Integer MLST sequence type (if available)
mlst_message String Human-readable MLST message (e.g. "Valid ST found: 530")
read1_mean_length Integer Mean read length for R1 (bp)
read2_mean_length Integer Mean read length for R2 (bp)
read_length_message String Diagnostic message for read length checks
read_q20_bases Integer Number of bases with Q≥20
read_q20_rate Float Fraction (0-1) of bases with Q≥20
read_q30_bases Integer Number of bases with Q≥30
read_q30_rate Float Fraction (0-1) of bases with Q≥30
read_q30_message String Human-readable Q30 message
read_total_bases Integer Total number of bases processed
read_total_reads Integer Total number of reads processed
adapter_detection_message String Human-readable adapter detection message
ref_genome String Reference genome accession extracted from genome_file_path (e.g. GCF_000742135.1). Prefer this field for programmatic reference.
genome_file_path String Path to the reference genome file used (if any)
genome_size_expected Float Expected genome size (bp) from the metrics database (stored as a float, so the value may not be a whole number)
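
The semicolon-separated species columns and the coverage-related columns are easier to work with once converted to typed values. The sketch below is illustrative only: the helper names are hypothetical, it assumes the three species columns list their values in the same order, and the depth formula (total bases divided by expected genome size) is an assumption about how such an estimate is usually computed, not a statement of what BactScout does internally.

```python
def split_species_fields(row):
    """Pair each reported species with its abundance (%) and Sylph coverage.

    Assumes the three columns are positionally aligned (an assumption).
    """
    def parts(column):
        # Split on ';' and drop empty fragments and surrounding whitespace.
        return [item.strip() for item in row[column].split(";") if item.strip()]

    names = parts("species")
    abundances = [float(value) for value in parts("species_abundance")]
    coverages = [float(value) for value in parts("species_coverage")]
    if not (len(names) == len(abundances) == len(coverages)):
        raise ValueError("species columns have different lengths")
    return list(zip(names, abundances, coverages))


def depth_from_total_bases(row):
    """Rough sequencing depth: total bases / expected genome size (assumed formula)."""
    return float(row["read_total_bases"]) / float(row["genome_size_expected"])


# Made-up values, for illustration only (CSV readers return strings).
example_row = {
    "species": "Klebsiella pneumoniae;Escherichia coli",
    "species_abundance": "97.5;2.5",
    "species_coverage": "80.1;2.0",
    "read_total_bases": "550000000",
    "genome_size_expected": "5500000.0",
}

print(split_species_fields(example_row))
print(depth_from_total_bases(example_row))
```

Raising an error on mismatched lengths, rather than silently truncating, makes a malformed row visible early.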

Note: the summary includes per-sample "message" fields (*_message) and alternate metrics (coverage_alt_*, GC bounds), which provide human-readable diagnostics alongside the numeric values. If you need a programmatic contract, use the canonical header produced by the qc/collect code path (see thread.blank_sample_results() / write_summary_file() in the source for the authoritative ordering).
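
As a minimal sketch of consuming the batch summary programmatically, the snippet below groups samples by a_final_status and lists the individual checks that did not pass. It uses only the Python standard library; the function names are hypothetical, the sample data is made up, and the selection of checks relies on the *_status naming shown in the table above.

```python
import csv
import io
from collections import defaultdict


def samples_by_status(csv_file):
    """Group sample IDs by their overall QC result."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["a_final_status"]].append(row["sample_id"])
    return dict(groups)


def non_passing_checks(row):
    """Return the per-check status columns whose value is not PASSED."""
    return {
        column: value
        for column, value in row.items()
        if column.endswith("_status")
        and column != "a_final_status"
        and value != "PASSED"
    }


# Made-up excerpt with a subset of columns, for illustration only.
example_csv = (
    "sample_id,a_final_status,coverage_status,gc_content_status\n"
    "sample_A,PASSED,PASSED,PASSED\n"
    "sample_B,FAILED,FAILED,WARNING\n"
)

print(samples_by_status(io.StringIO(example_csv)))
for row in csv.DictReader(io.StringIO(example_csv)):
    print(row["sample_id"], non_passing_checks(row))
```

With a real file, pass an open handle instead, for example `open("final_summary.csv", newline="")`.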

Resource Monitoring (Optional)

If the --report-resources flag is enabled during qc or collect, the following additional columns are included in the output CSV files to report resource usage statistics:

Column Type Description
resource_threads_peak Integer Peak number of threads used
resource_memory_peak_mb Float Peak memory usage in MB
resource_memory_avg_mb Float Average memory usage in MB
resource_duration_sec Float Total analysis duration in seconds

See Quality Control Guide for interpretation guidelines.
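
Because the resource columns only appear when --report-resources is enabled, code that reads the CSV should treat them as optional. The following is a small sketch under that assumption; the helper name and the example values are hypothetical.

```python
# Column names from the Resource Monitoring table, mapped to a converter.
# int(float(...)) tolerates a thread count written as "4" or "4.0".
RESOURCE_COLUMNS = {
    "resource_threads_peak": lambda value: int(float(value)),
    "resource_memory_peak_mb": float,
    "resource_memory_avg_mb": float,
    "resource_duration_sec": float,
}


def resource_usage(row):
    """Return typed resource metrics, or None if the columns are absent."""
    if not all(column in row for column in RESOURCE_COLUMNS):
        return None
    return {column: convert(row[column]) for column, convert in RESOURCE_COLUMNS.items()}


# Made-up rows, for illustration only.
with_resources = {
    "sample_id": "sample_A",
    "resource_threads_peak": "4",
    "resource_memory_peak_mb": "1024.5",
    "resource_memory_avg_mb": "512.25",
    "resource_duration_sec": "93.7",
}
without_resources = {"sample_id": "sample_B"}

print(resource_usage(with_resources))
print(resource_usage(without_resources))
```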

Per-sample artifacts

When the pipeline runs for a single sample it produces a set of per-sample artifacts in the sample output folder (e.g. output/<sample_id>/). Below are the common files and their purpose:

  • <sample_id>.fastp.json — Machine-readable fastp JSON report parsed by the pipeline to populate summary fields.
  • <sample_id>_summary.csv — Single-row CSV summary for the sample. Contains the canonical fields (see table above) and resource metrics when --report-resources is used.
  • mlst.tsv — Tab-separated MLST results (stringMLST) for the sample; contains ST and any per-locus notes.
  • sylph_report.txt — Raw Sylph species-detection output used to harvest top species, abundances and Sylph coverage estimates.
  • sylph_errors.log — Any Sylph warnings or errors captured while processing the sample.
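
For ad hoc inspection, the per-sample summaries can be gathered with a few lines of Python. This sketch assumes the output/<sample_id>/<sample_id>_summary.csv layout described above and a hypothetical helper name; it is not the implementation of the summary command, which remains the supported way to build final_summary.csv.

```python
import csv
from pathlib import Path


def collect_sample_rows(output_dir):
    """Read every <sample_id>/<sample_id>_summary.csv under output_dir.

    Returns a list of dicts, one per CSV row, in sorted path order.
    """
    rows = []
    for path in sorted(Path(output_dir).glob("*/*_summary.csv")):
        # newline="" is the mode recommended by the csv module for reading.
        with path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

For example, `collect_sample_rows("output")` returns the rows for every sample folder found under output/, with all values as strings.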