Configuration
BactScout uses a YAML configuration file to define analysis parameters and thresholds.
Default Configuration
The default bactscout_config.yml (project root) is the single source of truth for
thresholds, database locations, and system requirements. Below is the current
configuration shipped with this repository:
bactscout_dbs_path: 'bactscout_dbs'
sylph_db: 'gtdb-r226-c1000-dbv1.syldb'
metrics_file: 'filtered_metrics.csv'
sylph_db_url: 'http://faust.compbio.cs.cmu.edu/sylph-stuff/gtdb-r226-c1000-dbv1.syldb'
# QC Thresholds - Support both WARN and FAIL levels
# Coverage thresholds (in x)
coverage_warn_threshold: 30
coverage_fail_threshold: 20
# Contamination thresholds (% of top species, lower is more contaminated)
contamination_warn_threshold: 5
contamination_fail_threshold: 10
# Q30 thresholds (fraction of bases >= Q30)
q30_warn_threshold: 0.80
q30_fail_threshold: 0.70
# Read length thresholds (mean read length in bp)
read_length_warn_threshold: 80
read_length_fail_threshold: 100
# Duplication rate thresholds (fraction of duplicate reads)
duplication_warn_threshold: 0.20
duplication_fail_threshold: 0.30
# GC content failure threshold (% of reads with unexpected GC content)
gc_fail_percentage: 5
# N-content threshold (fraction of reads with too many N's)
n_content_threshold: 0.001
# Adapter overrepresented sequences threshold (number of overrepresented sequences)
adapter_overrep_threshold: 5
mlst_species:
escherichia_coli: 'Escherichia coli#1'
salmonella_enterica: 'Salmonella enterica'
klebsiella_pneumoniae: 'Klebsiella pneumoniae'
acinetobacter_baumannii: 'Acinetobacter baumannii#1'
pseudomonas_aeruginosa: 'Pseudomonas aeruginosa'
system_resources:
cpus: 2
memory: 4.GB
Custom Configuration
Create a custom config and pass it to BactScout:
cp bactscout_config.yml my_config.yml
# Edit my_config.yml as needed
pixi run bactscout qc data/ -c my_config.yml
Configuration Parameters
Database Settings
| Parameter | Default | Description |
|---|---|---|
bactscout_dbs_path |
bactscout_dbs |
Directory for storing reference databases |
sylph_db |
gtdb-r226-c1000-dbv1.syldb |
Sylph GTDB database filename |
metrics_file |
filtered_metrics.csv |
Species-specific genome size and GC content metrics |
sylph_db_url |
[See config] | URL to download Sylph database if not found |
Using alternative Sylph databases
Sylph provides a set of pre-built reference databases targeting different trade-offs of sensitivity, specificity and runtime (see the Sylph pre-built databases page.
- The default database shipped in BactScout (
gtdb-r226-c1000-dbv1.syldb) is compact and fast, and works well for many routine surveillance tasks, but it may be smaller than the largest available references and therefore slightly less sensitive for rare or unusual species. - Larger, more comprehensive Sylph databases (for example full GTDB builds or RefSeq-style databases) include many more taxa and yield higher sensitivity at the cost of increased disk usage, higher memory requirements, and longer profiling runtimes.
If you want to use an alternative Sylph database, update these fields in bactscout_config.yml:
sylph_db: 'my-large-db.syldb'
sylph_db_url: 'https://example.org/path/to/my-large-db.syldb'
bactscout_dbs_path: '/path/to/local/dbs'
Notes and recommendations:
- You can point
sylph_db_urlat any HTTP(S)-accessible.syldbfile; thepreflightcommand will try to download it intobactscout_dbs_pathwhen missing. - Expect larger
.syldbfiles to require more disk space (tens to hundreds of GB for very large RefSeq/GTDB builds) and more RAM during profiling. Test on a small subset first to measure performance impact. - If you are running BactScout inside containers, ensure the database path is mounted into the container and that permissions allow the process to read the file.
- If you need the highest sensitivity for taxonomic assignment, pick one of the larger pre-built databases from the Sylph docs, but accept that profiling will take longer.
Example: switch to a larger GTDB build (pseudo-URL):
sylph_db: 'gtdb-r226-full.syldb'
sylph_db_url: 'https://sylph-docs.github.io/pre-built-databases/gtdb-r226-full.syldb'
Use the pixi run bactscout preflight command after updating the config to validate the database download and ensure tool availability.
Quality Control Thresholds
Many thresholds are expressed as WARN/FAIL pairs; tests and the CLI use these separately to produce PASS/WARNING/FAIL decisions.
| Parameter | Default | Description |
|---|---|---|
coverage_warn_threshold |
30 | Coverage (×) above which samples are considered OK (warning threshold) |
coverage_fail_threshold |
20 | Coverage (×) below which samples are considered FAIL |
contamination_warn_threshold |
5 | Contamination (%) warning threshold (percent of reads not from dominant species) |
contamination_fail_threshold |
10 | Contamination (%) fail threshold |
q30_warn_threshold |
0.80 | Fraction of bases with Q ≥ 30 for WARN |
q30_fail_threshold |
0.70 | Fraction of bases with Q ≥ 30 for FAIL |
read_length_warn_threshold |
80 | Mean read length (bp) WARN threshold |
read_length_fail_threshold |
100 | Mean read length (bp) FAIL threshold |
duplication_warn_threshold |
0.20 | Fraction duplicate reads WARN |
duplication_fail_threshold |
0.30 | Fraction duplicate reads FAIL |
gc_fail_percentage |
5 | The GC ranges are determined via Qualibact for a known species. These values are used to WARN. This option controlls the adjustment to the cutoff percent with unexpected GC content that triggers FAIL |
n_content_threshold |
0.001 | Fraction of reads with too many N's that triggers FAIL |
adapter_overrep_threshold |
5 | Number of overrepresented adapters before warning/failing |
Other QC applied (auto-determined)
In addition to the thresholds above, BactScout applies several QC checks that are automatically determined from species assignment and the metrics database:
- Genome size — obtained from the
metrics_file(QualiBact-derived values) and the predicted species from Sylph; used to compute an estimated coverage when Sylph-derived coverage is unavailable. - GC ranges — species-specific GC lower/upper bounds are read from the QualiBact-derived metrics file. A sample GC within the species bounds is considered PASS; values slightly outside the bounds trigger a WARNING; values outside the bounds plus the configured
gc_fail_percentagebuffer are flagged as FAIL.
These values are inferred at runtime (species + metrics) and do not require manual setting in the config.
MLST Species
Define species with available MLST schemes:
mlst_species:
escherichia_coli: 'Escherichia coli#1' # Species directory: escherichia_coli
salmonella_enterica: 'Salmonella enterica' # Species directory: salmonella_enterica
The key is used as the database directory name, the value is the scientific name for species matching.
For a complete list of PUBMLST-format names supported by BactScout, see the dedicated page and raw list included with the docs:
Use the exact PUBMLST-format key as the directory name under
bactscout_dbs/ when adding MLST databases.
To add MLST support for a new species: 1. Prepare MLST databases following PUBMLST format 2. Add to config:
mlst_species:
my_species: 'Genus species'
bactscout_dbs/my_species/
4. Run BactScout normally
System Resources
system_resources:
cpus: 2 # Minimum CPUs required
memory: 4.GB # Minimum memory required
Adjusting Thresholds
Example: Lower coverage threshold for low-depth studies
coverage_threshold: 20 # Instead of 30x
q30_pass_threshold: 0.75 # Instead of 0.80 (75% instead of 80%)
read_length_pass_threshold: 80 # Instead of 100 bp
Need Help?
See Troubleshooting for help with configuration issues.