
Configuration

BactScout uses a YAML configuration file to define analysis parameters and thresholds.

Default Configuration

The default bactscout_config.yml (project root) is the single source of truth for thresholds, database locations, and system requirements. Below is the current configuration shipped with this repository:

bactscout_dbs_path: 'bactscout_dbs'
sylph_db: 'gtdb-r226-c1000-dbv1.syldb'
metrics_file: 'filtered_metrics.csv'
sylph_db_url: 'http://faust.compbio.cs.cmu.edu/sylph-stuff/gtdb-r226-c1000-dbv1.syldb'

# QC Thresholds - Support both WARN and FAIL levels
# Coverage thresholds (in x)
coverage_warn_threshold: 30
coverage_fail_threshold: 20

# Contamination thresholds (% of top species, lower is more contaminated)
contamination_warn_threshold: 5
contamination_fail_threshold: 10

# Q30 thresholds (fraction of bases >= Q30)
q30_warn_threshold: 0.80
q30_fail_threshold: 0.70

# Read length thresholds (mean read length in bp)
read_length_warn_threshold: 80
read_length_fail_threshold: 100

# Duplication rate thresholds (fraction of duplicate reads)
duplication_warn_threshold: 0.20
duplication_fail_threshold: 0.30

# GC content failure threshold (% of reads with unexpected GC content)
gc_fail_percentage: 5

# N-content threshold (fraction of reads with too many N's)
n_content_threshold: 0.001
# Adapter overrepresented sequences threshold (number of overrepresented sequences)
adapter_overrep_threshold: 5

mlst_species:
  escherichia_coli: 'Escherichia coli#1'
  salmonella_enterica: 'Salmonella enterica'
  klebsiella_pneumoniae: 'Klebsiella pneumoniae'
  acinetobacter_baumannii: 'Acinetobacter baumannii#1'
  pseudomonas_aeruginosa: 'Pseudomonas aeruginosa'
system_resources:
  cpus: 2
  memory: 4.GB

Custom Configuration

Create a custom config and pass it to BactScout:

cp bactscout_config.yml my_config.yml
# Edit my_config.yml as needed
pixi run bactscout qc data/ -c my_config.yml

Configuration Parameters

Database Settings

Parameter | Default | Description
--- | --- | ---
bactscout_dbs_path | bactscout_dbs | Directory for storing reference databases
sylph_db | gtdb-r226-c1000-dbv1.syldb | Sylph GTDB database filename
metrics_file | filtered_metrics.csv | Species-specific genome size and GC content metrics
sylph_db_url | [See config] | URL to download Sylph database if not found

Using alternative Sylph databases

Sylph provides a set of pre-built reference databases that trade off sensitivity, specificity, and runtime (see the Sylph pre-built databases page).

  • The default database shipped in BactScout (gtdb-r226-c1000-dbv1.syldb) is compact and fast, and works well for many routine surveillance tasks, but it may be smaller than the largest available references and therefore slightly less sensitive for rare or unusual species.
  • Larger, more comprehensive Sylph databases (for example full GTDB builds or RefSeq-style databases) include many more taxa and yield higher sensitivity at the cost of increased disk usage, higher memory requirements, and longer profiling runtimes.

If you want to use an alternative Sylph database, update these fields in bactscout_config.yml:

sylph_db: 'my-large-db.syldb'
sylph_db_url: 'https://example.org/path/to/my-large-db.syldb'
bactscout_dbs_path: '/path/to/local/dbs'

Notes and recommendations:

  • You can point sylph_db_url at any HTTP(S)-accessible .syldb file; the preflight command will try to download it into bactscout_dbs_path when missing.
  • Expect larger .syldb files to require more disk space (tens to hundreds of GB for very large RefSeq/GTDB builds) and more RAM during profiling. Test on a small subset first to measure performance impact.
  • If you are running BactScout inside containers, ensure the database path is mounted into the container and that permissions allow the process to read the file.
  • If you need the highest sensitivity for taxonomic assignment, pick one of the larger pre-built databases from the Sylph docs, but accept that profiling will take longer.

Example: switch to a larger GTDB build (pseudo-URL):

sylph_db: 'gtdb-r226-full.syldb'
sylph_db_url: 'https://sylph-docs.github.io/pre-built-databases/gtdb-r226-full.syldb'

Use the pixi run bactscout preflight command after updating the config to validate the database download and ensure tool availability.
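As a rough illustration of how the three database fields fit together, the sketch below joins bactscout_dbs_path and sylph_db into a file path and downloads from sylph_db_url only when that file is missing. The function names resolve_sylph_db and ensure_sylph_db are hypothetical and are not part of BactScout; the actual preflight implementation may differ.

```python
from pathlib import Path
from urllib.request import urlretrieve


def resolve_sylph_db(config: dict) -> Path:
    """Join bactscout_dbs_path and sylph_db into the expected database path."""
    return Path(config["bactscout_dbs_path"]) / config["sylph_db"]


def ensure_sylph_db(config: dict) -> Path:
    """Download the database from sylph_db_url only if it is not already on disk.

    Hypothetical helper: it mirrors the behaviour described for preflight,
    not the actual BactScout code.
    """
    db_path = resolve_sylph_db(config)
    if not db_path.exists():
        db_path.parent.mkdir(parents=True, exist_ok=True)
        # Large .syldb files can take a long time to download.
        urlretrieve(config["sylph_db_url"], str(db_path))
    return db_path


config = {
    "bactscout_dbs_path": "/path/to/local/dbs",
    "sylph_db": "my-large-db.syldb",
    "sylph_db_url": "https://example.org/path/to/my-large-db.syldb",
}
print(resolve_sylph_db(config))
```

Checking the resolved path by hand before a long run is a cheap way to confirm that a container mount or a custom bactscout_dbs_path points where you expect.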

Quality Control Thresholds

Many thresholds are expressed as WARN/FAIL pairs; tests and the CLI use these separately to produce PASS/WARNING/FAIL decisions.

Parameter | Default | Description
--- | --- | ---
coverage_warn_threshold | 30 | Coverage (×) above which samples are considered OK; below it, samples receive a WARNING
coverage_fail_threshold | 20 | Coverage (×) below which samples are considered FAIL
contamination_warn_threshold | 5 | Contamination (%) WARN threshold (percent of reads not from dominant species)
contamination_fail_threshold | 10 | Contamination (%) FAIL threshold
q30_warn_threshold | 0.80 | Fraction of bases with Q ≥ 30 for WARN
q30_fail_threshold | 0.70 | Fraction of bases with Q ≥ 30 for FAIL
read_length_warn_threshold | 80 | Mean read length (bp) WARN threshold
read_length_fail_threshold | 100 | Mean read length (bp) FAIL threshold
duplication_warn_threshold | 0.20 | Fraction of duplicate reads for WARN
duplication_fail_threshold | 0.30 | Fraction of duplicate reads for FAIL
gc_fail_percentage | 5 | Species GC ranges are determined via QualiBact and are used to WARN. This option controls the additional margin (%) beyond those ranges at which unexpected GC content triggers FAIL
n_content_threshold | 0.001 | Fraction of reads with too many N's that triggers FAIL
adapter_overrep_threshold | 5 | Number of overrepresented adapters before warning/failing
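To make the WARN/FAIL pairing concrete, here is a minimal sketch of how one metric could be mapped to PASS/WARNING/FAIL. The classify function is hypothetical and is not taken from the BactScout source. It assumes that coverage is a "higher is better" metric and duplication a "lower is better" one, and that values exactly on a threshold are not penalised.

```python
def classify(value: float, warn: float, fail: float, higher_is_better: bool) -> str:
    """Return PASS, WARNING, or FAIL for one metric given its WARN/FAIL pair.

    Assumption: comparisons are strict, so a value equal to a threshold
    is treated as meeting it.
    """
    if higher_is_better:
        if value < fail:
            return "FAIL"
        if value < warn:
            return "WARNING"
        return "PASS"
    if value > fail:
        return "FAIL"
    if value > warn:
        return "WARNING"
    return "PASS"


# Coverage with the default pair (warn 30, fail 20).
print(classify(45, warn=30, fail=20, higher_is_better=True))   # PASS
print(classify(25, warn=30, fail=20, higher_is_better=True))   # WARNING
print(classify(15, warn=30, fail=20, higher_is_better=True))   # FAIL

# Duplication with the default pair (warn 0.20, fail 0.30).
print(classify(0.25, warn=0.20, fail=0.30, higher_is_better=False))  # WARNING
```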

Other QC applied (auto-determined)

In addition to the thresholds above, BactScout applies several QC checks that are automatically determined from species assignment and the metrics database:

  • Genome size — obtained from the metrics_file (QualiBact-derived values) and the predicted species from Sylph; used to compute an estimated coverage when Sylph-derived coverage is unavailable.
  • GC ranges — species-specific GC lower/upper bounds are read from the QualiBact-derived metrics file. A sample GC within the species bounds is considered PASS; values slightly outside the bounds trigger a WARNING; values outside the bounds plus the configured gc_fail_percentage buffer are flagged as FAIL.

These values are inferred at runtime (species + metrics) and do not require manual setting in the config.
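The two auto-determined checks can be sketched as follows. Both functions are hypothetical illustrations, not BactScout code. The sketch assumes that gc_fail_percentage is applied as an absolute number of GC percentage points added to each bound, and that estimated coverage is total sequenced bases divided by the expected genome size. The GC bounds in the example are made up for illustration and do not come from the QualiBact metrics file.

```python
def gc_status(sample_gc: float, gc_lower: float, gc_upper: float,
              gc_fail_percentage: float) -> str:
    """Classify a sample's GC content against species-specific bounds.

    Assumption: the FAIL buffer is an absolute margin in percentage points.
    """
    if gc_lower <= sample_gc <= gc_upper:
        return "PASS"
    if gc_lower - gc_fail_percentage <= sample_gc <= gc_upper + gc_fail_percentage:
        return "WARNING"
    return "FAIL"


def estimated_coverage(total_bases: int, genome_size: int) -> float:
    """Estimate depth of coverage as total sequenced bases over genome size."""
    return total_bases / genome_size


# Illustrative bounds of 50.0 to 51.0 % GC with the default buffer of 5.
print(gc_status(50.5, 50.0, 51.0, gc_fail_percentage=5))  # PASS
print(gc_status(53.0, 50.0, 51.0, gc_fail_percentage=5))  # WARNING
print(gc_status(57.0, 50.0, 51.0, gc_fail_percentage=5))  # FAIL

# 250 Mbp of reads against a 5 Mbp genome.
print(estimated_coverage(250_000_000, 5_000_000))  # 50.0
```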

MLST Species

Define species with available MLST schemes:

mlst_species:
  escherichia_coli: 'Escherichia coli#1'      # Species directory: escherichia_coli
  salmonella_enterica: 'Salmonella enterica'   # Species directory: salmonella_enterica

The key is used as the database directory name; the value is the scientific name used for species matching.

For a complete list of PUBMLST-format names supported by BactScout, see the dedicated page and the raw list included with the docs.

Use the exact PUBMLST-format key as the directory name under bactscout_dbs/ when adding MLST databases.

To add MLST support for a new species:

  1. Prepare MLST databases following PUBMLST format.
  2. Add the species to the config:

mlst_species:
  my_species: 'Genus species'

  3. Place the databases in bactscout_dbs/my_species/.
  4. Run BactScout normally.
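The key/value convention for mlst_species can be illustrated with a small lookup. The function find_mlst_directory is hypothetical, and so is the matching rule: the sketch assumes that a trailing '#<n>' in the value selects a scheme variant and is ignored when comparing against the predicted species name. BactScout's own matching may be stricter or looser.

```python
from typing import Optional


def find_mlst_directory(predicted_species: str, mlst_species: dict) -> Optional[str]:
    """Return the config key (database directory name) for a species, or None."""
    target = predicted_species.strip().lower()
    for directory, scheme_name in mlst_species.items():
        # Assumption: text after '#' names a scheme variant, not the species.
        species_name = scheme_name.split("#", 1)[0].strip().lower()
        if species_name == target:
            return directory
    return None


mlst_species = {
    "escherichia_coli": "Escherichia coli#1",
    "salmonella_enterica": "Salmonella enterica",
}

print(find_mlst_directory("Escherichia coli", mlst_species))   # escherichia_coli
print(find_mlst_directory("Bacillus subtilis", mlst_species))  # None
```

A species that returns None here simply has no configured scheme, which is the case the steps for adding a new species address.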

System Resources

system_resources:
  cpus: 2              # Minimum CPUs required
  memory: 4.GB         # Minimum memory required
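The memory value uses a "number.unit" form. One way to interpret it is sketched below; parse_memory and has_enough_cpus are hypothetical helpers, and treating GB as 1024**3 bytes is an assumption, since the page does not say how BactScout parses this field.

```python
import os
import re

# Assumption: binary multiples (1 GB = 1024**3 bytes).
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_memory(value: str) -> int:
    """Convert a string such as '4.GB' or '4 GB' into a number of bytes."""
    match = re.fullmatch(
        r"(\d+(?:\.\d+)?)\.?\s*(KB|MB|GB|TB)", value.strip(), flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"Unrecognised memory value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


def has_enough_cpus(required: int) -> bool:
    """Compare the configured minimum against the CPUs visible to Python."""
    return (os.cpu_count() or 1) >= required


print(parse_memory("4.GB"))  # 4294967296
print(has_enough_cpus(2))
```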

Adjusting Thresholds

Example: relaxed thresholds for low-depth studies

coverage_warn_threshold: 20  # Instead of 30x
q30_warn_threshold: 0.75  # Instead of 0.80 (75% instead of 80%)
read_length_fail_threshold: 80  # Instead of 100 bp
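When overriding thresholds, it can help to confirm that each WARN/FAIL pair is still ordered sensibly, so that the WARNING band does not invert. The checker below is a hypothetical sketch and is not part of BactScout. It covers only a subset of metrics and assumes coverage and Q30 are "higher is better" while contamination and duplication are "lower is better".

```python
# Assumed direction for each metric prefix used in the config keys.
HIGHER_IS_BETTER = {
    "coverage": True,
    "q30": True,
    "contamination": False,
    "duplication": False,
}


def check_threshold_pairs(config: dict) -> list:
    """Return a list of human-readable problems found in WARN/FAIL pairs."""
    problems = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        warn = config.get(f"{metric}_warn_threshold")
        fail = config.get(f"{metric}_fail_threshold")
        if warn is None or fail is None:
            continue  # Pair not overridden in this config; nothing to check.
        ordered = warn >= fail if higher_is_better else warn <= fail
        if not ordered:
            problems.append(
                f"{metric}: warn={warn} and fail={fail} are in the wrong order"
            )
    return problems


defaults = {
    "coverage_warn_threshold": 30, "coverage_fail_threshold": 20,
    "contamination_warn_threshold": 5, "contamination_fail_threshold": 10,
    "q30_warn_threshold": 0.80, "q30_fail_threshold": 0.70,
    "duplication_warn_threshold": 0.20, "duplication_fail_threshold": 0.30,
}
print(check_threshold_pairs(defaults))  # []

# Lowering only the WARN value below the FAIL value is reported.
print(check_threshold_pairs({**defaults, "coverage_warn_threshold": 10}))
```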

Need Help?

See Troubleshooting for help with configuration issues.