Configuration Reference
Complete reference for all BactScout configuration options.
Configuration File Location
BactScout looks for configuration in this order:
- File specified with -c flag: pixi run bactscout qc data/ -c /path/to/config.yml
- bactscout_config.yml in current directory
- Built-in defaults
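The lookup order above can be sketched in a few lines of Python. This is an illustration only, not BactScout's actual code; the function name resolve_config_path and the convention of returning None to mean "use built-in defaults" are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# File name taken from the lookup order described above.
DEFAULT_CONFIG_NAME = "bactscout_config.yml"


def resolve_config_path(cli_path: Optional[str], cwd: Path) -> Optional[Path]:
    """Return the config file to load, or None to fall back to built-in defaults.

    Hypothetical helper: mirrors the documented order, nothing more.
    """
    if cli_path is not None:
        # 1. A path given with -c takes priority.
        return Path(cli_path)
    candidate = cwd / DEFAULT_CONFIG_NAME
    if candidate.is_file():
        # 2. bactscout_config.yml in the current directory.
        return candidate
    # 3. Nothing found: the caller uses built-in defaults.
    return None
```

Calling resolve_config_path(None, Path.cwd()) in a directory without bactscout_config.yml returns None in this sketch.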
Complete Configuration Example
# Database Configuration
bactscout_dbs_path: 'bactscout_dbs'
sylph_db: 'gtdb-r226-c1000-dbv1.syldb'
metrics_file: 'filtered_metrics.csv'
sylph_db_url: 'https://example.com/database.syldb'
# Quality Control Thresholds
coverage_threshold: 30 # Minimum coverage (x-fold)
contamination_threshold: 10 # Maximum contamination (%)
q30_pass_threshold: 0.80 # Minimum Q30% (0.0-1.0)
read_length_pass_threshold: 100 # Minimum read length (bp)
# MLST Species Configuration
mlst_species:
  escherichia_coli: 'Escherichia coli#1'
  salmonella_enterica: 'Salmonella enterica'
  klebsiella_pneumoniae: 'Klebsiella pneumoniae'
  acinetobacter_baumannii: 'Acinetobacter baumannii#1'
  pseudomonas_aeruginosa: 'Pseudomonas aeruginosa'
# System Resources
system_resources:
  cpus: 2
  memory: 4.GB
Configuration Parameters
Database Settings
bactscout_dbs_path
- Type: string
- Default: 'bactscout_dbs'
- Description: Directory path for storing reference databases
- Example: './databases' or '/opt/bactscout_dbs'
sylph_db
- Type: string
- Default: 'gtdb-r226-c1000-dbv1.syldb'
- Description: Filename of Sylph GTDB database
- Note: Must exist in bactscout_dbs_path
metrics_file
- Type: string
- Default: 'filtered_metrics.csv'
- Description: CSV file with species genome metrics (size, GC%)
- Location: {bactscout_dbs_path}/{metrics_file}
sylph_db_url
- Type: string
- Default: 'https://...sylph.syldb' (built-in)
- Description: URL to download Sylph database if not found
- Note: Optional, databases auto-download on first run
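As a rough illustration of how the database settings combine into file paths, the sketch below joins bactscout_dbs_path with sylph_db and metrics_file and reports which files are absent. The helper names expected_db_files and missing_db_files are hypothetical, and the config is assumed to be an already-parsed dictionary.

```python
from pathlib import Path


def expected_db_files(config: dict) -> dict:
    """Build the expected database file paths from the documented settings."""
    # Defaults below are the documented defaults for each key.
    base = Path(config.get("bactscout_dbs_path", "bactscout_dbs"))
    return {
        "sylph_db": base / config.get("sylph_db", "gtdb-r226-c1000-dbv1.syldb"),
        "metrics_file": base / config.get("metrics_file", "filtered_metrics.csv"),
    }


def missing_db_files(config: dict) -> list:
    """Return the setting names whose files do not exist on disk."""
    return [name for name, path in expected_db_files(config).items()
            if not path.is_file()]
```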
Quality Control Thresholds
coverage_threshold
- Type: integer
- Default: 30
- Range: 1-1000
- Unit: x-fold depth
- Description: Minimum required sequencing coverage
- Impact on PASS/FAIL: Sample fails if coverage < threshold
Recommendations:
- 20 - Lenient, exploratory studies
- 30 - Standard, most applications
- 50+ - Strict, critical applications
q30_pass_threshold
- Type: float
- Default: 0.80
- Range: 0.0-1.0
- Unit: Fraction (0.80 = 80%)
- Description: Minimum fraction of bases with Phred quality ≥30
- Impact on PASS/FAIL: Sample fails if q30_percent < threshold
Recommendations:
- 0.70 - Lenient (70% bases Q≥30)
- 0.80 - Standard (80% bases Q≥30)
- 0.90 - Strict (90% bases Q≥30)
read_length_pass_threshold
- Type: integer
- Default: 100
- Range: 1-1000
- Unit: Base pairs
- Description: Minimum average read length
- Impact on PASS/FAIL: Sample fails if mean_read_length < threshold
Recommendations:
- 50 - Short-read platforms with trimming
- 100 - Standard Illumina
- 120+ - Extended reads
contamination_threshold
- Type: float
- Default: 10
- Range: 0-100
- Unit: Percentage
- Description: Maximum allowed contamination from other species
- Impact on PASS/FAIL: Sample fails if contamination_pct > threshold
Recommendations:
- 5 - Strict, pure culture expected
- 10 - Standard, minor contamination acceptable
- 15+ - Lenient, some contamination tolerated
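The four PASS/FAIL rules above amount to three lower-bound checks and one upper-bound check. A minimal sketch, assuming the metrics arrive as a dictionary keyed by the names used in the rules (coverage, q30_percent, mean_read_length, contamination_pct); the function name core_qc_failures is hypothetical and this is not BactScout's own implementation.

```python
# Documented default thresholds.
DEFAULT_THRESHOLDS = {
    "coverage_threshold": 30,
    "q30_pass_threshold": 0.80,
    "read_length_pass_threshold": 100,
    "contamination_threshold": 10,
}


def core_qc_failures(metrics: dict, thresholds: dict = DEFAULT_THRESHOLDS) -> list:
    """Return the names of the core checks a sample fails (empty list = pass)."""
    failures = []
    if metrics["coverage"] < thresholds["coverage_threshold"]:
        failures.append("coverage")
    # q30_percent is treated as a fraction (0.0-1.0), like its threshold.
    if metrics["q30_percent"] < thresholds["q30_pass_threshold"]:
        failures.append("q30")
    if metrics["mean_read_length"] < thresholds["read_length_pass_threshold"]:
        failures.append("read_length")
    # Contamination is the only check where a higher value is worse.
    if metrics["contamination_pct"] > thresholds["contamination_threshold"]:
        failures.append("contamination")
    return failures
```

Because the comparisons are strict, a sample sitting exactly on a threshold passes that check in this sketch.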
Advanced QC Thresholds (TIER 1 & TIER 2)
These thresholds control evaluation of additional sequencing quality metrics extracted from fastp reports.
duplication_warn_threshold
- Type: float
- Default: 0.20
- Range: 0.0-1.0
- Unit: Fraction (0.20 = 20% duplicate reads)
- Description: Threshold for warning about PCR bias
- Status: WARNING if duplicates exceed this, FAILED if above duplication_fail_threshold
Interpretation:
- Duplicate reads indicate PCR amplification bias
- High values suggest library quality or coverage issues
- Typical values: 0.15-0.25 for well-constructed libraries
duplication_fail_threshold
- Type: float
- Default: 0.30
- Range: 0.0-1.0
- Unit: Fraction (0.30 = 30% duplicate reads)
- Description: Threshold for failing sample due to excessive PCR bias
- Impact: Sample status = FAILED if duplication_rate > threshold
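The two duplication thresholds together define three bands. The sketch below shows one way to express that; the function name duplication_status and the "PASSED" label for the lowest band are assumptions, while WARNING and FAILED follow the wording above.

```python
def duplication_status(duplication_rate: float,
                       warn_threshold: float = 0.20,
                       fail_threshold: float = 0.30) -> str:
    """Classify a duplication rate (fraction, 0.0-1.0) into a status band."""
    # Check the fail threshold first so it wins over the warning band.
    if duplication_rate > fail_threshold:
        return "FAILED"
    if duplication_rate > warn_threshold:
        return "WARNING"
    return "PASSED"  # label assumed for illustration
```

For this to behave sensibly, duplication_warn_threshold should be set no higher than duplication_fail_threshold; otherwise the warning band is empty.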
insert_size_min_threshold
- Type: integer
- Default: 200
- Range: 50-1000
- Unit: Base pairs
- Description: Minimum expected insert size
- Status: WARNING if insert size peak < threshold
Interpretation:
- Insert size = distance between paired-end reads
- Too short may indicate DNA fragmentation
- Too long may indicate library preparation issues
insert_size_max_threshold
- Type: integer
- Default: 600
- Range: 100-2000
- Unit: Base pairs
- Description: Maximum expected insert size
- Status: WARNING if insert size peak > threshold
filtering_pass_rate_threshold
- Type: float
- Default: 0.95
- Range: 0.0-1.0
- Unit: Fraction (0.95 = 95% reads pass)
- Description: Minimum fraction of reads passing quality filters
- Status: WARNING if pass rate < threshold
Interpretation:
- Fastp removes reads failing quality checks
- Low pass rates indicate poor sample quality
- Typical range: 0.90-0.98
n_content_threshold
- Type: float
- Default: 0.001
- Range: 0.0-0.01
- Unit: Fraction (0.001 = 0.1% ambiguous bases)
- Description: Maximum allowed fraction of ambiguous (N) bases
- Status: WARNING if N-content > threshold
Interpretation:
- N bases indicate uncertain base calls
- High N-content suggests poor base-calling confidence
- Usually < 0.1% in high-quality sequencing
quality_end_drop_threshold
- Type: integer
- Default: 5
- Range: 1-20
- Unit: Phred quality points
- Description: Maximum acceptable quality drop in final 20 cycles
- Status: WARNING if quality end-drop exceeds threshold
Interpretation:
- Quality often decreases toward read ends
- Large drops indicate sequencer degradation
- Typical drop: 0-5 quality points
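The warning-only thresholds (insert size, filtering pass rate, N-content, quality end-drop) can be pictured as a set of independent comparisons. In the sketch below the threshold keys are the documented config names, but the metric keys (insert_size_peak, filtering_pass_rate, n_content, quality_end_drop) and the function name are hypothetical stand-ins for whatever is extracted from the fastp report.

```python
def advanced_qc_warnings(metrics: dict, config: dict) -> list:
    """Return the names of the warning-level checks a sample triggers."""
    warnings = []
    peak = metrics["insert_size_peak"]
    # The insert size peak must fall inside the [min, max] window.
    if (peak < config["insert_size_min_threshold"]
            or peak > config["insert_size_max_threshold"]):
        warnings.append("insert_size")
    if metrics["filtering_pass_rate"] < config["filtering_pass_rate_threshold"]:
        warnings.append("filtering_pass_rate")
    if metrics["n_content"] > config["n_content_threshold"]:
        warnings.append("n_content")
    if metrics["quality_end_drop"] > config["quality_end_drop_threshold"]:
        warnings.append("quality_end_drop")
    return warnings
```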
Configuration Example with All Thresholds
# Standard QC thresholds
coverage_threshold: 30
contamination_threshold: 10
q30_pass_threshold: 0.80
read_length_pass_threshold: 100
# TIER 1 QC thresholds (duplication, insert size, filtering, N-content)
duplication_warn_threshold: 0.20
duplication_fail_threshold: 0.30
insert_size_min_threshold: 200
insert_size_max_threshold: 600
filtering_pass_rate_threshold: 0.95
n_content_threshold: 0.001
# TIER 2 QC thresholds (quality trends, adapters)
quality_end_drop_threshold: 5
mlst_species
- Type: dictionary (key-value pairs)
- Default: Includes 5 species
- Description: Species with available MLST schemes
Format:
mlst_species:
  species_key: 'Genus species name'
Key requirements:
- species_key: Used as database directory name (must match bactscout_dbs/{species_key}/)
- value: Scientific name used for species matching
Default species:
mlst_species:
  escherichia_coli: 'Escherichia coli#1'
  salmonella_enterica: 'Salmonella enterica'
  klebsiella_pneumoniae: 'Klebsiella pneumoniae'
  acinetobacter_baumannii: 'Acinetobacter baumannii#1'
  pseudomonas_aeruginosa: 'Pseudomonas aeruginosa'
Adding new species:
- Prepare MLST database in ARIBA format
- Place in bactscout_dbs/{species_key}/
- Add to config:
  mlst_species:
    my_species: 'Genus species'
- Update species name in filtered_metrics.csv if needed
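Since each species_key must match a directory under bactscout_dbs_path, a quick pre-flight check after adding a species can catch a mismatched key. A minimal sketch, assuming the config has already been parsed into a dictionary; missing_mlst_dirs is a hypothetical helper, not part of BactScout.

```python
from pathlib import Path


def missing_mlst_dirs(config: dict) -> list:
    """Return mlst_species keys that have no matching database directory."""
    base = Path(config.get("bactscout_dbs_path", "bactscout_dbs"))
    # Each key is expected at {bactscout_dbs_path}/{species_key}/.
    return [key for key in config.get("mlst_species", {})
            if not (base / key).is_dir()]
```

An empty list means every configured species has a directory; it does not confirm that the directory holds a valid ARIBA database.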
System Resources
system_resources.cpus
- Type: integer
- Default: 2
- Description: Minimum CPUs required (informational)
- Note: Actual thread count controlled by -t flag
system_resources.memory
- Type: string
- Default: '4.GB'
- Format: '{number}.{unit}' where unit is KB, MB, GB, TB
- Description: Minimum RAM required (informational)
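To make the '{number}.{unit}' format concrete, here is a small parser sketch. It rests on two assumptions that the format description does not settle: the number is a whole number, and units are binary multiples (1 KB = 1024 bytes). The name parse_memory is hypothetical.

```python
import re

# Assumption: binary multiples. Swap in powers of 1000 if decimal units are intended.
_UNIT_BYTES = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}
# Matches strings such as '4.GB' or '16.GB' (whole numbers only, by assumption).
_MEMORY_RE = re.compile(r"^(\d+)\.(KB|MB|GB|TB)$")


def parse_memory(value: str) -> int:
    """Convert a '{number}.{unit}' string into a byte count."""
    match = _MEMORY_RE.match(value.strip())
    if match is None:
        raise ValueError(f"memory must look like '4.GB', got {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_BYTES[unit]
```

A value written without the dot, such as '4GB', is rejected by this sketch.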
Configuration Use Cases
Lenient QC (More Samples PASS)
For exploratory studies, low-throughput, or difficult samples:
coverage_threshold: 20
q30_pass_threshold: 0.70
read_length_pass_threshold: 80
contamination_threshold: 15
Standard QC (Recommended)
For typical quality control and research:
coverage_threshold: 30
q30_pass_threshold: 0.80
read_length_pass_threshold: 100
contamination_threshold: 10
Strict QC (Fewer Samples PASS)
For critical applications requiring high confidence:
coverage_threshold: 50
q30_pass_threshold: 0.90
read_length_pass_threshold: 120
contamination_threshold: 5
Diagnostic Lab QC
For clinical/diagnostic samples:
coverage_threshold: 100
q30_pass_threshold: 0.90
read_length_pass_threshold: 100
contamination_threshold: 2
Epidemiology Focus
For outbreak investigations prioritizing species ID:
coverage_threshold: 20
q30_pass_threshold: 0.75
read_length_pass_threshold: 80
contamination_threshold: 10
mlst_species:
  # Include all relevant species
Using Custom Configurations
Create and Use Custom Config
# Create custom config
cp bactscout_config.yml my_lenient_config.yml
# Edit thresholds
nano my_lenient_config.yml
# Use in analysis
pixi run bactscout qc data/ -c my_lenient_config.yml
Per-Batch Configuration
# Batch 1: Strict QC
pixi run bactscout qc batch1/ -c strict_config.yml -o batch1_results/
# Batch 2: Lenient QC
pixi run bactscout qc batch2/ -c lenient_config.yml -o batch2_results/
# Generate reports with different thresholds
pixi run bactscout summary batch1_results/
pixi run bactscout summary batch2_results/
Override at Command Line
While command-line threshold overrides aren't supported, you can:
- Create config file with desired values
- Pass with -c flag
- Or modify the config file before running
Database Management
Database Locations
bactscout_dbs/
├── gtdb-r226-c1000-dbv1.syldb    # Sylph GTDB database
├── filtered_metrics.csv          # Genome metrics
├── escherichia_coli/             # MLST database
│   └── [ARIBA database files]
├── salmonella_enterica/
│   └── [ARIBA database files]
├── klebsiella_pneumoniae/
│   └── [ARIBA database files]
├── acinetobacter_baumannii/
│   └── [ARIBA database files]
└── pseudomonas_aeruginosa/
    └── [ARIBA database files]
Updating Databases
Update GTDB in config:
# Old database
sylph_db: 'gtdb-r226-c1000-dbv1.syldb'
# New database (after downloading)
sylph_db: 'gtdb-r227-c1000-dbv1.syldb'
Custom Reference Database
To use a custom reference database:
- Create Sylph index from your genomes
- Place in bactscout_dbs/
- Update config with filename:
sylph_db: 'my_custom_ref.syldb'
Validation
Configuration Validation
BactScout validates configuration on startup:
pixi run bactscout qc data/ -c config.yml
Checked:
- File exists and is readable
- YAML syntax valid
- Required keys present
- Threshold values in valid ranges
- Database files accessible
Example error:
Error: Configuration validation failed
- coverage_threshold must be > 0
- q30_pass_threshold must be 0.0-1.0
- Database file not found: gtdb-r226.syldb
Threshold Validation
Values are checked for reasonableness:
| Parameter | Min | Max |
|---|---|---|
| coverage_threshold | 1 | 10000 |
| q30_pass_threshold | 0.0 | 1.0 |
| read_length_pass_threshold | 1 | 10000 |
| contamination_threshold | 0 | 100 |
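A range check along the lines of the table can be sketched as follows. The bounds are copied from the table; the function name range_errors and the message wording are assumptions for illustration and may differ from BactScout's actual validation output.

```python
# (min, max) bounds copied from the table above.
VALID_RANGES = {
    "coverage_threshold": (1, 10000),
    "q30_pass_threshold": (0.0, 1.0),
    "read_length_pass_threshold": (1, 10000),
    "contamination_threshold": (0, 100),
}


def range_errors(config: dict) -> list:
    """Return one message per configured threshold that is out of range."""
    errors = []
    for key, (low, high) in VALID_RANGES.items():
        if key not in config:
            continue  # absent keys are left to the defaults
        if not low <= config[key] <= high:
            errors.append(f"{key} must be {low}-{high}")
    return errors
```

An empty list means every threshold present in the config is within bounds.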
Configuration Environment Variables
Setting configuration via environment variables is not yet implemented but is planned for a future release. The planned usage is:
export BACTSCOUT_COVERAGE_THRESHOLD=20
export BACTSCOUT_Q30_THRESHOLD=0.75
export BACTSCOUT_CONTAMINATION_THRESHOLD=15
pixi run bactscout qc data/
Example Configurations
Single-Species MLST Study
coverage_threshold: 30
q30_pass_threshold: 0.80
read_length_pass_threshold: 100
contamination_threshold: 5
mlst_species:
  escherichia_coli: 'Escherichia coli#1'
Multi-Species Surveillance
coverage_threshold: 20
q30_pass_threshold: 0.75
read_length_pass_threshold: 80
contamination_threshold: 10
mlst_species:
  escherichia_coli: 'Escherichia coli#1'
  salmonella_enterica: 'Salmonella enterica'
  klebsiella_pneumoniae: 'Klebsiella pneumoniae'
  acinetobacter_baumannii: 'Acinetobacter baumannii#1'
  pseudomonas_aeruginosa: 'Pseudomonas aeruginosa'
High-Throughput Screening
coverage_threshold: 15
q30_pass_threshold: 0.70
read_length_pass_threshold: 75
contamination_threshold: 20
Clinical Testing
coverage_threshold: 100
q30_pass_threshold: 0.95
read_length_pass_threshold: 100
contamination_threshold: 1
system_resources:
  cpus: 8
  memory: 16.GB
Troubleshooting Configuration
"Configuration file not found"
# Check if file exists
ls -la bactscout_config.yml
# Use full path
pixi run bactscout qc data/ -c /full/path/to/config.yml
"Invalid YAML syntax"
Check file with YAML validator:
python -c "import yaml; yaml.safe_load(open('bactscout_config.yml'))"
"Database not found"
Ensure database files exist:
ls -la bactscout_dbs/
# Should show:
# gtdb-r226-c1000-dbv1.syldb
# filtered_metrics.csv
# [species folders]
"Threshold values ignored"
Thresholds from config file are always used. To override:
1. Create new config file with desired values
2. Pass with -c flag
See Also
- Configuration Getting Started - Quick config guide
- Quality Control Guide - Understanding thresholds
- Troubleshooting Guide - Common issues