Frequently Asked Questions

Installation & Setup

Q: Do I need to install all the dependencies myself?

A: No! Pixi handles this automatically. Just install Pixi, run pixi install, and all dependencies (fastp, Sylph, ARIBA, etc.) are installed in an isolated environment.

Q: Can I use BactScout with conda or pip?

A: While it's technically possible, we strongly recommend Pixi for: - Reproducibility (lock files) - Isolated environments (no conflicts) - Automatic package downloads - CI/CD compatibility

If you prefer conda/pip, you'll need to manually install: Python 3.11+, fastp, sylph, ariba, stringmlst, and required Python packages.

Q: Can I run BactScout on Windows?

A: Not directly - BactScout uses Unix-based tools (fastp, Sylph, ARIBA). Options: - Use WSL2 (Windows Subsystem for Linux 2) - Use Docker for Windows - Run on a Linux system

Q: How much disk space do I need?

A: - BactScout installation: ~2-3 GB (with dependencies) - Reference databases: ~20-30 GB - Per sample output: ~500 MB - 2 GB (varies with read depth) - Typical batch (100 samples): ~100-200 GB including databases

Running BactScout

Q: What's the difference between `qc` and `collect`?

A: - qc: Process multiple samples in a directory (batch mode) - collect: Process a single pair of FASTQ files (single sample mode)

Use qc for high-throughput screening, collect for individual samples or integration.

Q: How long does analysis take?

A: Typical processing time per sample: - Small (100k reads): 2-5 minutes - Medium (1M reads): 5-15 minutes
- Large (>1M reads): 15-30+ minutes

Depends on thread count, system speed, and read depth.

Q: Can I run multiple analyses in parallel?

A: Yes, you can run multiple BactScout instances: - Different samples/batches on same machine - Use different output directories: -o output_batch1/, -o output_batch2/ - Monitor resource usage (CPU, memory) - Recommended: 1-2 instances to avoid resource contention

Q: What's the maximum number of threads I can use?

A: Use up to your system's CPU count:

nproc  # Show available CPUs

pixi run bactscout qc data/ -t $(nproc)  # Use all CPUs

More threads = faster but higher memory usage. Balance with available RAM.

Q: Can I skip quality checks?

A: Yes, use --skip-preflight:

pixi run bactscout qc data/ --skip-preflight

This skips FASTQ format validation. Use only if you trust your input data.

Results & Quality Control

Q: What does "FAIL" in quality_pass mean?

A: A sample fails when any metric doesn't meet thresholds: - Coverage < 30x - Q30% < 80% - Read length < 100 bp - Contamination > 10%

Check which metric failed and see Quality Control Guide.

Q: Can I adjust quality thresholds?

A: Yes, edit bactscout_config.yml:

coverage_threshold: 20           # Stricter or more lenient
q30_pass_threshold: 0.75         # Lower for relaxed QC
read_length_pass_threshold: 80
contamination_threshold: 15

Then run: pixi run bactscout qc data/ -c my_config.yml

Q: What if my samples don't meet thresholds?

A: First, understand why: 1. Review the specific failing metric 2. Check fastp HTML report for details 3. Consider if threshold is appropriate for your application

Options: - Lower thresholds if acceptable for your study - Re-sequence with higher depth - Improve sample prep/quality

Q: Why do some samples have "Unknown" species?

A: Reasons: - Species not in GTDB database - Poor quality/coverage (can't identify confidently) - Heavily contaminated sample - Novel/rare organism

Solutions: - Check if contamination is high (>5%) - Try manual BLAST identification - Add custom reference genomes

Q: How do I interpret contamination % in mixed samples?

A: If you intentionally have mixed samples: - Contamination % shows other species present - Use Sylph output to see all detected species - May need specialized metagenomics tools for detailed analysis

Q: Can I reprocess samples with different thresholds?

A: Sort of: - Re-running BactScout: It overwrites previous results - Using summary with different config: Changes only quality_pass determination - Better approach: Run once, then filter results in analysis script

MLST & Strain Typing

Q: Which species support MLST typing?

A: BactScout includes schemes for: - Escherichia coli - Salmonella enterica - Klebsiella pneumoniae - Acinetobacter baumannii - Pseudomonas aeruginosa

To add more, install additional ARIBA databases.

Q: What does sequence type (ST) mean?

A: ST is a unique number assigned based on alleles at 7 housekeeping genes: - Same ST = likely same source/lineage - Different ST = different strain - Novel ST = new combination not in database - Useful for epidemiological tracking and outbreak investigation

Q: What if MLST is partial or fails?

A: When MLST cannot assign a valid ST, the status is set to WARNING (not FAIL). This is informational and does NOT cause the overall sample to fail QC.

Common causes of missing ST: - Low coverage over specific housekeeping genes - Sample contamination reducing coverage - Sequence divergence (novel allele combinations) - Unreported/novel ST not in pubMLST database

Important: Missing MLST does not affect sample quality assessment. A sample can: - ✅ PASS overall QC with missing ST (if other metrics pass) - ✅ Have a valid MLST ST and still FAIL QC (if coverage/Q30/etc. fails)

Solutions if you need MLST: - Increase sequencing depth for better coverage - Verify sample hasn't degraded - Check if novel ST can be submitted to pubMLST - Review housekeeping gene coverage in sample report

Q: Does missing MLST mean my sample failed?

A: No! MLST status is separate from sample QC pass/fail. Example:

Scenario	Q30	Coverage	Reads	Contamination	MLST	Overall Result
Good sample, ST found	PASS	PASS	PASS	PASS	PASSED	✅ PASS
Good sample, no ST	PASS	PASS	PASS	PASS	WARNING	✅ PASS
Poor Q30, has ST	FAIL	PASS	PASS	PASS	PASSED	❌ FAIL

The sample quality is determined by sequencing metrics (coverage, Q30, contamination, read length). MLST is optional strain typing information.

Data Management

Q: How should I organize input data?

A: For qc command (batch processing):

data/
├── sample_001_R1.fastq.gz
├── sample_001_R2.fastq.gz
├── sample_002_R1.fastq.gz
├── sample_002_R2.fastq.gz
└── ...

Supported naming: - *_R1.fastq.gz, *_R2.fastq.gz - *_1.fastq.gz, *_2.fastq.gz - *_R1.fq.gz, *_R2.fq.gz

Q: Can I use compressed (gzip) FASTQ files?

A: Yes! BactScout handles both: - .fastq.gz (compressed) - recommended for storage - .fastq (uncompressed) - faster I/O but more space

Q: How do I manage large result directories?

A: Strategies: - Archive old results: tar -czf batch_2024-01.tar.gz bactscout_output/ - Compress HTML reports: gzip bactscout_output/*/fastp_report.html - Keep only CSV summaries, delete per-sample files if needed - Backup final_summary.csv separately

Q: Can I export results to other formats?

A: BactScout outputs CSV (easy to convert):

import pandas as pd

df = pd.read_csv('bactscout_output/final_summary.csv')

# Excel
df.to_excel('results.xlsx', index=False)

# JSON
df.to_json('results.json', orient='records')

# SQL database
# df.to_sql('samples', sqlite3.connect('results.db'))

Configuration

Q: How do I set permanent defaults?

A: Edit bactscout_config.yml in project root. These become defaults for all runs.

Q: Can I have multiple configs?

A: Yes, create copies and use with -c flag:

pixi run bactscout qc data/ -c strict_config.yml
pixi run bactscout qc data/ -c lenient_config.yml

Q: How do I add new reference genomes?

A: 1. Format as FASTA files 2. Place in bactscout_dbs/species_name/ 3. Update Sylph GTDB index 4. Update ARIBA if adding MLST/resistance

See Configuration Guide.

Performance & Optimization

Q: How can I speed up analysis?

A: 1. Increase threads: -t 8 or -t $(nproc) 2. Use SSD storage for I/O speed 3. Pre-filter low-quality reads with fastp 4. Run multiple instances on different samples

Q: How can I reduce memory usage?

A: 1. Decrease threads: -t 2 uses less memory than -t 8 2. Reduce batch size (process fewer samples at once) 3. Enable streaming/chunked processing if available 4. Upgrade to system with more RAM

Q: Is there a GUI?

A: No, BactScout is command-line only. But: - Results are CSV (easily viewed in Excel, R, Python) - HTML reports are interactive and visual - Results can be loaded in analysis tools of your choice

Troubleshooting

Q: What's in the log output?

A: BactScout outputs: - Progress indicators (which sample, which step) - Quality metrics for each sample - Error/warning messages - File locations of results

Q: How do I report a bug?

A: 1. Check GitHub Issues 2. Provide: - BactScout version: pixi run bactscout --version - Command used - Error message and full output - System info: uname -a - Sample data (if possible, anonymized) 3. Create new issue with detailed reproduction steps

Q: Where are results stored?

A: By default in bactscout_output/: - Use -o to specify different location - Per-sample results in Sample_XXX/ directories - Batch summary in final_summary.csv

Q: Can I resume interrupted analysis?

A: Not directly. However: - Check which samples were already processed - Delete incomplete samples from output - Run QC again (it will reprocess everything) - Consider per-sample collect command for better control

Q: Can I integrate with other pipelines?

A: Yes: - BactScout outputs standard CSV - Results can feed into assembly, SNP calling, AMR detection, etc. - Python/R scripts can read CSV and prepare data

Contributing

Q: How can I contribute?

A: See Contributing Guide for details on: - Reporting issues - Submitting code changes - Testing requirements - Development setup

Q: Can I add my own analyses?

A: BactScout is modular and extensible: 1. Fork the repository 2. Add your analysis in new module 3. Integrate into CLI 4. Submit pull request

Still Have Questions?

Check Troubleshooting Guide for common issues
Read Quality Control Guide for QC interpretation
Review Output Format for column descriptions
Check configuration Examples
Search GitHub Issues

If your question isn't answered here, please open an issue on GitHub!