# Scaling BactScout for hundreds of samples
This page describes practical ways to run BactScout at scale (tens → hundreds → thousands of samples). It covers two common deployment modes:
- multi-threaded / multi-process runs on a single server (workstation / small node)
- cluster / HPC-style orchestration (job arrays or a workflow engine such as Nextflow)
The core principle is simple: run the per-sample collection/analysis step independently for each sample (the `bactscout collect` command), then aggregate the per-sample summaries into a single batch summary (the `bactscout summary` command). Running samples independently makes the workflow trivially parallel and fault tolerant.
## Recommended workflow (high level)
- Prepare an input table or layout of paired reads (R1/R2) and a configuration file (`bactscout_config.yml`).
- For each sample, run the per-sample subcommand `bactscout collect <R1> <R2> --output <sample_dir> --threads N` so that each sample writes a self-contained output directory (e.g. `<output_dir>/<sample_id>/`).
- When all per-sample runs have finished (or as they finish), aggregate results by running `bactscout summary <top_level_output_dir> --output final_summary.csv` (a minimal sequential sketch follows this list).
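For orientation, here is a minimal sequential sketch of the two-step pattern (no parallelism yet; the `samples.tsv` layout and `results/` path are illustrative assumptions):

```bash
#!/bin/bash
# samples.tsv: <sample_id>\t<r1.fastq.gz>\t<r2.fastq.gz>   (assumed layout)
while IFS=$'\t' read -r sample r1 r2; do
    bactscout collect "$r1" "$r2" --output "results/$sample" \
        --threads 4 --config bactscout_config.yml
done < samples.tsv

# Aggregate every per-sample summary into one batch summary
bactscout summary results --output final_summary.csv
```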
This pattern works the same whether you run samples locally in parallel or dispatch them to an HPC scheduler or workflow engine.
## Local / multi-threaded mode
If you have a machine with many CPU cores and ample RAM, you can process multiple samples in parallel by launching one `bactscout collect` process per sample and constraining the threads each process may use. Example options:
- Use GNU Parallel or a small Python script to iterate over samples and run `bactscout collect` with `--threads` set to the number of cores you want each sample to consume (a Python sketch follows the shell example below).
- Keep the sum of `--threads` across concurrent processes less than or equal to the physical cores available. Leave some headroom for the OS and I/O.
- If disk I/O is the bottleneck (many compressed reads being read concurrently), reduce the number of concurrent jobs or use faster storage (local NVMe) and stage data to node-local disks.
Example (bash + GNU Parallel):

```bash
# samples.tsv: <sample_id>\t<r1.fastq.gz>\t<r2.fastq.gz>
# 8 concurrent jobs x 4 threads each = 32 cores in use
cat samples.tsv | parallel -j 8 --colsep '\t' \
    'bactscout collect {2} {3} --output results/{1} --threads 4 --config bactscout_config.yml'
```
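If you prefer Python to GNU Parallel, a small driver using only the standard library does the same job. A hedged sketch, assuming the same `samples.tsv` layout; threads are sufficient here because each worker just waits on a subprocess:

```python
#!/usr/bin/env python3
"""Run one bactscout collect per sample with bounded concurrency."""
import csv
import subprocess
from concurrent.futures import ThreadPoolExecutor

JOBS = 8     # concurrent samples
THREADS = 4  # --threads per sample; keep JOBS * THREADS <= physical cores

def collect(row):
    sample_id, r1, r2 = row
    cmd = ["bactscout", "collect", r1, r2,
           "--output", f"results/{sample_id}",
           "--threads", str(THREADS),
           "--config", "bactscout_config.yml"]
    return sample_id, subprocess.run(cmd).returncode

def main():
    with open("samples.tsv") as fh:
        rows = [r for r in csv.reader(fh, delimiter="\t") if r]
    with ThreadPoolExecutor(max_workers=JOBS) as pool:
        for sample_id, rc in pool.map(collect, rows):
            if rc != 0:
                print(f"FAILED: {sample_id} (exit code {rc})")

if __name__ == "__main__":
    main()
```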
Notes:
- Use `--threads` to control per-process parallelism; BactScout passes this to the CPU-bound tools it calls.
- Use `--report-resources` if you want resource usage recorded in the per-sample summary, so you can tune resource allocation later.
## HPC / cluster mode
On an HPC cluster it is common to dispatch one sample per job (or a small group of samples per job). Two popular ways to do this are:
- Job arrays (SLURM, SGE): submit many similar jobs that each run `bactscout collect` for one sample.
- A workflow manager (Nextflow, Snakemake, Cromwell): let the engine discover inputs and schedule processes on cluster nodes.
Job-array example (SLURM pseudo-script):

```bash
#!/bin/bash
#SBATCH --array=1-200%40   # run up to 40 tasks concurrently
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=04:00:00

# Pull this task's row from the sample sheet (1-based line number)
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.tsv | cut -f1)
R1=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.tsv | cut -f2)
R2=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.tsv | cut -f3)

bactscout collect "$R1" "$R2" --output "results/$SAMPLE" \
    --threads "$SLURM_CPUS_PER_TASK" --config bactscout_config.yml
```
Advantages:
- Simple mapping between sample and job; failed samples are easy to identify and retry (see the resubmission sketch below)
- You can size memory/CPU per job based on observed behaviour
Considerations:
- On some filesystems (e.g., NFS), avoid writing all sample outputs into the same shared directory at very high concurrency. Instead, stage to node-local storage and copy results out.
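Retrying is then a matter of finding samples without a finished summary and resubmitting just those array indices. A hedged sketch, assuming the per-sample `<sample>_summary.csv` naming used by the Nextflow example below:

```bash
# Collect the array indices of samples whose summary file is missing
: > failed_indices.txt
idx=0
while IFS=$'\t' read -r sample r1 r2; do
    idx=$((idx + 1))
    [ -f "results/$sample/${sample}_summary.csv" ] || echo "$idx" >> failed_indices.txt
done < samples.tsv

# Resubmit only the failed indices (SLURM accepts a comma-separated list)
sbatch --array=$(paste -sd, failed_indices.txt) collect_array.sbatch
```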
## Using Nextflow (recommended for production pipelines)
Nextflow is an excellent choice for running BactScout across many samples. The `nextflow_example` directory in the GitHub repo contains a minimal, documented workflow (`nextflow.nf` + `nextflow.config`) that demonstrates:
- automatic discovery of paired-end FASTQ pairs
- per-sample `collect` runs in an isolated process
- publishing per-sample outputs into per-sample directories
- aggregation of per-sample summaries into a single `final_summary.csv`
How the example workflow works (walkthrough):
- Parameters: the workflow accepts `--input_dir`, `--output_dir`, `--config`, and `--threads`. You can override these at runtime.
- Input discovery: the workflow builds a channel of `(sample_name, read1, read2)` tuples by matching common naming patterns (`*_R1.fastq.gz`/`*_R2.fastq.gz` or `*_1.fastq.gz`/`*_2.fastq.gz`); a minimal sketch follows this list.
- `collect_sample` process: for each tuple the workflow runs a containerised `bactscout collect` command. Important features in the example:
    - `stageInMode 'copy'` to safely copy input files to the compute node (reduces shared-FS load)
    - `publishDir "${params.output_dir}/${sample_name}", mode: 'copy'`, which ensures each sample's outputs land in a dedicated directory under the main output dir
    - the container image is set in the example; set this to your own image or remove the container directives if you run on systems without container support
- `final_summary` process: once the per-sample summaries are available, the workflow runs `bactscout summary` to create a single `final_summary.csv` in the top-level output directory.
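The input discovery step can be reproduced with Nextflow's built-in pair matching; a minimal sketch for the first naming pattern (swap the glob for `*_{1,2}.fastq.gz` inputs):

```nextflow
// Build (sample_name, read1, read2) tuples from paired FASTQ file names
Channel
    .fromFilePairs("${params.input_dir}/*_{R1,R2}.fastq.gz", checkIfExists: true)
    .map { sample_name, reads -> tuple(sample_name, reads[0], reads[1]) }
    .set { read_pairs }
```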
Why use Nextflow:
- Proven scheduling, retry, and resource isolation across many cluster types (SLURM, PBS, SGE, Kubernetes).
- Automatic staging, container support (Docker/Singularity), and reproducible runs.
- Built-in logging and provenance tracking; easy to resume failed runs.
Practical tips for adapting the example:
- Container image: adjust the `container` entry in `nextflow.nf` to a BactScout image you manage, or remove the container directives to use the system installation.
- Threads and resource hints: set `params.threads` and configure `process` defaults in `nextflow.config` (cpus, memory). Tune per-sample values to match the tools that BactScout invokes on your inputs (a config sketch follows this list).
- Staging and storage: if you're on a shared filesystem, prefer node-local staging (copy) and publish in bulk. If your cluster has a high-performance parallel filesystem, you can choose `stageInMode` accordingly.
- Failure / retry: Nextflow will retry tasks; set a sensible `maxRetries` in the process config if desired.
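A hedged `nextflow.config` sketch covering these tips; every value below is an illustrative starting point, not a setting taken from the BactScout repo:

```nextflow
// nextflow.config -- illustrative defaults; tune to your cluster and inputs
params.threads = 4

process {
    cpus          = params.threads
    memory        = '8 GB'
    time          = '4h'
    errorStrategy = 'retry'   // re-run transient failures
    maxRetries    = 2

    withName: 'collect_sample' {
        // raise these if fastp/Sylph/stringMLST need more on your data
        cpus   = 4
        memory = '8 GB'
    }
}

executor {
    name      = 'slurm'
    queueSize = 40            // cap on tasks submitted to the scheduler at once
}
```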
## Process details: `collect_sample` and `final_summary`
The `nextflow_example/nextflow.nf` workflow contains two main process blocks that implement the per-sample collection and the final aggregation. Below are the relevant excerpts and a short explanation of each directive so you can adapt them for your site.
1) `collect_sample` (runs `bactscout collect` for a single sample)
```nextflow
process collect_sample {
    tag { sample_name }
    container 'docker.io/happykhan/bactscout:latest'
    stageInMode 'copy'
    publishDir "${params.output_dir}/${sample_name}", mode: 'copy'

    input:
    tuple val(sample_name), path(read1), path(read2)

    output:
    path("${sample_name}/${sample_name}_summary.csv"), emit: summary
    path("${sample_name}/**"), emit: all_outputs

    script:
    """
    bactscout collect \
        ${read1} \
        ${read2} \
        --output . \
        --threads ${params.threads} \
        --config /app/bactscout_config.yml 2>&1
    """
}
```
Key notes:
- `tag { sample_name }` prints the sample name into Nextflow logs for easy tracing.
- `container` points to an image that contains BactScout and any runtime deps; replace it with your image, or remove it if using system Python.
- `stageInMode 'copy'` copies inputs to the compute node (helps reduce shared-FS contention).
- `publishDir "${params.output_dir}/${sample_name}", mode: 'copy'` ensures each sample's outputs are published into a dedicated directory under your main output dir.
- The `input` and `output` declarations define the files passed into and out of the process; `emit` names allow workflow wiring (e.g., summaries collected by the aggregator).
2) `final_summary` (aggregates per-sample summaries)
```nextflow
process final_summary {
    container 'docker.io/happykhan/bactscout:latest'
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    path(summaries)

    output:
    path("final_summary.csv")

    script:
    """
    bactscout summary \
        . \
        --output .
    """
}
```
Key notes:
- The `final_summary` process collects the per-sample `_summary.csv` files and runs `bactscout summary` to produce a consolidated `final_summary.csv` in the top-level output directory.
- You can tune this process to accept summaries as a channel of files, or pass a glob, depending on how you wire the workflow; the example uses `collect_sample.out.summary.collect()` to pass a list of emitted summary paths (see the wiring sketch below).
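Wiring the two processes together then takes only a few lines in the workflow block; a DSL2 sketch consistent with the excerpts above:

```nextflow
workflow {
    // discover (sample_name, read1, read2) tuples under params.input_dir
    read_pairs = Channel
        .fromFilePairs("${params.input_dir}/*_{R1,R2}.fastq.gz")
        .map { name, reads -> tuple(name, reads[0], reads[1]) }

    collect_sample(read_pairs)

    // wait for every per-sample summary, then aggregate once
    final_summary(collect_sample.out.summary.collect())
}
```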
Together these two processes implement the safe, per-sample execution model: independent, containerised per-sample runs that publish self-contained outputs, followed by a small aggregation step that is cheap and easily retryable.
## I/O, storage and filesystem considerations
- Small files: avoid creating millions of tiny files on metadata-limited filesystems. Keep per-sample outputs grouped under a single directory per sample.
- Compression: BactScout reads compressed FASTQs; avoid repeatedly decompressing the same input across many concurrent jobs on a shared filesystem — stage or copy inputs to local disk when possible.
- Temporary directories: prefer node-local scratch (e.g., `$TMPDIR`) for temporary files, then copy the final outputs back to the shared output directory (a sketch follows this list).
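A hedged sketch of that stage-in/stage-out pattern inside a batch job (`$SAMPLE`, `$R1`, `$R2` and the shared results path are assumptions carried over from the job-array example):

```bash
# Run on node-local scratch, then copy the finished outputs back
WORKDIR="${TMPDIR:-/tmp}/bactscout_${SAMPLE}"
mkdir -p "$WORKDIR"
cp "$R1" "$R2" "$WORKDIR/"
cd "$WORKDIR"

bactscout collect "$(basename "$R1")" "$(basename "$R2")" \
    --output out --threads 4 --config bactscout_config.yml

mkdir -p "/shared/results/$SAMPLE"
cp -r out/. "/shared/results/$SAMPLE/"
rm -rf "$WORKDIR"
```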
## Resource monitoring and tuning
- Start with conservative `--threads` and per-job memory values; collect resource usage (use `--report-resources` when running `bactscout collect`) and tune based on the observed CPU and memory footprints (see the pilot example after this list).
- Typical per-sample settings depend on the tools BactScout calls (fastp, Sylph, stringMLST). On modest inputs, `--threads` of 2–4 and 4–8 GB RAM per sample is a reasonable starting point; adjust upwards for larger genomes or deeper coverage.
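For example, a short pilot over a handful of representative samples with resource reporting turned on gives you real numbers to size jobs with:

```bash
# Pilot run: five samples, recording CPU/memory usage per sample
head -n 5 samples.tsv | while IFS=$'\t' read -r sample r1 r2; do
    bactscout collect "$r1" "$r2" --output "pilot/$sample" \
        --threads 4 --report-resources --config bactscout_config.yml
done
```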
## Debugging and retries
- Run a single sample locally with verbose logging to confirm the config and container bindings before launching large batches.
- Use the workflow engine's retry behavior or job-array retry logic to re-run failed samples. Keep per-sample output directories intact so you can inspect logs and intermediate files for failures.
## Example: adapt the `nextflow_example` for your site
- Fork `nextflow_example` (in the GitHub repo) and set `params.input_dir` and `params.output_dir` to your paths.
- Build or choose a container image that bundles BactScout and its runtime dependencies (or install BactScout in the cluster environment and remove the `container` directives).
- Tune `process` defaults in `nextflow.config` (cpus, memory) and/or set process-level resource hints for `collect_sample`.
- Test with a small set of samples, confirm the `final_summary.csv` contents, then scale to the whole cohort (an example launch command follows this list).
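With those changes in place, a typical launch looks like the following (parameter names match the walkthrough above; the `-profile` name is site-specific and assumed):

```bash
nextflow run nextflow.nf \
    --input_dir /data/fastq \
    --output_dir /data/results \
    --config bactscout_config.yml \
    --threads 4 \
    -profile slurm -resume
```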
BactScout is intentionally designed so each sample is an independent unit of work. That makes it straightforward to scale: choose the orchestration primitive you prefer (parallel processes, job arrays, or a workflow engine), size resources per sample, stage inputs to local storage when appropriate, and aggregate results with `bactscout summary`.