Welcome to bgrr|’s documentation!

bgrr| (“bgrrl”, Bacterial Genome Reconstruction (aka assembly) and Recognition (aka annotation) Pipeline) is a Python 3-based workflow system for large-scale bacterial genomics studies. It is used by the Earlham Institute (EI) Core Bioinformatics group to deliver bacterial genome assembly and annotation projects as part of EI’s national capability. bgrr| is a customisable, Snakemake-driven Python application that wraps established and state-of-the-art bioinformatics software tools such as unicycler and prokka.

The pipeline consists of individual modules for preprocessing, assembly, annotation, quality assurance, and project-specific finalisation and packaging. Individual steps and settings, such as the choice of assembly software (currently supported: unicycler, spades, velvet-optimizer) or the trimming/preprocessing parameters, can be customised according to user preference or project requirements. bgrr| offers two choices of genome annotation: de novo via prokka and reference-based via ratt. Report generation and quality assurance of the produced assemblies and annotations are performed by the independent qaa (“kaa”, Quality Assemblies and Annotations) workflow system. qaa currently wraps quast, qualimap, busco, and blobtools for quality assessment and collates their results into a multiqc report.

bgrr| has been designed to deal with large numbers of samples of varying size and quality. To ensure pipeline stability at runtime, we perform a set of pre-assembly checks on read quality, k-mer distribution, and GC content, and run survey assemblies with tadpole to determine the assemble-ability of a sample (“survey-stage”). The pipeline is easy to operate and runs with minimal user interaction, but it also allows the outcomes of each stage to be revised if required. The produced assemblies and annotations are complemented by comprehensive reports for each stage.
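To make the idea of a pre-assembly check concrete, here is a minimal Python sketch of a GC-content check of the kind described above. It is an illustration only and not bgrr|'s actual implementation: the function names (`sample_gc_fraction`, `passes_gc_check`), the expected GC value, and the tolerance are all hypothetical, and a real check would read sequences from FASTQ files rather than from an in-memory list.

```python
from collections import Counter
from typing import Iterable


def sample_gc_fraction(reads: Iterable[str]) -> float:
    """Return the pooled GC fraction over all reads.

    Only unambiguous bases (A, C, G, T) are counted, so N and other
    ambiguity codes do not distort the result.
    """
    counts = Counter()
    for read in reads:
        counts.update(read.upper())
    gc = counts["G"] + counts["C"]
    total = gc + counts["A"] + counts["T"]
    return gc / total if total else 0.0


def passes_gc_check(reads: Iterable[str], expected_gc: float,
                    tolerance: float = 0.05) -> bool:
    """Hypothetical check: is the sample GC within `tolerance` of the
    GC fraction expected for the target organism?"""
    return abs(sample_gc_fraction(reads) - expected_gc) <= tolerance


if __name__ == "__main__":
    # Toy reads; the expected GC of 0.52 is an assumed example value.
    toy_reads = ["GGCC", "ATAT"]
    print(sample_gc_fraction(toy_reads))
    print(passes_gc_check(toy_reads, expected_gc=0.52))
```

A sample failing such a check could be flagged before the assembly stage, which is the general purpose of the checks described above; the thresholds that bgrr| actually applies are not stated here.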

So far, bgrr| has been successfully applied to over 8000 samples across five projects with different requirements, including a large part of the sequencing data generated by the 12K Salmonella Project between the University of Liverpool and the Earlham Institute.