MATERIALS AND METHODS

Data sources

The genomic and metagenomic datasets used in this study were downloaded from the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra). Isolate genome datasets were Escherichia coli KLY (SRR1424625), Pseudomonas aeruginosa VA-134 (SRR2939129) and Streptococcus pyogenes M39 (SRR5280756). To simulate a metagenomic dataset, these three isolates were combined with mouse sequencing data (SRR1752459) to increase sample complexity. For the metagenomic study, we used the following datasets: ground water, deep water biosphere (SRR10598175); Lake Redon in the Central Pyrenees, Spain (ERR472738); Arctic permafrost (SRR11195315); and peatland wetlands (SRR5823773). Unassembled read datasets of phage therapy candidates were obtained for 66 antibiotic-resistant P. aeruginosa isolates distributed by the CDC & FDA Antibiotic Resistance Isolate Bank. All sequencing data were from Illumina sequencing platforms and were downloaded as SRA files, with Fastq files extracted by executing the SRA toolkit command fastq-dump with paired-end files split (17).

CasCollect development and targeted gene assembly

CasCollect was developed in the Python and Perl languages, and the pipeline is publicly available for download under the terms of the GNU General Public License version 3 at https://github.com/sandialabs/CasCollect. Installation requirements and documentation are provided in the download. A check script for dependencies will download and extract missing software. All tests reported for this work were performed on a system with 100 Intel Xeon CPUs at 2.40 GHz and 2 TB of RAM. CasCollect was designed for POSIX-compliant operating systems, including Unix and Linux distributions.
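As an illustration of the kind of dependency check described above, the following minimal Python sketch reports which external tools cannot be found on the PATH. It is not the check script shipped with CasCollect: the executable names in ASSUMED_EXECUTABLES and the helper find_missing are hypothetical, and the sketch only reports missing tools rather than downloading them.

```python
import shutil

# Hypothetical executable names (assumptions for illustration only); the real
# names depend on how each tool was installed on the system.
ASSUMED_EXECUTABLES = {
    "BBTools": "bbduk.sh",
    "Seqtk": "seqtk",
    "FragGeneScanPlus": "FGS+",
    "HMMER": "hmmsearch",
    "VSEARCH": "vsearch",
    "SPAdes": "spades.py",
    "CRISPRCasFinder": "CRISPRCasFinder.pl",
}


def find_missing(executables, which=shutil.which):
    """Return the sorted tool names whose executable is not found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return sorted(
        tool for tool, exe in executables.items() if which(exe) is None
    )


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All assumed executables were found on PATH.")
```

Passing the lookup function as an argument keeps the sketch independent of the machine it runs on, so the same logic can be exercised without any of the tools installed.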
CasCollect dependencies are BBTools 38.84 (https://jgi.doe.gov/data-and-tools/bbtools/), Seqtk (https://github.com/lh3/seqtk), FragGeneScanPlus (FGS+, https://github.com/hallamlab/FragGeneScanPlus), HMMER v3.3 (http://hmmer.org/), VSEARCH (https://github.com/torognes/vsearch), SPAdes 3.14.1 (http://cab.spbu.ru/software/spades/) and CRISPRCasFinder (https://crisprcas.i2bc.paris-saclay.fr/). CRISPRCasFinder was parallelized through a Perl script that uses the number of CPUs defined by the user, skips contigs below a size cutoff and generates a GFF3 output file. The CasCollect pipeline includes read filtering, seed generation, subset read expansion, assembly, and annotation for cas genes and CRISPR arrays.

CasCollect parameters

The CasCollect pipeline has several parameters that can be altered for user-specific workflows, described in detail with the -h command. The short DNA sequencing read input can accommodate single- or paired-end sequencing data with -single file.fastq or -fwd file.fastq and -rev file.fastq, respectively (Figure 1A). Filtering is disabled by default, and the sequencing reads pass directly to seed generation and the downstream workflow (Figure 1B). Setting the flag --trim performs adapter trimming and, for paired-end data, read merging. The flag --clean performs the trim function and additionally removes sequencing reads that match a user-defined set of undesired nucleic acid sequence(s) set with -ref file.fasta. For seed generation, protein mode is enabled by default and searches for cas genes using a set of 120 HMM profiles (13) included with the program (Figure 1C). The Cas protein profile HMMs can be substituted with -hmm file.hmm, and protein mode can be disabled with the flag --noprot. DNA and user-defined modes are disabled by default and can be activated by the flags --nucl and --seed; the search sequences are set with -query file.fasta and -define file.fasta, respectively.
The number of rounds of seed expansion defaults to 5 and can be changed with -cycle number, while the match is set to 95% and can be changed with -match number (Figure 1D). Read sequence assembly and annotation are enabled by default and can be disabled by the flags --noassembly (Figure 1E) and --noannotate (Figure 1F), respectively. The flag --meta runs metaSPAdes in place of the SPAdes assembly.

Figure 1. Workflow for the CasCollect pipeline. CasCollect processes an initial high-throughput sequencing read dataset (A) by read filtering (B), seed generation (C), read subset expansion (D), assembly (E) and annotation (F). (A) The sequencing read dataset requires quality scores to assess the confidence of each base call for the subsequent filtering step. (B) Read filtering can be skipped, set to trim adapter sequences and low-quality regions, or set to clean, which performs trimming and removes reads matching a reference of undesired sequence(s) supplied by the user. For paired-end reads, both trimming and cleaning will merge reads with overlapping regions. (C) Seed reads can be generated by Protein mode, DNA mode and/or a user-defined read subset (dark gray boxes). Protein mode translates the reads for searching with either the built-in protein profile HMMs or a user-defined set. DNA mode searches for matches to user-defined reference sequence(s). User-defined mode allows any subset of reads or sequences to be used for seed expansion. The seed generation modes can be invoked independently or concurrently within a single run of the program. (D) The number of cycles of read subset expansion can be varied to generate larger or smaller expanded read sets. (E) Targeted assembly using this subset of reads and (F) annotation of the assembled contigs to identify cas genes and CRISPR arrays are optional.

Unassembled genomic DNA comparative analysis

For the E. coli KLY, P. aeruginosa VA-134 and S.
pyogenes M39 bacterial isolates, CasCollect was run with the default parameters for a protein homology read search with the Cas HMM profiles and the following parameters: --trim -cycle 2 -cpu 100 -mem 2000. The simulated and metagenomic datasets were run with the following parameters: --trim --meta -cpu 100 -mem 2000. Datasets from the CDC & FDA Antibiotic Resistance Isolate Bank panel of P. aeruginosa isolates were run with the same parameters as the bacterial isolates (--trim -cycle 2 -cpu 100 -mem 2000), appended with --nucl -query Pseudomonas_aeruginosa_DK2.fas for DNA mode to search for isolated CRISPR arrays. The complete assembly used the CasCollect filtered and trimmed reads run through SPAdes with the same number of CPUs and amount of RAM as CasCollect. For the metagenomic and simulated metagenomic datasets, metaSPAdes was run in place of the SPAdes assembly (18).

Progressive read collection analysis

The metagenomic datasets were run through the CasCollect pipeline with zero to five cycles of read subset expansion. Each of these read sets and the whole sequencing dataset were mapped onto the largest cas operon for each metagenome with bowtie2 using default parameters (19). The read coverage was extracted with SAMtools (20) using the depth command and the -a parameter to output coverage for the full-length contig.
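The coverage extraction described above can be summarized with a short Python sketch. It assumes the tab-separated, three-column output of samtools depth -a for a single alignment file (contig name, 1-based position, depth); the function name summarize_depth, the statistics reported and the example values are illustrative assumptions, not part of CasCollect or of the analysis reported here.

```python
import io
from collections import defaultdict


def summarize_depth(handle):
    """Summarize `samtools depth -a` output (contig, position, depth).

    Returns, per contig, the number of reported positions, the mean depth
    and the fraction of positions with at least one mapped read.
    """
    # Per contig: [positions seen, summed depth, positions with depth > 0]
    totals = defaultdict(lambda: [0, 0, 0])
    for line in handle:
        line = line.strip()
        if not line:
            continue
        contig, _position, depth = line.split("\t")[:3]
        depth = int(depth)
        stats = totals[contig]
        stats[0] += 1
        stats[1] += depth
        stats[2] += 1 if depth > 0 else 0
    return {
        contig: {
            "length": n,
            "mean_depth": total / n,
            "breadth": covered / n,
        }
        for contig, (n, total, covered) in totals.items()
    }


# Made-up depth values for a four-base contig, for illustration only.
example = io.StringIO(
    "contig_1\t1\t0\n"
    "contig_1\t2\t4\n"
    "contig_1\t3\t8\n"
    "contig_1\t4\t0\n"
)
summary = summarize_depth(example)
print(summary["contig_1"])
```

Because the -a parameter reports positions with zero depth, the number of rows per contig equals the contig length, which is what allows the breadth of coverage to be computed directly from the output.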
Article Title: CasCollect: targeted assembly of CRISPR-associated operons from high-throughput sequencing data
CasCollect is publicly available for download under the terms of the GNU General Public License version 3 in the GitHub repository (https://github.com/sandialabs/CasCollect).