Materials and Methods

High viral abundance and low diversity are associated with increased CRISPR-Cas prevalence across microbial ecosystems

Computational pipeline

An overview of the computational pipeline is provided in Figure S1.

Collection of samples and assembly

Paired-end metagenomic libraries (sequenced with Illumina platform) were downloaded from the NCBI SRA database (Table S1). Libraries were processed with BBMap suite of tools for error correction 12 and for each metagenomic library a representative FASTA file was created by combining both the merged and unmerged reads. A total of 332 libraries (each library containing at least 1 million reads with a minimum length of 100 nucleotides (nt) and insert size >=150) were selected to represent a wide range of biome diversity (table S1). Libraries were assembled using MegaHit (version 1.1.3) 13 with default parameters and contigs with minimum length 200nt were retained. Contigs were classified as archaea, bacteria or virus using a purpose-built classification tool: MIUMS (Microbial Identification Using Marker Sequence, MIUMS is designed to classify contigs based on a reference database containing protein sequence fragments highly specific to bacteria, archaea, viruses. Full details of MIUMS reference database construction and prediction process are given in the supplementary information. Each metagenome was then subsampled to 1 million randomly selected reads with a minimum length of 100 nt.

Generation of archaeal and bacterial abundance tables

Subsampled reads were screened using metaxa2 14 (GSU parameters: -g ssu -f f, LSU parameters: -g lsu -f f) to generate a table of bacterial and archaeal abundances. Both the LSU and SSU based methods were used. A reference sequence database was made from contigs classified as viral by MIUMS. Subsampled reads were then mapped to the assembled contigs using Magic-BLAST 15 (parameters: -no_unaligned -no_query_id_trim -perc_identity 95 -outfmt tabular). Reads with a minimum of 95% sequence identity and coverage were used.

CRISPR array identification

Accurately identifying CRISPR arrays in metagenomic data is challenging for a number of reasons. Firstly, a large proportion of the direct repeats (DRs) identified from metagenomic contigs often show little sequence similarity to the CRISPR repeats found in published genomes and lack an isolated representative 2. Secondly, CRISPR arrays found in metagenomic reads are generally short (i.e. < 3 DRs) and often missing one or both flanking regions. To overcome these issues we combined information on existing, published genomes and their CRISPR arrays as well de novo extraction of putative CRISPR arrays from our assembled contigs (Fig. S1). A database of metagenomic CRISPR arrays was first constructed by processing all assembled contigs with CRISPRDetect version 3 (CRISPRDetect3, CRISPRDetect version 3 was also modified to allow prediction of shorter CRISPRs (e.g. partial/broken CRISPRs with as little as 1.5 repeats). A higher CRISPR likelihood score cut-off of 4.5 was used instead of the default score cut-off of 3 to reduce potential non-CRISPR arrays. CRISPRDetect 16 uses several CRISPR elements (e.g. repeats, spacers, cas genes, AT composition of flanking regions etc.) from published genomes to identify and separate true CRISPRs from other genomic repeats. In this study, a modified version of the CRISPRDetect tool was used, which uses a reference repeat database created using the cluster representative DRs from the metagenomic contigs as well as DRs found in published genomes. Predicted CRISPR arrays were checked to ensure that the total array degeneracy (i.e. number of insertion, deletion, mutation or presence of Ns in the array) was less than the total number of DRs in the array, which resulted in 51395 CRISPR arrays. Direct repeat sequences (23 to 60 nt) were extracted and clustered with cd-hit-est (parameters: -n 3 -c 0.90 -aL 0.90 -aS 0.90) 17, which resulted in 33745 repeat clusters. A vast majority of these clusters (i.e. 22808) consisted of a single DR. This low level of redundancy was important to ensure the successful subsequent mapping of reads to DRs.

Subsampled reads were then screened against this database using metaCRISPRDetect (, which supports rapid identification of CRISPR arrays in short reads using user-provided reference repeat database as an extension of CRISPRDetect 16. Arrays with a likelihood score > 3 were added to the existing CRISPR reference database. Subsampled reads were then mapped to the reference database using blastn 18. Abundance tables were generated for all spacers and direct repeat sequences by tallying the number of reads that mapped to a given DR with 100% sequence identity and coverage.

Identifying potential false positive CRISPRs

CRISPRs predicted from metagenomes are generally short, often incomplete and missing flanking regions which makes it hard to distinguish true CRISPRs from other genomic repeats. To measure how many of our identified arrays occurred in known prokaryotic genomes we compared the metagenomic CRISPR repeats against CRISPRs found in RefSeq and GenBank prokaryotic sequences (sequences published before September 9, 2019) using NCBI blast (parameters: -task blastn -wordsize 11 -dust no -culling_limit 1 -num_alignments 1) 19. Metagenomic DRs with >=90% identity and >=90% sequence coverage against RefSeq or GenBank DRs were considered as a positive match. Out of the 7562 repeat clusters 1649 were found in CRISPRs predicted from RefSeq or GenBank prokaryotic sequences. Similarly, the metagenomic repeats were screened against an _in silico generated set of eukaryotic reads from 6 eukaryotic reference genomes (Table S2). Ten thousand 250bp paired reads were generated from each reference genome using WGSIM ( Eukaryotic reads were subsampled to an equal depth to remove differences due to genome size. Out of the 7562 DR clusters a total of 168 repeat clusters were found in the eukaryotic reads suggesting a high level of eukaryotic sequence contamination may increase the false positive rate of our analysis. We therefore removed samples where > 10% of reads were classified as of eukaryotic origin.

Virome analysis

To assess the diversity of viruses in the different environments, we analysed all MIUMS classified contigs from the 332 metagenomes. Across all datasets, we identified 2583 archaeal prophages, 375179 bacterial prophages, 1218279 bacteriophages, and 174792 archaeal viruses. Per metagenome, the sub-sampled reads used for the CRISPR quantification analysis were mapped to the total set of all viral contigs with bwa mem 20. To quantify the genetic diversity of the viral community, we calculated the richness as the total number of detected viral contigs, plus the evenness and Shannon’s diversity index based on relative abundances calculated from the mean depth of coverage of all detected viral contigs. The intra-population diversity (micro-diversity) was calculated as the mean heterozygosity of viruses in the community by averaging Nei’s per-nucleotide diversity index across all detected contigs 21. where pj is the frequency of allele j at the position i of a contig of the length L, N is the number of viruses in a dataset. Single nucleotide variability was assessed with VarScan software 22.

Viral abundance was calculated by first collecting the average depth values for all viral contigs in each sample using the ‘jgi_summarize_bam_contig_depths’ script from the Metabat package 23. These depth scores were then summed per sample to give an approximation of overall viral abundance relative to bacterial abundance.

Statistical analysis

For each sample, the number of reads that mapped to a direct repeat were counted to give a measure of CRISPR array abundance per sample. Earth Microbiome Project Ontology levels were assigned using the framework previously described 24(Fig 1D) based on the associated metadata from the NCBI short read archive (SRA). In the case of bioreactor samples, a literature search was conducted to identify the original material described in the associated study (see Table S1).

General linear models (GLM) were constructed for each of the reported correlations. In each case log10 transformations were applied to conform to model assumptions. Checks of model residuals were performed to assess model fit. Significance was determined using F-tests between null models and those containing the variable or interaction of interest.

Microbial community assessment

CRISPR is more common in archaea. Therefore, in order to minimise any phylogenetic effects deriving from high archaeal abundances in samples we estimated the number of archaeal reads in the subsampled reads using metaxa2 14 and converted this to relative abundance per sample. We then included this value as a fixed effect in additional GLMs to check the influence of archaeal abundance on our results. For additional phylogenetic assessments, reads classified as non-prokaryotic were removed and relative abundances were generated using the output from metaxa2 in order to assess effects of community composition. Permutational ANOVAs were performed on species abundance matrices using Bray-Curtis dissimilarity and CRISPR abundance as explanatory variable with 9999 permutations, using the function ‘adonis’ from ‘vegan’ package 25. Visualisation of clustering analyses, at the class level, was performed using non-metric multidimensional scaling (NMDS) with Bray-Curtis dissimilarity through the ‘metaMDS’ function in ‘vegan’ 25. Smooth surfaces were fit to these points via a generalized additive model (GAM), using either CRISPR abundance as the explanatory term, with the ‘ordisurf’ function in ‘vegan’25.

Article TitleHigh viral abundance and low diversity are associated with increased CRISPR-Cas prevalence across microbial ecosystems


CRISPR-Cas are adaptive immune systems that protect their hosts against viruses and other parasitic mobile genetic elements. Consequently, selection from viruses and other genetic parasites is often assumed to drive the acquisition and maintenance of these immune systems in nature, but this remains untested. Here, we analyse the abundance of CRISPR arrays in natural environments using metagenomic datasets from 332 terrestrial, aquatic and host-associated ecosystems. For each metagenome we quantified viral abundance and levels of viral community diversity to test whether these variables can explain variation in CRISPR-Cas abundance across ecosystems. We find a strong positive correlation between CRISPR-Cas abundance and viral abundance. In addition, when controlling for differences in viral abundance, we found that the CRISPR-Cas systems are more abundant when viral diversity is low. We also found differences in relative CRISPR-Cas abundance among environments, with environmental classification explaining ∼24% of variation in CRISPR-Cas abundance. However, the correlations with viral abundance and diversity are broadly consistent across diverse natural environments. These results indicate that viral abundance and diversity are major ecological factors that drive the selection and maintenance of CRISPR-Cas in microbial ecosystems.

Login or Signup to leave a comment
Find your community. Ask questions. Science is better when we troubleshoot together.
Find your community. Ask questions. Science is better when we troubleshoot together.

Have a question?

Contact or check out our support page.