Materials and Methods

CRISPR Spacers Indicate Preferential Matching of Specific Virioplankton Genes

CASC pipeline. The CASC pipeline can be broadly divided into two parts (Fig. S1): (i) a preliminary search for putative CRISPR spacers and (ii) validation of putative CRISPR arrays by Cas protein homology, CRISPR repeat homology, and the statistical characteristics of spacer sizes. The preliminary search for CRISPR arrays employs a modified version of the CRT (16). Modifications included reformatting of the search output, improved handling of multi-FASTA files, and the ability to use multiple central processing units (CPUs) to reduce computational run time. These modifications improved the ability of CRT to analyze large metagenomic data sets. Putative CRISPR arrays are then validated and deemed “bona fide” CRISPRs if any of the following conditions are met: (i) the sequence containing the candidate CRISPR array has a BLASTx match (E value ≤ 1e−12) to a known UniRef100 Cas protein cluster (43), (ii) the candidate CRISPR repeat has a BLASTn match (E value ≤ 1e−5; word size = 4) to a known CRISPR repeat from the CRISPRdb reference database (7), or (iii) the standard deviation of spacer length within the candidate CRISPR array is ≤2 bp. CASC offers “conservative” and “liberal” CRISPR validation modes. In conservative mode, condition (i) or (ii) must be met, whereas in liberal mode, any of conditions (i), (ii), or (iii) may be met. CASC is available on GitHub (https://github.com/dnasko/CASC).

FIG S1. The CASC workflow. (A) Preliminary search for CRISPR arrays and identification of putative spacer arrays. (B) Validation of putative spacers. Copyright © 2019 Nasko et al. This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.

FIG S2. Nucleotide position histogram of CRISPR repeats deemed bona fide by CASC (A), all CRISPR repeats from CRISPRdb (B), and CRISPR repeats deemed non-bona fide by CASC (C).
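The validation rules described for the CASC pipeline can be illustrated with a minimal sketch. This is not the CASC implementation: the function name `is_bona_fide`, its parameters, and the use of the population standard deviation are assumptions made for illustration, and the two homology results are taken as precomputed booleans rather than derived from BLAST output.

```python
from statistics import pstdev


def is_bona_fide(spacer_lengths, has_cas_hit, has_repeat_hit,
                 mode="conservative", max_sd=2.0):
    """Hypothetical helper: decide whether a candidate array passes validation.

    spacer_lengths -- spacer lengths (bp) in the candidate array
    has_cas_hit    -- condition (i): BLASTx match to a Cas protein cluster
    has_repeat_hit -- condition (ii): BLASTn match to a known CRISPR repeat
    mode           -- "conservative" uses (i) or (ii); "liberal" also allows (iii)
    max_sd         -- condition (iii) threshold on spacer-length SD (bp)
    """
    if mode not in ("conservative", "liberal"):
        raise ValueError("mode must be 'conservative' or 'liberal'")
    # Conditions (i) and (ii) are sufficient in both modes.
    if has_cas_hit or has_repeat_hit:
        return True
    # Condition (iii) is only considered in liberal mode.
    # Assumption: population SD; the text does not say which estimator is used.
    if mode == "liberal" and len(spacer_lengths) >= 2:
        return pstdev(spacer_lengths) <= max_sd
    return False
```

Under these assumptions, an array with uniform spacer lengths and no homology support passes only in liberal mode, which matches the intent of using conservative mode when spurious spacers must be minimized.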
Simulated metagenome construction. Two shotgun sequence simulations were generated using Grinder (version 0.5.0) (44) to validate CASC and assess its performance. Ten complete bacterial genomes were selected for the simulated metagenomes (Table S1), five of which contained CRISPR arrays. The first simulation generated 60 million paired-end 150-bp Illumina reads (read_dist = 150 normal 0; insert_dist = 300; mutation_dist = poly4), and the second simulation generated one million 454 pyrosequencing reads (read_dist = 450 normal 50; mutation_dist = poly4).

TABLE S1. Bacterial genome sequences used in the construction of the mock metagenomes.

TABLE S2. Spacers found in the artificial 454 pyrosequencing metagenome using available CRISPR discovery tools.

TABLE S3. Spacers found in the artificial Illumina metagenome using available CRISPR discovery tools.

The Illumina simulated read pairs were assembled using the St. Petersburg genome assembler (SPAdes) version 3.5.0 (19) with default settings, except that the preassembly read error correction step was bypassed. The 454 simulated reads were not assembled, and CRISPRs were predicted directly from the reads.

Performance validation.
The known CRISPR array positions in five of the 10 genomes were used to assess the performance (i.e., sensitivity and precision) of several CRISPR identification algorithms. Alignment of the Illumina-assembled contigs against the reference genomes identified the position of each CRISPR locus on the contigs and indicated that all spacers were successfully assembled. The alignment-generated CRISPR positions on the contigs were then used as the known CRISPR array positions. CRISPR array positions within the 454 reads were determined using the genome coordinates provided by Grinder.

Several algorithms, including CASC version 2.5 and the default settings of metaCRT (a version of CRT modified by Rho and colleagues) (45), PILER-CR (version 1.06) (20), and CRISPRFinder (21), were used to predict CRISPR arrays from the Illumina-assembled contigs and 454 reads (Tables S2 and S3). The predicted spacers from each program were clustered with the set of known spacers using cd-hit-est (version 4.6) (46). Spacers clustering at 100% identity with a known spacer were counted as true positives.

To better measure the abundance of spacers in the simulated Illumina metagenome, the simulated Illumina reads were recruited to the assembled SPAdes contigs using Bowtie2 (version 2.1.0) (47). The coverage of each spacer was calculated using SAMtools (version 1.2-2-gf8a6274) (48) and used to estimate the number of spacer copies present in the simulated Illumina metagenome.

Spacer predictions in GOS and Tara Oceans microbial metagenomes. The Global Ocean Sampling (GOS) and Tara Oceans expeditions sampled and sequenced microbial DNA from across the world’s oceans (17, 18). The GOS data set was well suited for CRISPR prediction, as the long-read technology used for sequencing these libraries was capable of capturing intact CRISPR arrays (49), and this data set has been used in previous studies of CRISPR prediction from metagenomic data (50, 51).
GOS sequences were downloaded from iMicrobe (https://www.imicrobe.us/) and included the GOS I expedition, GOS Baltic Sea, and GOS Banyoles (Data Set S1). CRISPR spacers were predicted, with CASC in liberal mode, from 157 GOS sequence libraries totaling ca. 39 million reads and containing ca. 21 Gbp of genomic DNA from microorganisms typically between 0.1 and 0.8 µm in size (note that filter sizes ranged from 0.002 to 20 µm depending on the sample site).

The Tara Oceans expedition was a global-scale oceanic study that sampled and sequenced metagenomes from 67 sites (52). In addition to sampling nearly every site at various depths, several sites were processed with multiple filter sizes (ranging from 0.2 to 3.0 µm), including 54 sites with paired microbial and viral fractions, making the Tara Oceans data set ideal for linking bacterial spacers with their viral gene targets in the viromes. Tara Oceans metagenomes were predominantly sequenced using the Illumina HiSeq platform (100-bp paired-end reads). Because Illumina reads are too short for accurate searches of spacer arrays, assembled contigs were used instead (ca. 58 million contigs totaling 62 Gbp). Tara Oceans assembled contigs were obtained from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/about/tara-oceans-assemblies).

In addition to counting the number of spacers found within each Tara Oceans contig, it was necessary to calculate the abundance of each spacer by recruiting the original library of unassembled Illumina reads to the Tara Oceans contigs. The reads corresponding to each assembly were downloaded from NCBI’s Sequence Read Archive and recruited to their assembled contigs using Bowtie2 (very sensitive local setting).
The read coverage of each spacer was calculated using SAMtools and used as a proxy for the number of copies of each spacer.

To measure how novel these spacers were, the GOS and Tara Oceans spacers were clustered with known spacers from the CRISPRdb at 98% identity using cd-hit-est (7, 46).

Microbial community profiles with respect to CRISPR abundance. The Tara Oceans observed OTUs “16S OTU Table” from Sunagawa et al. (22) was downloaded from http://ocean-microbiome.embl.de/companion.html and imported into QIIME (53). OTUs occurring ≤2 times were filtered out, and 100 jackknife subsamples were created with 35,461 observations (90% of the smallest sample) in each. The community similarity test was performed with beta_diversity.py using Bray-Curtis. Per-OTU correlations were calculated for each depth zone after splitting the BIOM file accordingly and using observation_metadata_correlation.py. Only correlations with a Mantel’s or Pearson’s r of ≥0.3 or ≤−0.3 and a P value of ≤0.05 (with Bonferroni correction) were considered significant.

Identification of GOS and Tara Oceans spacer targets. Putative CRISPR spacers from the GOS and Tara Oceans microbial metagenomes were searched against Tara Oceans viromes (Data Set S3) and a subset of publicly available aquatic viromes (Data Set S4) available on the Viral Informatics Resource for Metagenome Exploration (VIROME; http://virome.dbi.udel.edu) (54) to identify candidate viral gene targets. Only spacers found with CASC in conservative mode were used for this analysis to reduce the likelihood of including spurious spacers.

DATA SET S3. Summary of Tara Oceans viromes.

DATA SET S4. Summary of aquatic viromes collected from VIROME (http://virome.dbi.udel.edu).
DATA SET S5. Actual versus expected number of annotations for candidate microbe-virus protospacers.

DATA SET S6. Actual versus expected number of annotations for candidate virus-virus protospacers.

Sequence alignment cutoffs used in previous studies comparing microbial spacers to virome genes have varied in both stringency and cutoff metric, depending on the aim of the study. When identifying host-phage interactions by linking specific viral population(s) to CRISPR spacers/loci, more stringent cutoffs are applied, such as requiring a 100% nucleotide identity alignment of ≥20 bp (11) or an alignment with no more than one mismatch (55). Exploratory studies asking what, if any, similarities exist between microbial spacers and virome genes have used more relaxed cutoffs, such as an E value of ≤1e−3 (10) or alignments containing up to 15 mismatches (56).

As the objective of this study was to determine whether particular viral genes were more likely to be targeted by the CRISPR system of marine bacterioplankton, cutoffs commonly used in exploratory studies were applied. Spacer sequences are highly diverse and hypervariable, even between closely related species (57), making it challenging to identify candidate viral gene targets at the nucleotide level. Thus, when searching for potential viral gene targets in viromes, some mismatches and gaps in the nucleotide alignment were permitted using BLASTn (version 2.2.30+; E value ≤ 1e−1; word size = 7).
This resulted in 51% of high-scoring segment pairs (HSPs) with no mismatches and 89% of HSPs with no gap openings (Fig. S3).

FIG S3. Alignments between spacers and viral ORFs were typically strong. (A) Nearly 95% of HSPs had 3 or fewer mismatches in alignments of spacers to viral ORFs. (B) Nearly 98% of HSPs had 1 or no gap openings in alignments between spacers and viral ORFs.

In this analysis, some spacers matched CRISPR arrays within several viromes. To limit these spurious matches, CASC (liberal mode) was used to identify putative spacer arrays within the viromes. Sequences containing an array were then removed from the aquatic virome database before the analysis to identify viral gene targets.

Spacer sequences were searched against the virome database with BLASTn. Virome sequences that aligned with spacers were then extracted into a separate FASTA file, and open reading frames (ORFs) were predicted using MetaGene (58). ORFs were predicted after the spacer search to detect any spacers that may have spanned virome ORFs (a rare occurrence). Virome ORFs with a match to a spacer were translated and searched against Phage SEED (version 01-May-2016; http://www.phantome.org) using BLASTp (version 2.2.30+; E value ≤ 1e−3). Each ORF was annotated using the best cumulative bit score, which is described in the next section.

Finally, great-circle distances between microbial metagenome spacers and viral gene targets (VGTs) within viromes were calculated in R (59) using the geosphere package (60). Distance distributions were rendered as violin plots using the R package vioplot.

Annotating virome ORFs and calculating expectation. Virome ORFs with a match to a spacer were translated and searched against Phage SEED (version 01-May-2016; http://www.phantome.org) using BLASTp (version 2.2.30+; E value ≤ 1e−3).
A virome ORF was annotated with the gene function producing the highest cumulative bit score. For example, if “ORF_1” hit 10 Phage SEED genes, eight of which were hits to phage protein with a total bit score of 50, while the two remaining hits were to terminases with a total bit score of 100, “ORF_1” would be assigned to terminase. ORF annotation counts were generated for the virome ORFs matching microbial (Data Set S5) and virome (Data Set S6) spacers.

To put these counts in context, all aquatic virome ORFs were run through the same Phage SEED-based annotation pipeline. Counts for all virome ORFs were tabulated, and the frequency of occurrence for each gene type was calculated. The expected number of genes of a given type with matches to CRISPR spacers was calculated by multiplying the total number of genes matching spacers by the frequency with which that gene type was annotated in all aquatic viromes.

Data availability. Scripts used in this analysis are available on GitHub (github.com/dnasko/CASC) under the GNU General Public License.

Six data sets were used in this analysis. The first two were simulated metagenomic data sets and are available at Zenodo (https://doi.org/10.5281/zenodo.1650429). The next two data sets were shotgun metagenomic reads from the Global Ocean Sampling (GOS) expedition and the Tara Oceans survey. GOS sequences were downloaded from iMicrobe (imicrobe.us) and included the GOS I expedition, GOS Baltic Sea, and GOS Banyoles (Data Set S1). Tara Oceans assembled contigs were obtained from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/about/tara-oceans-assemblies). The fifth data set was a subset of publicly available aquatic viromes (Data Set S4) available on the Viral Informatics Resource for Metagenome Exploration (VIROME; http://virome.dbi.udel.edu). Finally, the Tara Oceans observed OTUs “16S OTU Table” from Sunagawa et al. (22) was downloaded from http://ocean-microbiome.embl.de/companion.html.
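As an illustration of the steps described under “Annotating virome ORFs and calculating expectation,” the following sketch shows one way the cumulative bit score assignment and the expected-count calculation could be written. It is not taken from the CASC scripts: the function names `annotate_orf` and `expected_counts`, the input shapes, and the tie-breaking behavior (first function encountered wins) are assumptions made for illustration.

```python
from collections import defaultdict


def annotate_orf(hits):
    """Assign an ORF the function with the highest cumulative bit score.

    hits -- iterable of (function_label, bit_score) pairs, one per BLASTp hit.
    Returns None when the ORF has no hits.
    """
    totals = defaultdict(float)
    for function, bit_score in hits:
        totals[function] += bit_score  # sum bit scores per function
    if not totals:
        return None
    # Assumption: ties are resolved in favor of the first function seen.
    return max(totals, key=totals.get)


def expected_counts(background_counts, n_spacer_matched):
    """Expected number of spacer-matched genes per function.

    background_counts -- {function: count} over all annotated virome ORFs
    n_spacer_matched  -- total number of genes matching spacers
    """
    total = sum(background_counts.values())
    # expected = (total spacer-matched genes) x (background frequency)
    return {function: n_spacer_matched * count / total
            for function, count in background_counts.items()}


# The worked example from the text: eight "phage protein" hits summing to 50
# and two "terminase" hits summing to 100.
orf_1_hits = [("phage protein", 6.25)] * 8 + [("terminase", 50.0)] * 2
orf_1_label = annotate_orf(orf_1_hits)
```

Here `orf_1_label` is "terminase", because the summed bit score decides the assignment rather than the number of hits. Comparing the observed annotation counts with the output of `expected_counts` gives the actual versus expected tables summarized in Data Sets S5 and S6.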

Article Title: CRISPR Spacers Indicate Preferential Matching of Specific Virioplankton Genes

Abstract

Viral infection exerts selection pressure on marine microbes, as virus-induced cell lysis causes 20 to 50% of cell mortality, resulting in fluxes of biomass into oceanic dissolved organic matter. Archaeal and bacterial populations can defend against viral infection using the clustered regularly interspaced short palindromic repeat (CRISPR)-associated (Cas) system, which relies on specific matching between a spacer sequence and a viral gene. If a CRISPR spacer match to any gene within a viral genome is equally effective in preventing lysis, no viral genes should be preferentially matched by CRISPR spacers. However, if there are differences in effectiveness, certain viral genes may demonstrate a greater frequency of CRISPR spacer matches. Indeed, homology search analyses of bacterioplankton CRISPR spacer sequences against virioplankton sequences revealed preferential matching of replication proteins, nucleic acid binding proteins, and viral structural proteins. Positive selection pressure for effective viral defense is one parsimonious explanation for these observations. CRISPR spacers from virioplankton metagenomes preferentially matched methyltransferase and phage integrase genes within virioplankton sequences. These virioplankton CRISPR spacers may assist infected host cells in defending against competing phage. Analyses also revealed that half of the spacer-matched viral genes were unknown, some genes matched several spacers, and some spacers matched multiple genes, a many-to-many relationship. Thus, CRISPR spacer matching may be an evolutionary algorithm, agnostically identifying those genes under stringent selection pressure for sustaining viral infection and lysis. Investigating this subset of viral genes could reveal those genetic mechanisms essential to virus-host interactions and provide new technologies for optimizing CRISPR defense in beneficial microbes.

