MATERIALS AND METHODSMetagenomic and metatranscriptomic data setsWe used the human gut-associated strand-specific metatranscriptomic and matched metagenomic data from Franzosa et al. (2014). The data sets were downloaded from the SRA website (SRA accession: SRR769395–SRR769540). In total, we analyzed eight sets of metagenomic and metatranscriptomic data. Each set contains three metagenomic data sets, and three metatranscriptomic data sets derived from the same human individual but were prepared using three different methods of sample preservation (frozen, ethanol-fixed, or RNAlater-fixed) (Franzosa et al. 2014). The eight individuals are X310763260 (abbreviated as X1), X311245214 (X2), X316192082 (X3), X316701492 (X4), X317690558 (X5), X317802115 (X6), X317822438 (X7), and X319146421 (X8).Assembly of CRISPR–Cas systemsIt has been shown that some species are more transcriptionally active relative to their genomic abundance (Franzosa et al. 2014). Combining metatranscriptomic data sets with metagenomic data sets therefore has the chance of improving the assembly of some CRISPR–Cas systems from rare but highly expressed species. We compared the performance of the assembly of CRISPRs using metagenomic data sets only with the assembly using both metagenomic and metatranscriptomic data sets (combined assembly). Also, we applied both the targeted assembly of CRISPRs (Rho et al. 2012) we developed (Seq2CRISPR version 0.9 available at http://omics.informatics.indiana.edu/CRISPR) and “non-targeted” de novo assembly of the microbiomes using soapdenovo2 (Luo et al. 2012) using only metagenomic or combined metagenomic and metatranscriptomic data sets. Instead of using all the reads in a sequence data set, the targeted assembly approach first extracted the reads that contain segments similar to the repeats in reference CRISPRs and then assembled only the pooled reads (so significantly reduced the data set to be assembled).Choice of assembler and k-mer size for the assemblySimilar to most de novo assemblers for short reads, the assembler we used, SOAPdenovo2, is based on de Bruijn graphs of k-mers (each is a short sequence of k nucleotides) (Compeau et al. 2011). It is therefore important to test the impact of choice of k-mer size on the assembly results of CRISPRs. In our previous targeted assembly, we used 43 as the k-mer size for targeted assembly for CRISPR arrays (Rho et al. 2012). Although this parameter works generally well in this study, we found that this parameter is less effective, especially for CRISPR arrays with long repeats such as the CRISPR associated with B. fragilis, which has the longest repeat of 47 bp. So in this study, we systematically tested the size of k-mers, and the results (see Results) show that k-mer size of 53 works generally well. We therefore applied this parameter for targeted assembly of CRISPR arrays and de novo assembly of the metagenomic, or combined metagenomic and metatranscriptomic data sets.Reference collection of CRISPR repeatsWe identified 33 CRISPR–Cas systems from 23 species that were shown to be highly expressed in the previous study (Franzosa et al. 2014). The repeats found in these CRISPR–Cas systems were used in our study as reference for the targeted assembly and characterization of CRISPRs in contigs by CRISPRAlign (Rho et al. 2012) (version 1.4 available at http://omics.informatics.indiana.edu/CRISPR/). We assigned IDs to the CRISPRs according to their associated species and other information: The ID uses five letters from the species name followed by the length of the repeats (length of 36 bp is shown as L36), and the type (subtype) information of the associated CRISPR–Cas system if it is available. For example, the CRISPR found in Lactobacillus casei ATCC 334 contains repeats of 28-bp long and is a type I-B system. It therefore is called LcaseL28-IB.Identification of new CRISPR–Cas systemsFrom de novo assembly results, we can identify the contigs that contain both putative cas loci and CRISPR arrays, contigs that contain either cas loci or CRISPR arrays, and many more contigs that do not contain any. We focus on the contigs that have both putative cas loci and CRISPR arrays considering that they are more likely to represent true CRISPR–Cas systems than other contigs containing only one of the components (although it was shown that there are CRISPRs that are distant from any cas locus).CRISPRs were predicted using CRISPAlign (Rho et al. 2012) against known CRISPR repeats (such that the predicted CRISPRs contain repeats sharing at least 90% sequence identity with one of the known repeats) for reference-based annotation, and metaCRT (Rho et al. 2012) (which we modified from CRT Bland et al. 2007 to allow partial repeats at the ends of contigs) for de novo prediction. FragGeneScan (Rho et al. 2010) was applied to predict protein-coding genes from contigs, and the predicted proteins are used to annotate putative Cas proteins using hmmscan (Zhang et al. 2014) against the collection of 156 families of Cas proteins, including the known ones from a previous study (Makarova et al. 2011), and our newly defined Cas families from the human microbiomes (using a combination of context-based and similarity-search approaches). The type of CRISPR–Cas system was assigned using type signature cas genes (Makarova et al. 2011; Chylinski et al. 2014).We then expanded the collection of CRISPR repeats from 33 reference repeats to 137 repeats. This expanded collection was used for identification of more CRISPR arrays, including those found in contigs that only contain the CRISPRs but no cas genes.Quantification of CRISPR expression and detection of transcription directionWe mapped the RNA-seq reads to the assembled contigs that contain putative CRISPR (and cas genes) using Bowtie2 (Langmead and Salzberg 2012) and then summarized the expression of CRISPRs using mapped reads. Because strand-specific RNA-seq does not usually achieve 100% strand specificity (Sigurgeirsson et al. 2014), for a CRISPR with transcription only in one direction, we may find reads suggesting transcription from the other direction as well. Similar to the statistical approach we developed for detecting antisense transcripts to CDS (Bao et al. 2015), we applied binominal tests using a success rate of 0.05 to check if the observation of transcriptions from both directions is likely to be a consequence of the imperfect strandedness of the RNA-seq experiment or is more likely to represent the bidirectional transcription of the CRISPR. Specifically, we use binomial testing to detect CRISPRs with transcripts in both directions that are unlikely to result from such artifacts: Let P be the probability of having reads from one strand even though there is no real transcription in this strand (so real transcription occurs in the opposite strand). A total of c reads are sequenced from the CRISPR (c is approximated as the number of reads that can be mapped to the array), among which m reads represent transcripts from the strand opposite to the main direction. The null hypothesis is that there is no bidirectional transcription from this CRISPR. We use the binomial test in R (binom.test) to calculate the probability of having c reads (the number of successes) out of m trials (a total of m reads) with a success rate of P. If the probability is low (≤0.05 according to one-tailed binomial test), we consider that the CRISPR has bidirectional transcription (the alternative hypothesis). We used P = 0.05, since most of the metatranscriptomic data sets have less than this ratio of antisense reads (to protein-coding genes) (Bao et al. 2015), and it was shown that most library treatments in RNA-seq have a strandedness of >95% (Sigurgeirsson et al. 2014).
Article TitleCharacterization of CRISPR RNA transcription by exploiting stranded metatranscriptomic data
CRISPR–Cas systems are bacterial adaptive immune systems, each typically composed of a locus ofcasgenes and a CRISPR array of spacers flanked by repeats. Processed transcripts of CRISPR arrays (crRNAs) play important roles in the interference process mediated by these systems, guiding targeted immunity. Here we developed computational approaches that allow us to characterize the expression of many CRISPRs in their natural environments, using community RNA-seq (metatranscriptomic) data. By exploiting public human gut metatranscriptomic data sets, we studied the expression of 56 repeat-sequence types of CRISPRs, revealing that most CRISPRs are transcribed in one direction (producing crRNAs). In rarer cases, including a type II system associated withBacteroides fragilis, CRISPRs are transcribed in both directions. Type III CRISPR–Cas systems were found in the microbiomes, but metatranscriptomic reads were barely found for their CRISPRs. We observed individual-level variation of the crRNA transcription, and an even greater transcription of a CRISPR from the antisense strand than the crRNA strand in one sample. The orientations of CRISPR expression implicated by metatranscriptomic data are largely in agreement with prior predictions for CRISPRs, with exceptions. Our study shows the promise of exploiting community RNA-seq data for investigating the transcription of CRISPR–Cas systems.