Materials and Methods

CRISPRstrand: predicting repeat orientations to determine the crRNA-encoding strand at CRISPR loci

2 MATERIALS AND METHODSWe present a linear discriminative model based on graph kernels to accurately predict the orientation of the CRISPR sequence. The method first generates a sequence alignment of all repeat instances in the CRISPR array and outputs the consensus repeat sequence in its predicted orientation and whether it lies on the forward or reverse strand. There are two core ideas underlying our approach. The first one is to use a combinatorial technique to extract a very large number of features. The second idea is to encode our knowledge about the problem as a directed graph with discrete labels. The first idea allows a predictive system to be very accurate and to express complex discriminative decisions; the second idea allows a natural and flexible encoding of background knowledge.2.1 Novel comprehensive identification of CRISPR lociWe extracted a comprehensive dataset of CRISPR loci from published archaeal and bacterial genomes. All genome sequences were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/). We predicted CRISPR loci using CRISPRFinder (Grissa et al., 2007) and CRT (Bland et al., 2007). For both tools, we used (i) default parameter values for predicted CRISPR loci and (ii) parameters that corresponded to at least two repeats within a CRISPR locus; repeat and spacer lengths were set to a range between 18 and 78 bp. We then (iii) generated a consensus repeat for each CRISPR locus exploiting the fact that repeats within a CRISPR locus are almost completely identical with some loci that carry few mutations, preferably at the start and end (see Supplementary Figure S3, Supplementary material). Because CRT does not output consensus repeats, we used the MAFFT program (Katoh et al., 2002), version 6.4., to compute the multiple alignments and the Cons program from EMBOSS package (Rice et al., 2000) to obtain the consensus repeat from the multiple sequence alignments. Finally, (iv) the results from both CRISPRFinder and CRT tools were merged and redundant CRISPR loci were removed. In this way, we obtained a CRISPR databases with >4700 consensus repeats, which we refer to as REPEATS (see Table 1 for details). Table 1.Summary of our REPEATS dataset derived from all available CRISPR lociData statisticsArchaeaBacteriaGenomes (total)3094590(4899)Genomes with CRISPRs (%)217(70)1409(30)CRISPRs on forward strand5161810(2326)CRISPRs on reverse strand5301859(2389)Repeats per array (median)2–198(20)2–1371(16)Repeat lengths (median)20–44(29)19–48(30)Spacer lengths (median)20–54(38)19–72(35)Open in a separate window2.2 Datasets from literature2.2.1 Set of repeats from Lange et al. (2013) We selected structural motifs that fit to known cleavage sites (Lange et al., 2013). Table 2 gives a summary of published CRISPR-Cas systems with experimental evidence for the processing mechanism which we refer to as REPEATSLange. This dataset contains 324 bacterial and 118 archaeal repeat sequences (442 in total). Table 2.Summary of REPEATSLange dataset: published CRISPR-Cas systems with experimental evidence of the processing mechanismOrganismMotifCas subtypeSummaryEscherichia coli K12M2I-EStructure predicted, but stable; 8-nt-5′-tag; cleavage by Cas6e, biochemical experiments (Brouns et al., 2008)Thermus thermophilus HB8M2I-EStructured; 8-nt-5′-tag; cleavage by Cas6e; crystal structure of repeat hairpin in Cas6e (Cse3) (Gesner et al., 2011; Juranek et al., 2012; Sashital et al., 2011)Bacillus halodurans C-125M3I-CCleavage by Cas5d; 11-nt-5′-tag mutational analysis of hairpin structure (Nam et al., 2012)Pseudomonas aeruginosa UCBPP-PA14M4I-FCleavage by Cas6f (Csy4); 8-nt-5′-tag; crystal structure and mutational analyses of repeat hairpin in Cas6f (Haurwitz et al., 2010, 2012; Sternberg et al., 2012)Synechocystis sp. PCC6803M5I-DIII-variantCleavage by Cas6; 8-nt-5′-tag; biochemical experiments, extended structure prediction of hairpin motif (Scholz et al., 2013)Thermus thermophilus HB27M9I-CCleavage by Cas5d; 11-nt-5′-tag biochemical experiments (Garside et al., 2012)Methanosarcina marzei Gö1M13I-B III-BCleavage by Cas6b; 8-nt-5′-tag; structure probing experiment of hairpin (Nickel et al., 2013)Synechocystis sp. PCC6803M14III-variantBiochemical analysis of Cmr2 implicate its involvement in either cleavage, crRNA stabilization, or array expression regulation; 13-nt-5′-tag (Scholz et al., 2013)Staphylococcus epidermidis RP62AM28III-ACleavage by Cas6; 8-nt-5′-tag; hairpin structure as in M28 verified by mutational analysis and sequence specificity around cleavage site (Hatoum-Aslan et al., 2011)Methanococcus maripaludis C5M29I-BCleavage by Cas6b; 8-nt-5′-tag; biochemical experiments (Richter et al., 2012)Open in a separate windowNote: In particular, these are systems for which (i) the Cas endoribonuclease has been characterized and/or (ii) the repeat structure has been verified. Published results are consistent with the data of Lange et al. (2013).2.2.2 Set of repeats from Kunin et al. (2007) We denote the dataset originally published in Kunin et al. (2007) as REPEATSKunin. The dataset contains 327 bacterial and 92 archaeal repeat sequences (419 in total). The orientations were assigned by the authors using previously published sequence features.2.2.3 Set of archaeal repeats from Shah and Garrett (2011) We denote the dataset based on the results available in (Shah and Garrett, 2011) as REPEATSShah. This dataset contains 478 archaeal repeat sequences with manually verified strand orientation.2.3 Encoding CRISPR repeats as graphsThe features used to discriminate between the different orientation are based on available biological knowledge of CRISPR evolution and processing. During CRISPR RNA processing by Cas6-like endoribonucleases, cleavage occurs either at the 3′-end base of the hairpin motif, or within the double-stranded region of the hairpin stem, usually below a C → G base pair (Barrangou and van der Oost, 2013; Richter et al., 2012; Scholz et al., 2013). The product of this cleavage is an 8-nt-long AUUGAAA(N) repeat tag at the 5′-end of the mature crRNA (5′-tag), which corresponds to the last eight nucleotides from the 3′-end of the repeat sequence. Kunin et al. (2007) and Lange et al. (2013) showed that in some cases the four nucleotides AAA(N) motif can be used to identify the orientation. These observations lead to the hypothesis that the terminal region of the sequence, comprising four or eight nucleotides, plays a key role. We observed also that the mutation rate in various parts of the CRISPR locus is non-uniform, in particular the middle part of the CRISPR locus is more conserved. This finding motivated the idea of using the presence of mutations as an additional signal to detect the predominantly transcribed strand.We made use of this background knowledge to partition the consensus repeat into specific informative parts: we distinguish terminal regions of identical size k at both ends (as the correct orientation is unknown) and a central variable length area. The terminal sequences are further partitioned into P equally sized parts, where we expect to find key motifs. We call each part block. One of the main signals that we used to define the number and size of the blocks is the mutation rate, defined as the fraction of mutations per nucleotide in each block. In Supplementary Figure S3, we report the mutation rate for the CRISPR locus partitioning with k = 8 and P = 2 on a dataset of 897 CRISPR arrays (Kunin et al., 2007; Shah and Garrett, 2011): each repeat is split into five adjacent regions, with terminal blocks spanning exactly 4 nucleotides and a central block spanning 12 nucleotides on average. In these settings, we observed a highly significant 4-fold and a 16-fold increase in the mutation rate in the initial 8 nucleotides and in the terminal block, respectively, as compared to the middle block. In Section 3.1, we have further validated the optimality of this partitioning with in silico simulations.We encoded all our intuitions and knowledge on the relevant signals that a predictive model should be aware of in a graph data structure. The reason for this choice is 2-fold. First, we want an easy and natural way to inject different types of information in the problem solution, and, second, we want to exploit efficient techniques developed in the Machine Learning literature to automatically construct a large number of derived features to improve the accuracy of predictive models.The graph formalism allows us, in a very natural and flexible way, to add knowledge by inserting informative entities as vertices and connecting them to the relevant parts of the current encoding via the edge notion. In our case, the information provided by the consensus sequence is modelled directly as a path graph with vertices labelled with the consensus nucleotide code (see Fig. 2). We then model the global localization information as additional vertices with a label that indicates the block identity. This reveals whether a nucleotide is located at the very beginning or just near the beginning of the sequence (and symmetrically for the opposite end). Furthermore, we consider a more fine grained localization information, identifying the specific position of a nucleotide within a block. The reason to encode an increasingly refined localization information is to allow the algorithm to choose the optimal level of detail needed in various parts of the sequence. Finally, the main piece of information is whether there is evidence of a mutation at a specific location; we model this with an additional vertex labelled with a binary code to indicate the presence of a mutation in at least one of the repeated sequences. Open in a separate windowFig. 2.Graph encoding the consensus repeat sequence. The consensus nucleotide information is represented as a path graph, and additional information is modelled as a chain of additional vertices. The terminal parts of the repeat are marked with block identifiersThe final modelling decision regards the topology of the graph, i.e. how the additional vertices, which encode the different types of information, should be connected together. We identify an order which reflects the importance of the different types of information, starting from the nucleotide type, the block ID, the mutation evidence and finally the relative position within a block. Note that the combinatorial feature generation phase is affected by the sequential order of these attributes, as the information that is ranked higher will participate in the generation of more features and will therefore be regarded as more prominent.2.4 Predictive model and feature extractionAfter having encoded domain expert knowledge as a graph, we need to process this type of structured data to induce a predictive model. We do this using the technique developed by Costa and Grave (2010), based on the notion of graph kernels. The core idea (see Supplementary Information for a formal description) is to decompose each graph in a (multi) set of fragments and use these as features, in a similar fashion to what is done in the chemoinformatics domain with the fingerprint technique. The resulting sparse vectors can then be processed by efficient machine learning techniques, such as the stochastic gradient descent SVM (Bottou, 2010), to yield fast and highly predictive models. The type of graph decomposition that we use is called Neighbourhood Subgraph Pairwise Distance Kernel (NSPDK), and it involves the extraction of all possible pairs of small neighbourhood subgraphs that are not too distant (see Fig. 3). Intuitively one can think about this type of decomposition as an upgrade of the concept of k-mers with gaps from the domain of strings to that of graphs. Both the extraction of the features and the training of the predictive model have linear complexity and offer therefore excellent scaling capability. More precisely, extracting all neighbourhood subgraphs is achieved with a breadth-first visit for a limited depth starting from each node, and as the graphs are sparse, it takes O(|n*m|) where n is the number of nucleotides and m the number of repeat alignments. Open in a separate windowFig. 3.The NSPDK approach extracts a large number of features taking only specific fragments into account. The procedure is parametrized by the radius R and the distance D. Each vertex is considered in turn as a root. A neighbourhood graph of radius R is extracted around each root. All possible pairs of neighbourhood graphs of the same size R are considered, provided that their respective roots are exactly at distance D. To understand the importance of the sequential order of the attributes consider the left part of the figure: here we depict a feature with radius 1 and distance 0, which will encode three pieces of information: (i) the specific dinucleotide combination, (ii) the block ID and (iii) whether a mutation is likely to occur on the first nucleotide of the dinucleotide. As we increase the maximal distance between the roots in the pair, the encoded information is further specialized. In the middle part of the figure, we show a feature that additionally includes the presence of a mutation at distance 5. When the radius is increased to 2, the specific position within the block is also consideredFinally, given that one of the two strands can be the one that exhibits a characteristic pattern, we train a predictive model on both variants of each repeat sequence: one obtained from the forward strand and the other from the complementary reverse strand. The binary task is therefore to assign a positive score to the sequences that are transcribed and a negative one to the complementary strand. In the predictive phase, we enforce consistency by considering the prediction on both variants of the sequence: a strong confidence of the prediction of the forward strand should also correspond to an equally confident prediction that the reverse complementary sequence is not transcribed. To do so, we simply perform the individual predictions and then average the prediction of the forward strand with the opposite prediction for the reverse strand. If the resulting score is positive, then the forward strand is predicted to be transcribed, whereas the reverse strand is selected if the score is negative.

Article TitleCRISPRstrand: predicting repeat orientations to determine the crRNA-encoding strand at CRISPR loci

Abstract

We extracted a comprehensive dataset of CRISPR loci from published archaeal and bacterial genomes. All genome sequences were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/). We predicted CRISPR loci using CRISPRFinder (Grissaet al., 2007) and CRT (Blandet al., 2007). For both tools, we used (i) default parameter values for predicted CRISPR loci and (ii) parameters that corresponded to at least two repeats within a CRISPR locus; repeat and spacer lengths were set to a range between 18 and 78 bp. We then (iii) generated a consensus repeat for each CRISPR locus exploiting the fact that repeats within a CRISPR locus are almost completely identical with some loci that carry few mutations, preferably at the start and end (seeSupplementary Figure S3,Supplementary material). Because CRT does not output consensus repeats, we used the MAFFT program (Katohet al., 2002), version 6.4., to compute the multiple alignments and the Cons program from EMBOSS package (Riceet al., 2000) to obtain the consensus repeat from the multiple sequence alignments. Finally, (iv) the results from both CRISPRFinder and CRT tools were merged and redundant CRISPR loci were removed. In this way, we obtained a CRISPR databases with >4700 consensus repeats, which we refer to as REPEATS (seeTable 1for details).


Login or Signup to leave a comment
Find your community. Ask questions. Science is better when we troubleshoot together.
Find your community. Ask questions. Science is better when we troubleshoot together.

Have a question?

Contact support@scifind.net or check out our support page.