Materials and Methods

CRISPRloci: comprehensive and accurate annotation of CRISPR–Cas systems

MATERIALS AND METHODS

Input

CRISPRloci offers four different modes of operation, depending on the elements to be annotated: protein, genomic DNA, CRISPR repeat or viral sequences are accepted (see Figure 1 and Supplementary Table S2 in the supplementary materials). The Genome DNA mode is the most comprehensive one: it screens a prokaryotic genome for CRISPR arrays and also determines their orientation and associated leader sequences. Moreover, it identifies the cassette boundaries and, within these boundaries, the Cas proteins together with their subtype classification. Three sets of parameters are available that enable the user to fine-tune the predictions of both CRISPR arrays and cas genes. In addition, all the parameters feature tool-tips. The second mode requires a set of prokaryotic protein sequences as input. Our method is sufficiently fast to screen an entire proteome. It identifies and classifies Cas proteins, and detects cassette boundaries if the protein sequences are provided in the correct order. The third mode accepts one or more CRISPR repeat sequences and identifies the repeat orientation and subtype. Additionally, a search against integrated databases finds regions of local similarity between the input sequences and the list of bona-fide consensus repeats. The fourth mode requires the upload of a complete or partial viral/phage genome. It analyses host-viral connections by reporting how many spacers potentially originated from the input viral genome.

Figure 1. The workflow of CRISPRloci. The workflow supports four different types of input. If DNA is picked as the input, CRISPRloci will identify the CRISPR arrays, predict their orientation and the leader sequence and then extract the repeat and spacer sequences. Repeat sequences are then analyzed for their structural stability, while spacers are used to identify the potential regions of self-targeting.
If protein sequences are submitted as input, CRISPRloci will classify and report the protein type and role. The user can also input a set of repeat sequences. In this scenario, CRISPRloci will search the existing database for similar repeat sequences. The user will be provided with the hits as well as their region, similarity and e-value. Lastly, the user can provide viral DNA as the input. In this scenario, CRISPRloci will search for protospacers using a database of spacers. The user will be provided with the protospacer coordinates as well as the description of the host CRISPR arrays.

Detection of CRISPR arrays

Correctly detecting CRISPR arrays poses two main difficulties. The first lies in the correct identification of the CRISPR-array representation, i.e. the array boundaries and the repeat sequence. Once an array-like structure is detected, the second is to distinguish a bona-fide CRISPR array from repetitive structures that merely resemble one (pseudo CRISPR arrays). In our approach we rely on CRISPRidentify (14) for both tasks.

To overcome the first challenge, CRISPRidentify utilizes consecutive enhancement steps to build multiple candidate representations for each potential CRISPR-array region (see Supplementary Table S1 and Figures S2–S8 in the supplementary materials for the comparison with other tools).

To pick the best representation and simultaneously filter out false candidates, CRISPRidentify utilizes a data-driven, ML-based approach. First, it transforms each candidate into a feature vector, where each feature represents a biological property such as repeat length, number of mismatches between repeats, or similarity of spacers. Afterwards, the candidate is classified by a pre-trained ML model. This approach yields a certainty score for each candidate and thus an assessment of its confidence level. After the CRISPR-array extraction, the orientation is predicted using CRISPRstrand (15).
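
As an illustration of the feature-vector step described above, the following minimal sketch computes three of the named properties (repeat length, mismatches between repeats, spacer similarity) for one candidate array. It is not CRISPRidentify's implementation: the function names, the exact feature definitions (mean pairwise mismatch count, mean pairwise identity) and the example sequences are assumptions made here for illustration only.

```python
from itertools import combinations
from statistics import mean


def mismatches(a, b):
    """Count differing positions; a length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def identity(a, b):
    """Fraction of identical positions over the shorter sequence (ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def candidate_features(repeats, spacers):
    """Hypothetical feature vector for one candidate CRISPR array."""
    repeat_pairs = list(combinations(repeats, 2))
    spacer_pairs = list(combinations(spacers, 2))
    return {
        "repeat_length": mean(len(r) for r in repeats),
        "repeat_mismatches": (
            mean(mismatches(a, b) for a, b in repeat_pairs) if repeat_pairs else 0.0
        ),
        "spacer_similarity": (
            mean(identity(a, b) for a, b in spacer_pairs) if spacer_pairs else 0.0
        ),
    }


# Made-up example candidate: three repeats separated by two spacers.
features = candidate_features(
    repeats=["GTTTTAGA", "GTTTTAGA", "GTTTCAGA"],
    spacers=["ACGTACGT", "TTGACCAA"],
)
```

A vector of this kind would then be passed to a trained classifier; conserved repeats (few mismatches) combined with dissimilar spacers are the pattern expected of a bona-fide array.
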
Finally, we enrich the identified array with information about the leader sequence using CRISPRleader (6).

It is well known that the secondary structure motif of the CRISPR repeat is essential for the generation and loading of crRNAs in many CRISPR–Cas systems. Therefore, after building the set of CRISPR arrays, we analyze the structural stability profiles of the repeats in each CRISPR array. First, we use RNAfold (16) to measure the Minimum Free Energy (MFE) of the consensus repeat. Next, we reduce the contribution of long-range base pairs, which are unreliable, by using a local folding approach to determine base-pair probabilities (8). Afterwards, we compute a local structure prediction on the entire CRISPR array using RNAplfold (17) with the window-size and base-pair-span parameters set to W = 150 and L = 80, respectively. In addition, we use the option --noLP to disallow lonely base pairs, which usually improves the prediction quality.

Finally, we explore CRISPR self-targeting and alternative functions of CRISPR–Cas systems that extend beyond adaptive immunity. CRISPRloci detects possible self-targeting spacers in a given genome of interest. To identify positional self-targeting spacers, we extract all spacer sequences from each CRISPR array and scan for exact or partial matches between the spacer and a portion of the genomic sequence that is not part of a CRISPR array. Furthermore, we classify the context of the match as mobilome or non-mobilome genes to provide information about the possible evolutionary origin.

Boundaries of Cas Cassette

In the CRISPR research field, the identification of cassette boundaries plays an essential role in detecting the cassettes, as new unknown cas genes must be disentangled from random genes bordering the locus.

We introduced the first tool, named Casboundary (18), that is able to define the cassette boundaries automatically based on ML. Casboundary assumes that the relation between the signature gene (i.e.
the main gene used to define a cassette) and any other member of the same cassette is stronger than the relation between the signature gene and any non-member. In particular, we trained two predictive models, using Extremely Randomized Trees (ERT) and Deep Neural Networks (DNN), to classify whether signature genes and candidate genes belong to the same cassette (positive relation) or not (negative relation). Given a genome of interest, for each signature gene found on the genome, the tool defines a potential CRISPR region by considering an interval of k genes downstream and k genes upstream of the signature gene (default: k = 50). Next, the induced models are employed to predict the label for the relation between the signature gene and all genes in the potential region. The boundary is specified as the maximal sub-region formed by a list of consecutive genes, such that the first and last gene have positive relations with the signature gene and no more than three consecutive genes with negative relations are permitted.

In the experiments carried out, Casboundary achieved a Jaccard Similarity (JS) score of 0.86, where JS measures the overlap rate between the true and predicted cassettes. By comparison, CRISPRCasFinder (19), the most similar tool to Casboundary available in the literature, achieved a JS score of 0.70.

Classification of Cas proteins and cassette modularization

Considering the high variability of the Cas protein sequences, their classification using only standard methods, such as sequence homology or Hidden Markov Models, cannot be easily accomplished. Therefore, we used Casboundary to classify Cas proteins according to the known core and signature families. For this task, Casboundary combined features of protein properties with evidence extracted from Cas Hidden Markov Models.
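
The boundary rule stated in the previous subsection (a maximal run of consecutive genes that starts and ends with a positive relation and tolerates at most three consecutive negative relations) and the Jaccard Similarity used for evaluation can be illustrated with a minimal sketch. This is not Casboundary's code: the function names are hypothetical, the per-gene labels are assumed to have already been predicted by the models, the region is assumed to contain the signature gene, and JS is assumed here to be computed over sets of gene indices.

```python
def cassette_boundary(labels, signature, max_gap=3):
    """Return (first, last) gene indices of the cassette around `signature`.

    labels[i] is True when gene i has a positive relation with the
    signature gene. The region grows outwards from the signature gene and
    stops once more than `max_gap` consecutive negative genes are seen, so
    both ends of the returned region are positive genes.
    """

    def extend(step):
        edge, gap = signature, 0
        i = signature + step
        while 0 <= i < len(labels) and gap <= max_gap:
            if labels[i]:
                edge, gap = i, 0  # positive gene: move the boundary outwards
            else:
                gap += 1          # negative gene: tolerated up to max_gap
            i += step
        return edge

    return extend(-1), extend(+1)


def jaccard(true_genes, predicted_genes):
    """Jaccard Similarity between two sets of gene indices."""
    true_genes, predicted_genes = set(true_genes), set(predicted_genes)
    union = true_genes | predicted_genes
    return len(true_genes & predicted_genes) / len(union) if union else 1.0


# Made-up labels for 15 genes; the signature gene sits at index 6.
T, F = True, False
labels = [F, T, F, F, F, T, T, T, F, T, F, F, F, F, T]
first, last = cassette_boundary(labels, signature=6)
predicted = range(first, last + 1)
score = jaccard(range(3, 10), predicted)
```

In this toy example the gap of three negatives at indices 2 to 4 is bridged, whereas the run of four negatives at indices 10 to 13 ends the cassette, so the positive gene at index 14 is excluded.
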
Based on the probabilities assigned to a protein of belonging to each known Cas family, Casboundary was also able to detect proteins that may belong to new putative Cas families.

After classifying the Cas proteins of the identified cassettes, Casboundary applies a decomposition step that annotates the typical functional modules (adaptation, processing or interference) contained in the cassettes.

Classification of cassettes and prediction of missing proteins

The classification of a cassette subtype is based on the combination of the Cas proteins that it contains (4,11,20). To perform this task, our CRISPRcasIdentifier tool (21) represents the input cassettes as multidimensional vectors, where each feature corresponds to a different Cas protein family and each value is the normalized bit score for that family. Thus, we use the normalized bit scores as evidence that a specific Cas protein is contained in a cassette. Next, CRISPRcasIdentifier proceeds to the classification step, which allows the use of three ML algorithms for the induction of classifiers: CART Decision Tree Algorithm (22), Support Vector Machines (23) and Extremely Randomized Trees (24). During our analysis, we observed that the classifiers correctly identified signatures composed of one or more genes to determine the cassette subtypes. Such signatures represent the main information that guides the categorization manually performed by experts. CRISPRcasIdentifier can also predict potentially missing proteins in the input cassettes, based on the remaining proteins. This task is performed by a set of regressors trained to predict the normalized bit scores of each Cas family. As a result, it provides evidence for a detailed investigation by researchers to annotate the missing protein(s).
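
The vector representation of a cassette described above can be sketched as follows. This is only an illustration under stated assumptions, not CRISPRcasIdentifier's code: the family list is a small hypothetical subset, the normalization (best bit score per family divided by an assumed per-family maximum) is one possible scheme that the text does not specify, and all scores are made up.

```python
# Hypothetical, shortened list of Cas protein families (one feature each).
FAMILIES = ["Cas1", "Cas2", "Cas3", "Cas9", "Cas10"]


def cassette_vector(hits, max_scores, families=FAMILIES):
    """Turn HMM hits of one cassette into a fixed-length feature vector.

    hits       -- iterable of (family, bit_score) pairs for the cassette
    max_scores -- assumed per-family maximum bit score used to normalize
    Families without a hit receive the value 0.0.
    """
    best = {}
    for family, score in hits:
        # Keep only the strongest hit when a family is hit more than once.
        best[family] = max(score, best.get(family, 0.0))
    return [best.get(f, 0.0) / max_scores[f] for f in families]


# Made-up bit scores for a cassette with Cas1, Cas2 and two Cas9 hits.
hits = [("Cas1", 300.0), ("Cas2", 90.0), ("Cas9", 1200.0), ("Cas9", 600.0)]
max_scores = {"Cas1": 600.0, "Cas2": 180.0, "Cas3": 1500.0,
              "Cas9": 2400.0, "Cas10": 1400.0}
vector = cassette_vector(hits, max_scores)
```

Vectors of this form can serve both as input to a subtype classifier and as input to per-family regressors, where a high predicted score for a family with an observed value of 0.0 would point to a potentially missing protein.
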
CRISPRcasIdentifier was compared to five other popular tools from the literature (two webservers and three command-line tools) on the largest public CRISPR benchmark dataset (5). In this analysis, our tool obtained an F-score and a balanced accuracy of 0.91 and 0.89, respectively. By comparison, the best performances obtained by the other tools were 0.63 and 0.54, respectively.

Virus–host interactions/phage–host interactions

To enhance the study of the mechanisms involved in Virus–Plasmid–Host interactions, it is essential to know the host of a particular virus, phage or plasmid. CRISPRloci therefore provides information on such interactions by detecting all types of matches between a given complete or partial phage genome and, for instance, the database of CRISPR spacers from archaeal and bacterial genomes, based on CRISPRidentify.

Processing and implementation

CRISPRloci was implemented with the Freiburg RNA server (13) framework, which is based on Java Server Pages (JSP) processed by an Apache Tomcat server. The jobs of the four different webserver modes are executed within bioconda (25) environments, using pinned tool versions to ensure reproducibility. The processing time (in minutes) for the example datasets provided with the webserver is as follows: 33 (mode 1), 15 (mode 2), 1 (mode 3), 1 (mode 4). For each user submission, a unique link is generated, which tracks the progress and retrieves the results upon completion.
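
The spacer-to-phage matching described under Virus–host interactions can be illustrated with a minimal sketch that reports exact matches of spacers on both strands of a viral genome. It is an assumption-laden simplification and not the CRISPRloci implementation: only exact matches are handled (partial matches would require an alignment tool), the function names are hypothetical, and the sequences are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Yield every (possibly overlapping) start position of pattern in text."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def find_protospacers(viral_genome, spacers):
    """Return (spacer_id, strand, start, end) for exact spacer matches.

    Coordinates are 0-based, end-exclusive and refer to the forward strand
    of the viral genome. A palindromic spacer is reported on both strands.
    """
    hits = []
    for spacer_id, spacer in spacers.items():
        for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
            for start in find_all(viral_genome, query):
                hits.append((spacer_id, strand, start, start + len(query)))
    return hits


# Made-up genome containing spacer s1 once on each strand; s2 has no match.
genome = "TTACGGTCAAGGCCTGACCGTAA"
spacers = {"s1": "ACGGTCA", "s2": "GGGGGGG"}
hits = find_protospacers(genome, spacers)
```

Counting the hits per host array gives the kind of summary the fourth mode reports, namely how many spacers potentially originated from the input viral genome.
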


The CRISPRloci pipeline is implemented in Python, Perl and Java and is freely available as both a webserver and a standalone version. The webserver can be accessed via the following link: The standalone version can be downloaded from the following GitHub repository:
