Materials and Methods

PaCRISPR: a server for predicting and visualizing anti-CRISPR proteins

Here we describe the overall workflow of PaCRISPR in terms of data collection and curation, feature encoding, model training and integration, model performance evaluation, and toolkit development and usage (Figure 1).

Figure 1.

The methodology of the PaCRISPR server. (A) Ensemble model construction. (B) Multiple-time undersampling to solve the data imbalance problem. (C) The architecture of the PaCRISPR web server.

Data collection and curation

To train and test the proposed method, we extracted 488 experimentally validated anti-CRISPRs from the Anti-CRISPRdb (14) and from the literature (17,22). After removing redundant sequences with more than 70% sequence identity, we obtained 98 sequences as positive samples in the training dataset (Supplementary Table S1). Considering that anti-CRISPRs are small proteins found in a limited set of phages and a limited set of mobile genetic elements (MGEs), we constructed the negative samples in the training dataset using four strict criteria. Negative sample proteins: (i) must not be known or putative anti-CRISPRs themselves; (ii) must be isolated from phages or from bacterial MGEs (which may be known or putative MGEs), where the given bacterial genera are known to harbour anti-CRISPRs; (iii) must have <40% sequence similarity to each other and to the 98 positive samples; (iv) must have lengths between 50 and 350 residues, which is similar to the length range of the 98 positive samples. In this way, we obtained a training dataset with 98 positive and 902 negative samples, and they have similar distributions (Supplementary Figure S5).

To further test the proposed method, 26 newly discovered, highly distinct anti-CRISPRs were subsequently collected from emerging papers that were recorded in the unified online anti-CRISPR resource (13). These 26 sequences form the positive samples of the independent dataset; they share less than 10% similarity with the 98 anti-CRISPRs in the training dataset, except for two that have similarities of 21.38% and 56.12% (Supplementary Table S2). We then collected 260 non-anti-CRISPRs using criteria similar to those used for the negative samples in the training dataset; these have <40% sequence similarity to the training dataset and to the positive samples in the independent dataset. In total, the independent dataset has 26 positive and 260 negative samples (Supplementary Figure S6).

As the predictor was trained with small proteins, it is necessary to test its predictive power when identifying long non-anti-CRISPRs. We constructed two purely negative datasets by retrieving 266 non-anti-CRISPRs from phages and 597 non-anti-CRISPRs from known and putative bacterial MGEs. Both datasets have less than 40% sequence similarity to each other and to the above datasets, and contain only sequences with lengths ≥350 residues.

We additionally used five very recently discovered anti-CRISPRs and a bacterial contig as case studies to validate the prediction capability of the proposed method in a more practical scenario (Supplementary Table S3).

Feature encoding

Novel anti-CRISPRs are especially difficult to identify given that they are highly diverse, sharing no conserved sequence or structural motifs (3,4,24). This low sequence similarity makes it particularly challenging to predict anti-CRISPRs from sequence-based features, which only mine characteristics from the protein sequences themselves. In contrast, evolutionary features extracted from the Position-Specific Scoring Matrix (PSSM) track, to some extent, the evolutionary history of proteins and are therefore proposed to capture more informative patterns (25,26). Evolutionary features have been widely applied and shown to contribute significantly to protein attribute and function prediction, especially for identifying highly evolved proteins that lack observable signals (25–36).

To generate a PSSM, the PSI-BLAST program (37) (version blast-2.2.26 in this work) was used to search a given protein iteratively (three iterations) against a database (UniRef50) to detect its distantly related homologous proteins that satisfy a specified e-value threshold (0.001) (Figure 1A). Based on the multiple alignments of those homologous proteins, the generated PSSM combines their underlying conservation information and can therefore detect distant sequence similarities. For a protein of length $L$, its PSSM is an $L \times 20$ matrix ($P = \{p_{i,j}\}$, $i = 1, \ldots, L$, $j = 1, \ldots, 20$), where 20 represents the number of native amino acid types (Figure 1A). The element $p_{i,j}$ is a score that indicates the degree of conservation of the $j$th amino acid type at the $i$th position of the protein sequence. A high score denotes a highly conserved position, while a low score denotes a weakly conserved position (25,38).

Here, using the POSSUM toolkit (39), we extracted four evolutionary features that mine information from the PSSM in different ways: PSSM-composition (33), DPC-PSSM (35), PSSM-AC (36) and RPSSM (40) (explained below). We also implemented two commonly used sequence-based features as baselines: amino acid composition (AAC) and dipeptide composition (DPC). AAC counts the frequencies of residues, while DPC counts the frequencies of dipeptides in a protein sequence.
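The two baseline encodings can be illustrated with a minimal sketch. It assumes that frequencies are normalized by the number of counted residues or dipeptides and that non-standard symbols are skipped; the function names `aac` and `dpc` and the alphabetical ordering of amino acids are illustrative choices, not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 native amino acid types (assumed order)


def aac(sequence):
    """Return the 20 residue frequencies of a protein sequence."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence:
        if residue in counts:  # skip non-standard symbols such as X
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def dpc(sequence):
    """Return the 400 dipeptide frequencies of a protein sequence."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = {pair: 0 for pair in pairs}
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip dipeptides containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[pair] / total if total else 0.0 for pair in pairs]
```

Under these assumptions AAC always yields a 20-dimensional vector and DPC a 400-dimensional vector, regardless of sequence length.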

PSSM-composition

As the number of rows of a PSSM depends on the length of its protein sequence, PSSM-composition removes this variability by transforming the variably-sized PSSM into a fixed-size matrix. By summing and averaging the rows that belong to each native amino acid type, PSSM-composition transforms the original PSSM into a 20 × 20 matrix:

$$R_i = \frac{1}{L} \sum_{k=1}^{L} r_k \times \delta_k, \quad i = 1, \ldots, 20$$

subject to

$$\delta_k = \begin{cases} 1, & q_k = a_i \\ 0, & \text{otherwise} \end{cases}$$

where $R_i$ represents the $i$th row of the resultant matrix, $r_k$ denotes the $k$th row of the PSSM, $q_k$ denotes the $k$th residue in the original protein sequence, and $a_i$ denotes the $i$th native type of amino acids. Finally, PSSM-composition converts the 20 × 20 matrix line-by-line into a single 400-dimensional vector.
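The transformation described above can be sketched as follows. This is an illustrative sketch rather than the POSSUM implementation: it assumes the PSSM is given as a list of L rows of 20 scores, that the sum is divided by the sequence length L, and that amino acid types follow the column order of PSI-BLAST ASCII PSSM files. The name `pssm_composition` is hypothetical.

```python
PSSM_ORDER = "ARNDCQEGHILKMFPSTWYV"  # assumed amino acid order (PSI-BLAST columns)


def pssm_composition(sequence, pssm):
    """Collapse an L x 20 PSSM into a 400-dimensional vector.

    Row i of the intermediate 20 x 20 matrix is the sum of all PSSM rows
    whose residue is the i-th amino acid type, divided by L.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    length = len(sequence)
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = PSSM_ORDER.find(residue)
        if i < 0:  # non-standard residue: contributes to no row
            continue
        for j in range(20):
            matrix[i][j] += row[j]
    # Flatten line-by-line (row-major) into one vector.
    return [value / length for matrix_row in matrix for value in matrix_row]
```

For example, a two-residue sequence `"AA"` whose PSSM rows are all 1s and all 3s yields 2.0 in the 20 entries for alanine and 0.0 elsewhere.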

DPC-PSSM

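DPC-PSSM transforms the columns of the PSSM to mine its local sequence-order effect, and generates a 400-dimensional vector as follows:

$$Y = (y_{1,1}, \ldots, y_{1,20}, y_{2,1}, \ldots, y_{2,20}, \ldots, y_{20,1}, \ldots, y_{20,20})^{T}$$

subject to

$$y_{i,j} = \frac{1}{L-1} \sum_{k=1}^{L-1} p_{k,i} \times p_{k+1,j}, \quad 1 \le i, j \le 20$$

where $p_{k,i}$ represents the element at the $k$th row and $i$th column of the PSSM.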
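A minimal sketch of this encoding is given below. It assumes the commonly used DPC-PSSM definition, in which each feature averages the product of column i at position k and column j at position k+1 over the L-1 adjacent position pairs; the function name `dpc_pssm` and the row-major output order are illustrative assumptions.

```python
def dpc_pssm(pssm):
    """Return the 400-dimensional DPC-PSSM vector of an L x 20 PSSM.

    Feature (i, j) is the mean over k of pssm[k][i] * pssm[k + 1][j].
    """
    length = len(pssm)
    if length < 2:
        raise ValueError("DPC-PSSM needs at least two PSSM rows")
    features = []
    for i in range(20):
        for j in range(20):
            total = sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1))
            features.append(total / (length - 1))
    return features
```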

PSSM-AC

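PSSM-AC calculates the correlation between two elements within the same column of the PSSM that are separated by a lag of $lg$ positions, using the following formulas:

$$AC(j, lg) = \frac{1}{L - lg} \sum_{i=1}^{L - lg} (p_{i,j} - \bar{p}_j)(p_{i+lg,j} - \bar{p}_j)$$

subject to

$$\bar{p}_j = \frac{1}{L} \sum_{i=1}^{L} p_{i,j}$$

where $lg$ ranges from 1 to $LG$, and $p_{i,j}$ represents the element at the $i$th row and $j$th column of the PSSM. As a result, the number of elements in the PSSM-AC vector amounts to $20 \times LG$, with $LG < L$. In this work, we used the default value of 10 for $LG$, and finally generated a 200-dimensional vector.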
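The auto-covariance computation can be sketched as below. The sketch assumes the standard PSSM-AC definition (per-column mean removal, then the mean product of values separated by a lag) and orders the output column by column; the name `pssm_ac` and that ordering are assumptions made for illustration.

```python
def pssm_ac(pssm, max_lag=10):
    """Return the 20 * max_lag PSSM-AC features of an L x 20 PSSM."""
    length = len(pssm)
    if max_lag >= length:
        raise ValueError("max_lag (LG) must be smaller than the sequence length L")
    # Mean score of each amino acid column over the whole sequence.
    means = [sum(row[j] for row in pssm) / length for j in range(20)]
    features = []
    for j in range(20):
        for lag in range(1, max_lag + 1):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features
```

With the default `max_lag=10`, any PSSM with more than 10 rows produces a 200-dimensional vector, matching the dimensionality stated above.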

RPSSM

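RPSSM also explores the local sequence-order effect, but is based on a reduced PSSM. It first generates an L × 10 reduced PSSM by merging some columns of the original PSSM, which can be represented as follows:

$$R = (r_1, r_2, \ldots, r_{10})$$

subject to

$$r_1 = \frac{p_F + p_Y + p_W}{3}, \quad r_2 = \frac{p_M + p_L}{2}, \quad r_3 = \frac{p_I + p_V}{2}, \quad r_4 = \frac{p_A + p_T + p_S}{3}, \quad r_5 = \frac{p_N + p_H}{2},$$

$$r_6 = \frac{p_Q + p_E + p_D}{3}, \quad r_7 = \frac{p_R + p_K}{2}, \quad r_8 = p_C, \quad r_9 = p_G, \quad r_{10} = p_P$$

where $p_A$, …, $p_V$ denote the 20 columns in the original PSSM corresponding to the 20 native types of amino acids. The reduced PSSM is further transformed into a 10-element vector:

$$D_s = \frac{1}{L} \sum_{i=1}^{L} (r_{i,s} - \bar{r}_s)^2, \quad s = 1, \ldots, 10$$

subject to

$$\bar{r}_s = \frac{1}{L} \sum_{i=1}^{L} r_{i,s}$$

where $r_{i,s}$ represents the element at the $i$th row and $s$th column of the reduced PSSM. The reduced PSSM can also be transformed into a 10 × 10 matrix to explore its local sequence-order effect:

$$D_{s,t} = \frac{1}{L-1} \sum_{i=1}^{L-1} x_{i,i+1}, \quad s, t = 1, \ldots, 10$$

subject to

$$x_{i,i+1} = \frac{(r_{i,s} - r_{i+1,t})^2}{2}$$

Finally, we obtained the RPSSM feature in 110 dimensions by combining $D_s$ and $D_{s,t}$.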
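The following sketch illustrates how a 110-dimensional RPSSM vector could be assembled. It rests on several assumptions that are not confirmed by the text above: the ten column groups (FYW, ML, IV, ATS, NH, QED, RK, C, G, P) follow the grouping commonly associated with the RPSSM descriptor, merged columns are averaged, and the PSSM columns follow the PSI-BLAST order. The names `reduce_pssm` and `rpssm` are hypothetical.

```python
PSSM_ORDER = "ARNDCQEGHILKMFPSTWYV"  # assumed column order of the PSSM
# Assumed residue grouping used to merge 20 columns into 10.
GROUPS = ["FYW", "ML", "IV", "ATS", "NH", "QED", "RK", "C", "G", "P"]


def reduce_pssm(pssm):
    """Average the columns of each residue group to get an L x 10 matrix."""
    index = {aa: k for k, aa in enumerate(PSSM_ORDER)}
    return [
        [sum(row[index[aa]] for aa in group) / len(group) for group in GROUPS]
        for row in pssm
    ]


def rpssm(pssm):
    """Return 10 per-column variances followed by 100 adjacent-pair terms."""
    reduced = reduce_pssm(pssm)
    length = len(reduced)
    if length < 2:
        raise ValueError("RPSSM needs at least two PSSM rows")
    means = [sum(row[s] for row in reduced) / length for s in range(10)]
    # 10-element part: variance of each reduced column.
    single = [
        sum((row[s] - means[s]) ** 2 for row in reduced) / length
        for s in range(10)
    ]
    # 10 x 10 part: mean of half the squared difference between column s at
    # position i and column t at position i + 1.
    paired = [
        sum((reduced[i][s] - reduced[i + 1][t]) ** 2 / 2 for i in range(length - 1))
        / (length - 1)
        for s in range(10)
        for t in range(10)
    ]
    return single + paired
```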

Model construction

To deal with the imbalanced classification problem, for each of the features, we constructed 10 subsets by combining the positive samples with the same number of randomly selected negative samples from the training dataset (Figure 1B). We accordingly trained 10 classifiers using the support vector machine (SVM) and integrated them by averaging their prediction outputs. SVM is widely used to solve binary classification problems in the field of computational biology (41). In particular, SVM with a radial basis function (RBF) kernel has been successfully used for nonlinear biological sequence classification (29,30). Two parameters affect the performance of the RBF kernel-based SVM: Cost controls the penalty for misclassifying training data, and Gamma is a parameter specific to the RBF kernel. In this study, for each SVM-based classifier, the parameters Cost and Gamma were optimized using a grid search within the space $\{2^{-10}, \ldots, 2^{10}\}$. In this way, we obtained an ensemble model as the baseline model for each feature (termed the single feature-based model) (30,32). To make full use of the different types of evolutionary features, we averaged the prediction scores of their single feature-based models to form the final ensemble model (Figure 1A).
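The undersampling-and-averaging scheme can be sketched as below. To keep the example self-contained, the RBF-kernel SVM is replaced by a simple nearest-centroid scorer; this stand-in, all function names, and the assumption that the grid steps through integer exponents of 2 are illustrative and do not reflect the actual implementation.

```python
import random


def balanced_subsets(positives, negatives, n_subsets=10, seed=0):
    """Pair all positives with an equally sized random draw of negatives."""
    if len(negatives) < len(positives):
        raise ValueError("need at least as many negatives as positives")
    rng = random.Random(seed)
    return [
        (list(positives), rng.sample(negatives, len(positives)))
        for _ in range(n_subsets)
    ]


def train_centroid_scorer(positives, negatives):
    """Stand-in for an RBF-kernel SVM: score in [0, 1], higher means positive."""
    def centroid(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    pos_c, neg_c = centroid(positives), centroid(negatives)

    def score(x):
        d_pos, d_neg = distance(x, pos_c), distance(x, neg_c)
        return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)

    return score


def train_ensemble(positives, negatives, n_subsets=10, seed=0):
    """Train one scorer per balanced subset and average their outputs."""
    scorers = [
        train_centroid_scorer(pos, neg)
        for pos, neg in balanced_subsets(positives, negatives, n_subsets, seed)
    ]
    return lambda x: sum(scorer(x) for scorer in scorers) / len(scorers)


def parameter_grid(low=-10, high=10):
    """Candidate (Cost, Gamma) pairs, assuming integer exponents of 2."""
    values = [2.0 ** exponent for exponent in range(low, high + 1)]
    return [(cost, gamma) for cost in values for gamma in values]
```

The final model described above would apply the same averaging one level higher, across the single feature-based ensembles built for each evolutionary feature.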

Performance evaluation

The proposed method was rigorously and extensively validated using a 5-fold cross-validation test and an additional independent test, and its prediction capability was further investigated using case studies. Performance measurements include Sensitivity (SN), Specificity (SP), Accuracy (ACC), F-value and Matthews correlation coefficient (MCC) (42), which are defined as follows:

$$SN = \frac{TP}{TP + FN}$$

$$SP = \frac{TN}{TN + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$F\text{-value} = \frac{2 \times TP}{2 \times TP + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. For a predictor, SN and SP measure its power to identify positive and negative samples, respectively. ACC, F-value, and MCC measure its overall capability to identify both positive and negative samples. In addition, the receiver operating characteristic (ROC) curve, together with its AUC (area under the curve) value, was used to visualize the prediction performance of a predictor.
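As a worked illustration, the five threshold-based measures can be computed from the confusion-matrix counts using their standard definitions. The function name `evaluate` and the convention of returning 0.0 when a denominator is zero are assumptions made for this sketch.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, tn, fp, fn):
    """Return SN, SP, ACC, F-value and MCC from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": _ratio(tp, tp + fn),
        "SP": _ratio(tn, tn + fp),
        "ACC": _ratio(tp + tn, tp + tn + fp + fn),
        "F": _ratio(2 * tp, 2 * tp + fp + fn),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

For a dataset with few positives and many negatives, ACC can remain high even when SN is poor, which is why F-value and MCC are reported alongside it.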

Server construction

The architecture of the PaCRISPR server consists of two components: a client web interface and a server backend (Figure 1C).

The client web interface is responsible for interacting with users through the input and output displays, and for processing the service logic, including illegal character detection, sequence validation and formatting. The input and output displays were implemented using JSP, CSS, jQuery (https://jquery.com/), Bootstrap (https://bootstrapdocs.com/) and their extension packages. Specifically, the sequence similarity was visualized by BlasterJS (43), and the phylogenetic tree was presented using jsPhyloSVG (44). The service logic was implemented using the Java (https://www.java.com/) server development suite, including Struts 2 (https://struts.apache.org/) and Hibernate (https://hibernate.org/).

The server backend is responsible for executing the whole prediction process, including encoding features, making predictions, and generating visualization-ready data. The prediction program was written in the R language (https://www.r-project.org/) and depends on the e1071 package for SVM modelling (https://CRAN.R-project.org/package=e1071). The BLAST program (version 2.8.1+) (45) was used to search each predicted anti-CRISPR against the known anti-CRISPRs, and to record the regions of similarity for sequence similarity visualization. The MAFFT toolkit (46) was used to generate multiple alignments between each predicted anti-CRISPR and the known anti-CRISPRs for phylogenetic tree visualization. A Perl CGI (https://metacpan.org/pod/CGI) program was written to string these steps together within a single thread.

The client web interface interacts with the server backend through a fast and lightweight queueing system, implemented using the Gearman framework (http://gearman.org/). The client web interface simply puts the user's submissions (each of them as a job) into the queueing system, where idle Perl threads, maintained in a daemon thread pool of customizable size, pull and execute the jobs. During the whole process, the MySQL database (https://www.mysql.com/) is used to store intermediate and final results, as well as to synchronize messages between the client web interface and the server backend. In this way, the architecture provides a better user experience by decoupling the client web interface, which requires a prompt response, from the server backend, which handles time-consuming jobs. This also makes the architecture amenable to expansion, so that new computational facilities can be added to meet the increasing demand for predictions on ever-accumulating genome-scale data.
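The decoupling pattern described above can be illustrated with a small in-process analogy. This sketch does not use Gearman, Perl, or MySQL: a standard-library queue stands in for the queueing system, Python daemon threads for the worker pool, and a dictionary for the results store. The names `run_worker_pool` and `handler` are hypothetical.

```python
import queue
import threading


def run_worker_pool(jobs, handler, pool_size=4):
    """Run (job_id, payload) jobs on a pool of daemon worker threads.

    Returns a dict mapping each job_id to handler(payload), or to the
    exception raised while handling that job.
    """
    job_queue = queue.Queue()
    results = {}  # stand-in for the shared results database
    lock = threading.Lock()

    def worker():
        while True:
            item = job_queue.get()
            if item is None:  # sentinel: no more jobs for this worker
                job_queue.task_done()
                return
            job_id, payload = item
            try:
                outcome = handler(payload)
            except Exception as error:  # keep the worker alive on job failure
                outcome = error
            with lock:
                results[job_id] = outcome
            job_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(pool_size)]
    for thread in threads:
        thread.start()
    for job in jobs:  # the "client" side only enqueues and returns quickly
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)
    job_queue.join()
    for thread in threads:
        thread.join()
    return results
```

The pool size plays the role of the customizable thread-pool size mentioned above: submissions are accepted immediately, while at most `pool_size` time-consuming jobs run at once.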


Abstract

Anti-CRISPRs are widespread amongst bacteriophage and promote bacteriophage infection by inactivating the bacterial host's CRISPR–Cas defence system. Identifying and characterizing anti-CRISPR proteins opens an avenue to explore and control CRISPR–Cas machineries for the development of new CRISPR–Cas based biotechnological and therapeutic tools. Past studies have identified anti-CRISPRs in several model phage genomes, but a challenge exists to comprehensively screen for anti-CRISPRs accurately and efficiently from genome and metagenome sequence data. Here, we have developed an ensemble learning based predictor, PaCRISPR, to accurately identify anti-CRISPRs from protein datasets derived from genome and metagenome sequencing projects. PaCRISPR employs different types of feature recognition united within an ensemble framework. Extensive cross-validation and independent tests show that PaCRISPR achieves a significantly more accurate performance compared with homology-based baseline predictors and an existing toolkit. The performance of PaCRISPR was further validated in discovering anti-CRISPRs that were not part of the training for PaCRISPR, but which were recently demonstrated to function as anti-CRISPRs for phage infections. Data visualization on anti-CRISPR relationships, highlighting sequence similarity and phylogenetic considerations, is part of the output from the PaCRISPR toolkit, which is freely available at http://pacrispr.erc.monash.edu/.

