Materials and Methods

Machine learning predicts new anti-CRISPR proteins

Data collection and preprocessing

To model anti-CRISPR protein identification as a machine learning problem, a dataset consisting of examples from both the positive (anti-CRISPR) and negative (non-anti-CRISPR) classes was needed. We collected anti-CRISPR information for proteins from Anti-CRISPRdb (37). At the time the work was initiated, the database contained information for 432 anti-CRISPR proteins. To ensure that the machine learning model generalizes well to protein sequences that do not share high sequence similarity with known anti-CRISPR proteins, a 40% sequence identity threshold was used (38). Proteins above this threshold are likely to share the same structure and possibly function (39), so the threshold provides a compromise between ensuring non-redundancy of the train and test datasets and retaining enough training examples for cross-validation. We used CD-HIT (40) to identify a non-redundant set (at the 40% sequence similarity threshold) of 20 experimentally verified Acrs (Supplementary Table S1). These proteins belong to different Acr classes: 12 are active against subtype I-F CRISPR-Cas systems, four against I-E and four against II-A (10,13,17,20,22). This set constitutes the positive class of our dataset. We downloaded the complete proteome of the source species of each of these proteins. Within these proteomes, any protein with 40% or higher sequence similarity to any protein in the set of known anti-CRISPR proteins was removed, and the remaining proteins were used to construct the negative dataset. For independent testing of the method, a dataset comprising 20 known Acrs separate from the training set (11–13,21,24,26,28,41) was used (Supplementary Table S2).
The Acrs belonging to the test set were chosen to cover the wide variety of known Acr mechanisms and sequences (42), while mainly consisting of the three subtypes the model was trained on. Source proteomes for all these proteins were downloaded, based on open reading frame predictions in the NCBI database.

Feature extraction

In line with existing machine learning based protein function prediction techniques, we used sequence features (43) based on amino acid composition and grouped dimer and trimer frequency counts (44). For this purpose, amino acids are first grouped into seven classes based on their physicochemical properties (44) (Supplementary Table S3), and the frequency counts of all possible dimers and trimers of these group labels in a given protein sequence are used in conjunction with amino acid composition. All three types of features (amino acid composition, dimer and trimer frequency counts) are normalized to unit norm, resulting in a 412-dimensional feature vector representation (20 amino acid frequencies, 7² = 49 grouped dimers and 7³ = 343 grouped trimers) for a given protein sequence (45,46).

Machine learning model

The underlying machine learning model for AcRanker has been built using eXtreme Gradient Boosting (XGBoost) (47). In machine learning, boosting is a technique in which multiple weak classifiers are combined to produce a strong classifier. XGBoost is a tree-based method (47) that applies boosting sequentially, i.e. each new tree tries to reduce the error left by its predecessors. XGBoost has been shown to be a fast and scalable learning algorithm and has been widely used in many machine learning applications (47).

In this work, we have used XGBoost as a pairwise ranking model to rank the constituent proteins of a given proteome in descending order of their expected Acr behavior. The XGBoost model is trained in a proteome-specific manner to produce higher scores for known anti-CRISPR proteins than for non-anti-CRISPR proteins in a given proteome.
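As an illustration of the sequence features described under Feature extraction, the following Python sketch builds a feature vector from amino acid composition (20 values), grouped dimer counts (7² = 49) and grouped trimer counts (7³ = 343). It is a minimal sketch under stated assumptions, not the AcRanker implementation: the seven-class grouping shown is the commonly used conjoint-triad grouping and stands in for Supplementary Table S3, which is not reproduced here, and normalizing each of the three blocks separately is likewise an assumption. The names `extract_features`, `GROUPS` and the helper functions are hypothetical.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: conjoint-triad style grouping into seven classes; the paper's own
# grouping is listed in its Supplementary Table S3, which is not shown here.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: str(i) for i, members in enumerate(GROUPS) for aa in members}


def _unit_norm(values):
    """Scale a list of counts to unit Euclidean norm (all-zero stays all-zero)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)


def _kmer_counts(labels, k):
    """Count overlapping k-mers of group labels, ordered over all 7**k k-mers."""
    counts = Counter(labels[i:i + k] for i in range(len(labels) - k + 1))
    return [counts["".join(kmer)] for kmer in product("0123456", repeat=k)]


def extract_features(sequence):
    """Return a 20 + 49 + 343 = 412-dimensional feature vector for one protein."""
    # Residues outside the 20 standard amino acids (e.g. X, B, Z) are skipped.
    seq = [aa for aa in sequence.upper() if aa in GROUP_OF]
    composition = Counter(seq)
    aac = [composition[aa] for aa in AMINO_ACIDS]
    labels = "".join(GROUP_OF[aa] for aa in seq)
    dimers = _kmer_counts(labels, 2)
    trimers = _kmer_counts(labels, 3)
    # ASSUMPTION: each of the three feature blocks is normalized separately.
    return _unit_norm(aac) + _unit_norm(dimers) + _unit_norm(trimers)
```

Calling `extract_features` on each protein of a proteome would give one row per protein, which could then be passed to a ranking model.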
In comparison to conventional XGBoost classification, the pairwise ranking model performed better at correctly identifying known anti-CRISPR proteins in test proteomes in cross-validation (Supplementary Table S4). Specifically, given a set of training proteomes each with one or more known anti-CRISPR proteins, our objective is to obtain an XGBoost predictor f(x; θ) with learnable parameters θ that generates a prediction score for a given protein sequence represented in terms of its feature vector x. In proteome-specific training, we require the model to learn optimal parameters θ* such that the score f(p; θ*) for a positive example p (known anti-CRISPR protein) is higher than the score f(n; θ*) for all negative examples n (non-anti-CRISPR proteins) within the same proteome. The hyperparameters of the learning model are selected through cross-validation, and optimal results are obtained with the number of estimators set at 120, a learning rate of 0.1, a subsampling of 0.6 and a maximum tree depth of 3.

Performance evaluation

To evaluate the performance of the machine learning model, we have performed leave-one-out cross-validation as well as validation over an independent test set. In a single fold of leave-one-out cross-validation, we set aside the source proteome of a given anti-CRISPR protein for testing and train on all other proteomes. To ensure an unbiased evaluation, all sequences in the training set with a sequence identity of 40% or higher with any test protein or among themselves are removed from the training set. Furthermore, all proteins in the test set with >40% sequence identity with known anti-CRISPR proteins in the training set are also removed. This ensures that there is only one known anti-CRISPR protein in the test set in a single fold. The XGBoost ranking model is then trained and the prediction scores for all proteins in the test set are computed. Ideally, the known anti-CRISPR protein should score the highest across all proteins in the given test proteome.
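The leave-one-out protocol and the rank of the known Acr within its held-out proteome can be illustrated with a short Python sketch. Here `train_fn` is a hypothetical stand-in for fitting the ranking model (XGBoost in this work), and the dictionary layout is an assumption made for illustration. The 40% identity filtering between training and test sets is omitted because it requires an external alignment or clustering tool.

```python
def rank_of_known_acr(scores, acr_id):
    """Return the 1-based rank of `acr_id` when proteins are sorted by descending score.

    Ties are broken by insertion order of `scores` (Python's sort is stable).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(acr_id) + 1


def leave_one_out(proteomes, train_fn):
    """Hold out each proteome in turn, train on the rest, rank the held-out known Acr.

    proteomes: dict  name -> {"proteins": {protein_id: features}, "acr": protein_id}
    train_fn:  callable that takes the training proteomes and returns a scorer
               (features -> float); a placeholder for fitting the ranking model.
    """
    ranks = {}
    for held_out, test in proteomes.items():
        training = {name: p for name, p in proteomes.items() if name != held_out}
        scorer = train_fn(training)
        scores = {pid: scorer(x) for pid, x in test["proteins"].items()}
        ranks[held_out] = rank_of_known_acr(scores, test["acr"])
    return ranks
```

A rank of 1 means that the known Acr scored higher than every other protein in its source proteome, which is the ideal outcome described above.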
This process is then repeated for all proteomes in our dataset. The rank of the known anti-CRISPR protein in its source proteome is used as a performance metric.

In bacteria, Acrs are usually located within prophage regions (13,48). Based on this premise, in another experiment for model evaluation, we passed only the proteins found within prophage regions to the model. To identify the prophage regions of a given bacterial proteome, we used the PHASTER (PHAge Search Tool Enhanced Release) web server (49), which accepts a bacterial genome and annotates the prophage regions in it. The decision scores are computed for all phage proteins identified by PHASTER in the test proteome.

To help assess AcRanker's performance during leave-one-out cross-validation, BLAST (Basic Local Alignment Search Tool) (50) similarity was used to set a minimum performance expectation. For each protein in a given test proteome, we compute blastp scores (with default parameters) against the set of known Acrs (excluding the tested protein) and rank the proteins in increasing order of their e-values.

For independent validation, the ranking-based XGBoost model trained on sequence features for all 20 source proteomes (Supplementary Table S1) was tested on recently discovered Acrs (Supplementary Table S2) that are not part of our training set. The rank of a known Acr in its corresponding proteome was computed. Here again, we evaluated the model on both the complete proteome of the organism and the respective mobile genetic element (MGE) subset identified by PHASTER.

AcRanker webserver

A webserver implementation of AcRanker is publicly available at http://acranker.pythonanywhere.com/. The webserver accepts a proteome file in FASTA format and returns a ranked list of proteins.
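The input and output of such a service (a FASTA proteome in, a ranked protein list out) can be sketched in a few lines of Python. This is not the webserver's code: `read_fasta`, `rank_proteome` and the `score_fn` argument are hypothetical names, and `score_fn` stands in for a trained ranking model applied to each sequence.

```python
def read_fasta(text):
    """Parse FASTA text into a dict {identifier: sequence}, preserving file order."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Use the identifier up to the first whitespace; name empty headers.
            header = fields[0] if fields else f"unnamed_{len(records)}"
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def rank_proteome(fasta_text, score_fn):
    """Return [(protein_id, score), ...] sorted from highest to lowest score."""
    proteins = read_fasta(fasta_text)
    scored = [(pid, score_fn(seq)) for pid, seq in proteins.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In practice, `score_fn` would compute the sequence features of a protein and return the model's prediction score for them.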
The Python code for the webserver implementation is available at https://github.com/amina01/AcRanker.

Acr candidate selection

Self-Targeting Spacer Searcher (STSS; https://github.com/kew222/Self-Targeting-Spacer-Searcher) (11) was run with default parameters using ‘Streptococcus’ as a search term for the NCBI genomes database, which returned a list of all self-targets found in those genomes. The presence of known acr genes in each of the self-targeting genomes was checked with a blastp search (default parameters) against the Acr proteins stored within STSS. Twenty self-targeting genomes that contained at least one self-target with a 3′-NRG PAM were chosen for further analysis with AcRanker. Prophage regions within each genome were predicted using PHASTER (49). The proteins found across all of the prophage regions predicted in a given genome were then ranked with AcRanker.

To select individual gene candidates for synthesis and biochemical validation, the 10 highest ranked proteins from each genome were examined by visual inspection for a strong promoter, a strong ribosome binding site and an intrinsic terminator. Promoters were searched for manually by looking for sequences closely matching the strong consensus promoter sequence TTGACA-17(±1)N-TATAAT upstream of the acr candidate gene, or of any genes immediately preceding it. A strong ribosome binding site (resembling AGGAGG) near the start codon was searched for in the same way, and its presence upstream of a gene candidate was required for selection.
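The promoter and ribosome binding site criteria were applied by visual inspection, but the same motifs can be expressed programmatically. The sketch below scans an upstream DNA sequence (assumed to end immediately before the start codon) for a -35/-10 pair resembling TTGACA-N(16 to 18)-TATAAT and for an AGGAGG-like site. The mismatch tolerances and the 20-nt search window are illustrative assumptions, not thresholds from this work, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(upstream, max_mismatches=2):
    """Find -35/-10 pairs resembling TTGACA-N(16..18)-TATAAT.

    Returns (start_index, spacer_length, total_mismatches) tuples,
    best (fewest mismatches) first. `max_mismatches` is an assumed tolerance.
    """
    seq = upstream.upper()
    hits = []
    for spacer in (16, 17, 18):  # 17 +/- 1 nt between the two hexamers
        span = 6 + spacer + 6
        for i in range(len(seq) - span + 1):
            mismatches = (hamming(seq[i:i + 6], "TTGACA")
                          + hamming(seq[i + 6 + spacer:i + span], "TATAAT"))
            if mismatches <= max_mismatches:
                hits.append((i, spacer, mismatches))
    return sorted(hits, key=lambda hit: hit[2])


def find_rbs(upstream, motif="AGGAGG", max_mismatches=1, window=20):
    """Find motif-like sites in the last `window` nt before the start codon.

    Returns (distance_to_start_codon, mismatches) tuples, where the distance
    is the number of nucleotides between the end of the site and the start codon.
    """
    seq = upstream.upper()
    first = max(0, len(seq) - window)
    hits = []
    for i in range(first, len(seq) - len(motif) + 1):
        mismatches = hamming(seq[i:i + len(motif)], motif)
        if mismatches <= max_mismatches:
            hits.append((len(seq) - (i + len(motif)), mismatches))
    return hits
```

Such a scan would only pre-filter candidates; the intrinsic terminator check and the final judgment of promoter strength are not covered by this sketch.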
Last, because Acrs tend to be clustered together, genes neighboring the best candidates were also selected for further testing and validation and comprise part of the 10-member candidate test set.

Protein expression and purification

Each of the Acr candidates (Supplementary Table S5) was cloned into a custom pET-based expression vector such that each protein was N-terminally tagged with a 10×His sequence, superfolder GFP and a tobacco etch virus (TEV) protease cleavage site; the plasmids are available on Addgene (#140995–141004). The Cas effectors (Supplementary Table S6), Acidaminococcus sp. Cas12a (AsCas12a), Streptococcus pyogenes Cas9 (SpyCas9), Staphylococcus aureus Cas9 (SauCas9) and Streptococcus iniae Cas9 (SinCas9, Addgene #141076), were expressed as N-terminal MBP fusions. Proteins were produced and purified as previously described (33). Briefly, Escherichia coli Rosetta2 (DE3) cells containing Acr or Cas9 expression plasmids were grown in Terrific Broth (100 μg/ml ampicillin) to an OD600 of 0.6–0.8, cooled on ice, induced with 0.5 mM isopropyl-β-D-thiogalactoside and incubated with shaking at 16°C for 16 h. Cells were harvested by centrifugation, resuspended in wash buffer (20 mM Tris–Cl (pH 7.5), 500 mM NaCl, 1 mM tris(2-carboxyethyl)phosphine (TCEP), 5% (v/v) glycerol) supplemented with 0.5 mM phenylmethanesulfonyl fluoride and cOmplete protease inhibitor (Roche), lysed by sonication, clarified by centrifugation and purified over Ni-NTA Superflow resin (Qiagen) in wash buffer supplemented with 10 mM (wash) or 300 mM (elution) imidazole. Elution fractions were pooled and digested overnight with recombinantly expressed TEV protease while being dialyzed against dialysis buffer (20 mM Tris–Cl (pH 7.5), 125 mM NaCl, 1 mM TCEP, 5% (v/v) glycerol) at 4°C. For SpyCas9, SauCas9 and SinCas9, the cleaved proteins were loaded onto an MBP-Trap (GE Healthcare) upstream of a Heparin Hi-Trap (GE Healthcare).
Depending on the pI, TEV-digested Acrs were loaded onto a Q (ML1, ML2, ML3, ML6, ML8 and ML10), heparin (ML4 and ML5) or SP (ML7 and ML9) Hi-Trap column. Proteins were eluted over a salt gradient (20 mM Tris–Cl (pH 7.5), 1 mM TCEP, 5% (v/v) glycerol, 125 mM–1 M KCl). The eluted proteins were concentrated and loaded onto a Superdex S200 Increase 10/300 (GE Healthcare) for SpyCas9, SauCas9 and SinCas9 or a Superdex S75 Increase 10/300 (GE Healthcare) for all the Acr candidates, and developed in gel filtration buffer (20 mM HEPES-K (pH 7.5), 200 mM KCl, 1 mM TCEP and 5% (v/v) glycerol). The absorbance at 280 nm was measured by Nanodrop and the concentration was determined using an extinction coefficient estimated from the primary amino acid sequence of each protein. Purified proteins were concentrated to approximately 50 μM for Cas9 effectors and 100 μM for Acr candidates. Proteins were then snap-frozen in liquid nitrogen for storage at –80°C. Purity and integrity of the proteins were assessed by 4–20% gradient SDS-PAGE (Coomassie blue staining, Supplementary Figure S2A) and LC–MS (Supplementary Figure S2B).

RNA preparation

All RNAs (Supplementary Table S7) were transcribed in vitro using recombinant T7 RNA polymerase and purified by gel extraction as described previously (51). Briefly, 100 μg/ml T7 polymerase, 1 μg/ml pyrophosphatase (Roche), 800 units RNase inhibitor, 5 mM ATP, 5 mM CTP, 5 mM GTP, 5 mM UTP and 10 mM DTT were mixed with DNA target in transcription buffer (30 mM Tris–Cl pH 8.1, 25 mM MgCl2, 0.01% Triton X-100, 2 mM spermidine) and incubated overnight at 37°C. The reaction was quenched by adding 5 units RNase-free DNase (Promega). Transcription reactions were purified by 12.5% (v/v) urea-denaturing PAGE (0.5× Tris–borate–EDTA (TBE)) and ethanol precipitation.

In vitro cleavage assay

In vitro cleavage assays were performed at 37°C in 1× cleavage buffer (20 mM Tris–HCl pH 7.5, 100 mM KCl, 5 mM MgCl2, 1 mM DTT and 5% glycerol (v/v)) targeting a PCR-amplified fragment of double-stranded DNA (Supplementary Table S8). For all cleavage reactions, the sgRNA was first incubated at 95°C for 5 min and cooled down to room temperature. The Cas effectors (SpyCas9, SauCas9 and AsCas12a at 100 nM; SinCas9 at 200 nM) were incubated with each candidate Acr protein at 37°C for 10 min before the addition of sgRNA (SpyCas9, SauCas9 and AsCas12a sgRNA at 160 nM; SinCas9 sgRNA at 320 nM) to form the ribonucleoprotein (RNP) at 37°C for 10 min. The DNA cleavage reaction was then initiated by the addition of DNA target, and reactions were incubated for 30 min at 37°C before quenching in 1× quench buffer (5% glycerol, 0.2% SDS, 50 mM EDTA). Samples were then loaded directly onto a 1% (w/v) agarose gel stained with SYBRGold (ThermoFisher) and imaged with a BioRad ChemiDoc.

Competition binding experiment

The reconstitution of the SinCas9–sgRNA–ML1 and SinCas9–sgRNA–AcrIIA2 complexes was carried out as previously described (52). Briefly, purified SinCas9 and in vitro transcribed sgRNA were incubated in a 1:1.6 molar ratio at 37°C for 10 min to form the RNP. To form the inhibitor-bound complexes, a 10-fold molar excess of AcrIIA20 (ML1) or AcrIIA2 was added and incubated with the RNP complex at 37°C for 10 min. For the competition binding experiment, a 10-fold molar excess of AcrIIA20 was first incubated with the RNP complex at 37°C before incubation with a 10-fold molar excess of AcrIIA2 at 37°C for 10 min. Each complex was then purified by analytical size-exclusion chromatography (Superdex S200 Increase 10/300 GL column, GE Healthcare) pre-equilibrated with the gel filtration buffer (20 mM HEPES-K (pH 7.5), 200 mM KCl, 1 mM TCEP and 5% (v/v) glycerol) containing 1 mM MgCl2.
The peak fractions were concentrated by spin concentration (3-kDa cutoff, Merck Millipore), quenched in 1× SDS loading dye (2% w/v SDS, 0.1% w/v bromophenol blue and 10% v/v glycerol) and boiled down to 20 μl before loading onto a 4–20% gradient SDS-PAGE.

Mass spectrometry

Protein samples were analyzed using a Synapt mass spectrometer as described elsewhere (53).


Abstract

A webserver implementation of AcRanker is publicly available at http://acranker.pythonanywhere.com/. The Python code for the webserver implementation is available in the GitHub repository (https://github.com/amina01/AcRanker).

