CRISPR spacers and sequence data
221 089 spacers along with information on cas gene presence, genome and repeat sequence were obtained from CRISPRCasDb (Pourcel et al., 2020) in February 2020 and the taxonomy of the genomes was obtained from NCBI Taxonomy database (Federhen, 2012). We created our own sequence database by combining all sequences from the NCBI nucleotide database (Benson et al., 2018; Pruitt et al., 2005), environmental nucleotide database (Sayers et al., 2009), PHASTER (Arndt et al., 2016), Mgnify (Mitchell et al., 2020), IMG/M (I. M. A. Chen et al., 2017), IMG/Vr (Paez-Espino et al., 2019), HuVirDb (Soto-Perez et al., 2019), HMP database (Peterson et al., 2009), and data from Pasolli et al., 2019. All databases were accessed in February 2020.
Subtypes were predicted based on the repeat sequences using the subtype predictions and method described by Bernheim et al., 2020, where the subtype of a spacer was inferred by the similarity of its repeat sequence to repeat sequences with known subtype (74% identity threshold to infer subtype).
Blast hits and filtering
Hits between spacers and sequences from the aforementioned databases were obtained using the command line blastn program (Altschul et al., 1990) version 2.10.0, which was run with parameters word_size 10, gapopen 10, penalty 1 and an e-value cutoff of 1, to find as many potential targets as possible. These blast hits were then filtered to remove hits of spacers inside CRISPR arrays and false positive hits found by chance. Hits inside CRISPR arrays were detected by aligning the repeat sequence of the spacer to the flanking regions of the spacer hit (23 nucleotides on both sides). This alignment was done using the globalxs function from the Biopython pairwise2 package (Cock et al., 2009) with −3 gap open and −3 gap extend parameters. If more than 13 nucleotides were identical in the alignment of at least one flank, the hit was suspected to fall inside a CRISPR array and was filtered out.
To minimize the number of hits found by chance, we filtered hits based on the fraction of spacer nucleotides that hit the target sequence, as this metric considers both the sequence identity and the coverage of the spacer by the blast hit. In a first step, only hits with this fraction higher than 90% were kept. To find targets for even more spacers while keeping the number of false positives low, we included a second step where hits with a fraction higher than 80% were kept if another spacer from the same genus hit the same contig or genome in the first step. This second step did not introduce hits on any new contigs or genomes and was based on the assumption that multiple spacers from the same genus hitting the same contig or genome is unlikely to be caused by chance. Finally, we removed spacers that were shorter than 27 nucleotides (54 spacers) and removed 7 spacers that were hitting aspecifically, such as inside ribosomal RNAs or tRNAs. This left 72,099 unique spacers with target hits for downstream analysis.
Protospacer flank alignment for orientation and PAM predictions
The PAM is known to occur on the 5’ end of the protospacer for Type I, Type IV and V CRISPR-Cas systems, and on the 3’ end for Type II systems (Collias & Beisel, 2021; Jackson et al., 2017). We used this property to predict the orientation of transcription of CRISPR arrays and sequence of crRNA. The PAM sides were compared to the nucleotide conservation in the flanking regions of the spacer hits and the spacer orientations were predicted such that the flank with the greater conservation matched the known PAM side.
To measure the nucleotide conservation in the flanking regions, data from multiple spacers was combined based on the subtype and repeat sequences of the spacers. Highly similar repeat sequences from the same subtype were clustered using CD-HIT (Fu et al., 2012) with a 90% identity threshold. We hypothesized that similar repeat sequences would be used in a similar orientation and would utilize the same PAM sequences, as coevolution of PAM, repeat and Cas1 and Cas2 sequences has been shown previously (Alkhnbashi et al., 2014; Lange et al., 2013). For each repeat cluster the flanking regions of the spacer hits were aligned. To equally weigh each spacer within the repeat cluster, irrespective of the number of blast hits, consensus flanks were obtained per spacer. These consensus flanks contained the most frequent nucleotide per position of the flanking regions. From the alignment of consensus flanks the nucleotide conservation, or information content, in each flank was calculated in bitscore (Schneider & Stephens, 1990) using the Sequence logo python package. We corrected for GC-content of the targeted sequences by calculating the expected occurrences of each nucleotide based on the GC-content of the spacer sequences. To minimize the number of orientation predictions based on little or noisy data, we only predicted the orientation for repeat clusters when the alignment of consensus flanks consisted of at least 10 unique protospacers. Furthermore, the information content of at least two positions was higher than 0.3 bitscore and higher than 5 times the median bitscore calculated from 23-nt flanks on both sides. These parameters were chosen as strictly as possible, while still yielding orientation predictions for the highest number of spacers.
Using the orientation predictions described above, we predicted the PAMs for each repeat cluster by checking which nucleotide positions were conserved. To minimize PAM predictions based on noise, we only predicted the PAM for repeat clusters where the alignment of consensus flanks consisted of at least 10 unique protospacers. A nucleotide position was predicted to be part of the PAM when higher than 0.5 bitscore and higher than 10 times the median bitscore. These parameters were chosen as strictly as possible, while maximizing the number of repeat clusters with PAM predictions and minimizing the number of unique PAMs predicted.
We subsequently categorized and counted multi-effector compatible spacers in the following ways. Firstly by an occurrence of multiple repeat clusters with different subtype classification that both contained the same PAM, for two DNA targeting clusters (category I) or a DNA and a RNA targeting cluster (category II). Secondly if multiple cas gene clusters from different subtypes were in the vicinity of a single repeat cluster and their genomes did not further contain other arrays linked to these cas gene clusters they were counted as a third category multi-effector compatible array.
Coding versus template strand targeting analysis
For each spacer target inside an open reading frame (ORF), we determined if the spacer targets the coding (DNA and RNA) or template strand (DNA-only) during transcription. The ORFs and its orientation were predicted using Prodigal (Hyatt et al., 2010) for one target sequence per spacer. The target sequence of each spacer was selected as the longest hit sequence in the NCBI nucleotide database, excluding ‘other sequences’, or, if no such sequence was hit, the longest hit sequence in metagenomics database. Using our spacer orientation predictions for Type I, II and IV spacers, and the orientation predictions from CRISPRCasDb for the other spacers, we checked if the spacer target (blast hit orientation) was on the coding or template strand of the predicted ORF. To test for significant bias towards either the temperate or the coding strand, a two-sided tailed binomial test was performed with an expected probability of 0.5.
Article TitleComprehensive PAM prediction for CRISPR-Cas systems reveals evidence for spacer sharing, preferred strand targeting and conserved links with CRISPR repeats
The adaptive CRISPR-Cas immune system stores sequences from past invaders as spacers in CRISPR arrays and thereby provides direct evidence that links invaders to hosts. Mapping CRISPR spacers has revealed many aspects of CRISPR biology, including target requirements such as the protospacer adjacent motif (PAM). However, studies have so far been limited by a low number of mapped spacers in the database. By using vast metagenomic sequence databases, we mapped one third (∼70,000) of more than 200,000 unique CRISPR spacers from a variety of microbes, and derived a catalog of more than one hundred unique PAM sequences associated with specific CRISPR subtypes. These PAMs were further used to correctly assign the orientation of CRISPR arrays, revealing conserved patterns between the last nucleotides of the CRISPR repeat and PAM. From the curated CRISPR arrays dataset we could also deduce CRISPR subtype specific preferences for targeting either template or coding strand of open reading frames. While some DNA-targeting systems (e.g. Type I-E and Type II systems) prefer the template strand and avoid mRNA, other DNA- and RNA-targeting systems (i.e. Type I-A, I-B and Type III systems) prefer the coding strand and mRNA. In addition, we found large scale evidence that both CRISPR adaptation machinery and CRISPR arrays are shared between different CRISPR-Cas systems. This could lead to simultaneous DNA- and RNA targeting of invaders, which may be effective at combating mobile genetic invaders.