Detection of CRISPR arrays
The complete genome collection of the PATRIC database (19) (a total of 110 334 genomes) was used in our analysis. CRISPR arrays were predicted for each genome using CRISPRDetect 2.2.1 (20) with a quality score cut-off of 3.
Detection of self-targeting spacers
All spacers were blasted (blastn-short option, DUST disabled, e-value cut-off of 1, gap open, and gap extend penalty of 10) against the source genome. The blastn results were filtered for a minimum identity higher than 90% with the target. Any hit on the genome was considered a self-target, except for those within all of the predicted CRISPR arrays, including arrays identified with a CRISPRDetect quality score <3. Hits closer than 500 bp from each end of the predicted arrays were also ignored to avoid considering spacers from the array that were possibly not identified by CRISPRDetect. Spacers with flanking repeats of identity score lower than 75% to each other were discarded as they may have been erroneously identified as spacers. Of these, only spacers smaller than 70 bp and a repeat size between 24 and 50 bp were retained in the dataset. Finally, STS from CRISPR arrays of two or fewer spacers were excluded, except when the associated repeat belonged to a known CRISPR repeat family, as identified by CRISPRDetect. Duplicates were removed by search of similar genomes, contigs and arrays.
Classification of CRISPR–Cas systems
The CRISPR–Cas systems of STS-containing genomes were classified using MacsyFinder (21) in combination with Prodigal (22), and the CRISPR-type definitions and Hidden Markov Models (HMM) profiles of CRISPRCasFinder (23). The classification of the repeat family of the CRISPR array was obtained using CRISPRDetect. Genomes carrying two or more CRISPR–Cas types were labeled as ‘mixed’, and those having CRISPR–Cas arrays but no cas genes were labeled as ‘no Cas’. Systems which could not be assigned a CRISPR sub-type and which were missing at least one cas gene (but contained no less than one cas gene) were classified as ‘incomplete’. The final classification of each genome can be found in Supplementary Table S1.
Analysis of the genomic target
The orientation of the arrays was determined by CRISPRDetect using the default parameters of CRISPRDirection. After this, the STS sequence was used for a gapless blastn at the target and to retrieve the PAM downstream or upstream of the STS based on the CRISPRDetect classification (see Supplementary Table S1). The targets were then analyzed for the correct PAM sequence by comparison with the expected PAM for the different CRISPR–Cas types as previously described (11,24,25). The consensus PAM sequences used in this analysis are shown in Supplementary Table S2. Genes of STS-containing genomes were predicted using Prodigal and annotated using Interproscan (26) and Pfam (27) domain prediction. Prophage regions in the genomes were detected using VirSorter (28), and used to identify STS targeting these regions. Transposons were also detected in the genomes using Interproscan (26) (Supplementary Table S3). Targets of the STS with e-value <10−5 were grouped by function to identify possibly enriched hits separately for prophage and endogenous regions. Only those hits associated with predicted correct PAMs were subjected to this analysis.
Distance between self-targeting spacer and prophages
Contigs predicted to contain prophages were extracted and used to create a hit density map based on STS distance to prophage(s).
Identification of anti-CRISPR proteins
The amino acid sequences of known Acrs (29) were used for similarity search in the STS-containing genomes using BLASTp with an e-value limit of 10−5.
A binomial test was performed on CRISPR arrays of different sizes to test the hypothesis that STS at the leader side of the CRISPR array are more common. Only STS from CRISPR arrays <50 spacers were considered because larger arrays are too scarce to result in a reliable statistical analysis. A chi-squared test was used to determine statistical significance between percentages of populations. Statistical significance was considered for P < 0.05.
GNU parallel was used to parallelize tool runs and for parsing of output files (30). Biopython package (31) functions were used for specific analysis, such as GFF parser for prodigal files, pairwise2 for removing false positives based on repeat identity, and nt_search for matching of the PAM. All data collected was managed using Python package Pandas (32). Python packages SciPy (33), Matplotlib (34) and Seaborn (35) were used for statistical analysis and visualization.
CRISPR–Cas systems require discriminating self from non-self DNA during adaptation and interference. Yet, multiple cases have been reported of bacteria containing self-targeting spacers (STS), i.e. CRISPR spacers targeting protospacers on the same genome. STS has been suggested to reflect potential auto-immunity as an unwanted side effect of CRISPR–Cas defense, or a regulatory mechanism for gene expression. Here we investigated the incidence, distribution, and evasion of STS in over 100 000 bacterial genomes. We found STS in all CRISPR–Cas types and in one fifth of all CRISPR-carrying bacteria. Notably, up to 40% of I-B and I-F CRISPR–Cas systems contained STS. We observed that STS-containing genomes almost always carry a prophage and that STS map to prophage regions in more than half of the cases. Despite carrying STS, genetic deterioration of CRISPR–Cas systems appears to be rare, suggesting a level of escape from the potentially deleterious effects of STS by other mechanisms such as anti-CRISPR proteins and CRISPR target mutations. We propose a scenario where it is common to acquire an STS against a prophage, and this may trigger more extensive STS buildup by primed spacer acquisition in type I systems, without detrimental autoimmunity effects as mechanisms of auto-immunity evasion create tolerance to STS-targeted prophages.