Software and code availability
Scripts for downloading data and reproducing all analyses are available at https://github.com/Russel88/CRISPRCas_on_Plasmids. Analyses were made with a combination of shell, python 3, and R 3.6.3 scripting. Plots were made with ggplot2, heatmaps with pheatmap, phylogenetic trees with iTOL (Letunic and Bork, 2021), and networks with gephi (Bastian, Heymann and Jacomy, 2009).
A total of 27,939 complete bacterial plasmid sequences were downloaded from PLSDB 202011_19 (https://ccb-microbe.cs.uni-saarland.de/plsdb) (Galata **_et al., 2019), together with their associated metadata (Galata** et al., 2019). A total of 253 manually curated archaeal plasmids were downloaded from NCBI RefSeq on January 6th 2020. Plasmid-host chromosome associations were determined through the NCBI assembly information, for which only sequences annotated as “chromosome” were included as host sequences. Using this approach, we were able to assign a host for 21,974 of the plasmids. The number of archaeal plasmids selected is relatively low because few archaeal plasmids have been characterised and sequenced (Lipps, 2007). We used GTDBtk v1.4.1 (Chaumeil et al., 2019) to re-annotate the taxonomy of the host of each plasmid in a common phylogenomic framework. To filter out redundant plasmids, they were de-replicated using dRep version 3.1.0 (Olm et al., 2017) with the following parameters: 90% ANI cut-off for primary clustering, 95% ANI cut-off for secondary clustering and a total coverage of 90%, with fastANI (Marçais et al., 2018) as secondary clustering algorithm. Size was the only criterion used to choose the plasmid to include in each cluster, such that the largest plasmid (or random among these given ties) was picked among the clustered plasmids. Dereplication resulted in a total of 17,828 plasmids, out of which 13,265 could be associated with known prokaryotic hosts.
Identification of CRISPR loci
Detection of CRISPR arrays was carried out by using CRISPRCasFinder 4.2.17 (Couvin et al., 2018), coupled to an optimized algorithm for false-positive array removal (Supplementary Figure S12) and an additional analysis for finding CRISPR loci that are commonly missed by this algorithm. Briefly, high confidence arrays predicted by CRISPRCasFinder (evidence level 4) were automatically kept. The remaining arrays were binned into a “quarantine list” if they were found to clear a series of conservative manually-curated parameter cutoffs: 1) calculated average CRISPR repeat conservation across the array > 70%, 2) spacer conservation < 50%, 3) standard error of the mean of the array’s spacer lengths < 3, and 4) array does not overlap with an open reading frame (ORF) with a prediction confidence of at least 90% (Hyatt et al., 2010). Putative arrays from the quarantined list were rescued for further analysis if they were found within 1 Kb to a predicted cas gene or matched (95% coverage and 95% identity) with any previously defined high confidence CRISPR repeat: CRISPRCasFinder evidence level 4 or archived in CRISPRCasdb (Pourcel et al., 2020). This upgrade reduced the rate of detection of false positive CRISPRs, most of which constitute short repetitive genomic regions that are erroneously selected by CRISPRCasFinder (Zhang and Ye, 2017), and which are more common on plasmids (e.g. iterons and tandem transposon-associated repeats) (Chattoraj, 2000; Giraldo and Fernández-Tresguerres, 2004; Oliveira et al., 2010). High confidence CRISPR repeats (see above) were then BLASTed (task: blastn-short, 95% coverage and 95% identity) to a database in which the CRISPR loci that were already detected were masked and any matches within 100 bp were clustered into arrays. Arrays with less than 3 repeats were excluded from all analyses.
Identification and typing of cas loci
The prediction and classification (at the subtype or variant level) of cas operons was carried out by CRISPRCasTyper 1.2.4 (https://github.com/Russel88/CRISPRCasTyper) (Russel et al., 2020). CRISPR arrays closer than 10 Kbp to the nearest cas operon were considered to be linked; the 10 Kbp cutoff was based on an analysis of the distribution of distances of CRISPR arrays to the closest cas operon (Supplementary Figure S13). Furthermore, we used CRISPR-repeat similarity information to type arrays that were not found linked to cas operons. These distant arrays (>10 Kbp from the nearest cas operon) were considered associated with a cas operon if the direct repeat sequence was at least 85% identical to the direct repeat sequence of an array adjacent to that cas operon (Supplementary Figure S14). When possible, CRISPR-Cas systems annotated as “Ambiguous” were manually subtyped.
Enrichment of certain CRISPR-Cas subtypes on either plasmids or host chromosomes was investigated with an indicator species analysis, using the indicspecies R package. For the comparison between all plasmids and chromosomes the IndVal.g statistic was used, which controls for difference in group sizes. For the direct comparison between plasmids and hosts chromosomes, where both carry CRISPR-Cas, the IndVal statistic was used. Statistical significance was determined by permutation (n=9999) and a Bonferroni adjusted p-value threshold of 0.05 was used.
Plasmid conjugative transfer and incompatibility group prediction
The conjugative transfer functions and incompatibility (Inc) typing of all plasmids in PLSDB was predicted with MOB-suite v3.0.1 using mob_typer function (Robertson and Nash, 2018) using default parameters.
Spacer-protospacer match analysis
The genomic regions where CRISPR arrays were identified on plasmids (including CRISPR arrays with 2 repeats, which were otherwise excluded from the analyses) were masked in order to avoid false positive matches to spacers in arrays. Furthermore, for matches to plasmids only matches to high confidence ORFs were included, also to rule out any matches to possibly undetected CRISPR arrays. Spacers from orphan arrays whose consensus repeat could not be typed by repeatTyper from CRISPRCasTyper (https://typer.crispr.dk, model version 202103 (Russel **_et al., 2020**)) were excluded from the spacer analysis to avoid any bias stemming from possible false positive arrays in this group.
Viral genomes were obtained from the IMG/VR v3 (2020-10-125.1, (Roux **_et al., 2021)) only including those annotated as “Reference”, which includes 39,296 viral genomes. Spacer sequences from plasmids and plasmid-associated host chromosomes were aligned against the masked dereplicated plasmid database and the virus database using FASTA 36.3.8e (Pearson, 2014**). Alignments were filtered using an e-value cutoff of 0.05. To reduce redundancy bias, spacers were only counted once, no matter the absolute number of matches.
Networks were visualized in gephi with layout generated by a combination of OpenOrd and Noverlap algorithms. For calculating taxonomic confinement of spacer-protospacer matches between plasmids, each pair of plasmids connected by at least one spacer-protospacer match was counted as one matching pair. Cross-targeting plasmids were included as two separate plasmid pairs. Confinement was calculated as the number of matches found exclusively within a specific taxonomic rank, such that each plasmid-plasmid pair was only counted once. For estimating confinement of random spacer-protospacer matching, the taxonomic annotations were permuted among the plasmid-plasmid pairs with observed spacer-protospacer matches. This was repeated 100 times and the median number of matches was used as an estimate of confinement for hypothetically random matches. For estimating targeting bias towards conjugative vs. non-conjugative plasmids each unique spacer was counted with a weight of 1 with the targeting bias proportional to the number of matches to conjugative and non-conjugative plasmids, respectively. For example, a spacer matching 4 conjugative plasmids and 1 non-conjugative plasmids is counted as 0.8 for conjugative matches and 0.2 for non-conjugative matches.
Article TitleCRISPR-Cas systems are widespread accessory elements across bacterial and archaeal plasmids
Many prokaryotes encode CRISPR-Cas systems as immune protection against mobile genetic elements (MGEs), yet, a number of MGEs also harbor CRISPR-Cas components. With a few exceptions, CRISPR-Cas loci encoded on MGEs are uncharted and a comprehensive analysis of their distribution, prevalence, diversity, and function is lacking. Here, we systematically investigated CRISPR-Cas loci across the largest curated collection of natural bacterial and archaeal plasmids. CRISPR-Cas loci are widely but heterogeneously distributed across plasmids and, in comparison to host chromosomes, their mean prevalence per Mbp is higher and their distribution is markedly distinct. Furthermore, the spacer content of plasmid CRISPRs exhibits a strong targeting bias towards other plasmids, while chromosomal arrays are enriched with virus-targeting spacers. These contrasting targeting preferences dominate across the diversity of CRISPR-Cas subtypes and host taxa, highlighting the genetic independence of plasmids and suggesting a major role of CRISPR-Cas for mediating plasmid-plasmid conflicts. Altogether, CRISPR-Cas are frequent accessory components of many plasmids, which is an overlooked phenomenon that possibly facilitates their dissemination across microbiomes.