Materials and Methods

CRISPR-Cas systems are widespread accessory elements across bacterial and archaeal plasmids

Software and code availability

Scripts for downloading data and reproducing all analyses are available at [URL missing]. Analyses were made with a combination of shell, Python 3, and R 3.6.3 scripting. Plots were made with ggplot2, heatmaps with pheatmap, phylogenetic trees with iTOL (38), and networks with Gephi (39).

Dataset construction

A total of 27 939 complete bacterial plasmid sequences were downloaded from PLSDB 2020_11_19 (40), together with their associated metadata. A total of 253 manually curated archaeal plasmids were downloaded from NCBI RefSeq on 6 January 2020. Plasmid-host chromosome associations were determined through the NCBI assembly information, for which only sequences annotated as ‘chromosome’ were included as host sequences. Using this approach, we were able to assign a host for 21 974 of the plasmids. The number of archaeal plasmids selected is relatively low because few archaeal plasmids have been characterised and sequenced. We used GTDBtk v1.4.1 (41) to re-annotate the taxonomy of the host of each plasmid in a common phylogenomic framework. To filter out redundant plasmids, they were de-replicated using dRep version 3.1.0 (42) with the following parameters: 90% ANI cut-off for primary clustering, 95% ANI cut-off for secondary clustering and a total coverage of 90%, with fastANI (43) as the secondary clustering algorithm. Size was the only criterion used to choose the representative plasmid of each cluster: the largest plasmid was picked, with ties broken at random. Dereplication resulted in a total of 17 828 plasmids, out of which 13 265 could be associated with known prokaryotic hosts.

Identification of CRISPR loci

Detection of CRISPR arrays was carried out using CRISPRCasFinder 4.2.17 (44), coupled to an optimized algorithm for false-positive array removal (Supplementary Figure S1) and an additional analysis for finding CRISPR loci that are commonly missed by this algorithm.
Briefly, high confidence arrays predicted by CRISPRCasFinder (evidence level 4) were automatically kept. The remaining arrays were binned into a ‘quarantine list’ if they were found to clear a series of conservative manually-curated parameter cutoffs: (i) calculated average CRISPR repeat conservation across the array >70%, (ii) spacer conservation <50%, (iii) standard error of the mean of the array's spacer lengths <3 and (iv) array does not overlap with an open reading frame (ORF) with a prediction confidence of at least 90% (45). Putative arrays from the quarantine list were rescued for further analysis if they were found within 1 kb of a predicted cas gene or matched (95% coverage and 95% identity) any previously defined high confidence CRISPR repeat: CRISPRCasFinder evidence level 4 or archived in CRISPRCasdb (46). This upgrade reduced the rate of detection of false positive CRISPRs, most of which constitute short repetitive genomic regions that are erroneously selected by CRISPRCasFinder (47), and which are more common on plasmids (e.g. iterons and tandem transposon-associated repeats) (48–50). High confidence CRISPR repeats (see above) were then BLASTed (task: blastn-short, 95% coverage and 95% identity) to a database in which the CRISPR loci that were already detected were masked, and any matches within 100 bp were clustered into arrays. Arrays with fewer than three repeats were excluded from all analyses.

Identification and typing of cas loci

The prediction and classification (at the subtype or variant level) of cas operons were carried out with CRISPRCasTyper 1.2.4 (51). CRISPR arrays closer than 10 kb to the nearest cas operon were considered to be linked; the 10 kb cutoff was based on an analysis of the distribution of distances of CRISPR arrays to the closest cas operon (Supplementary Figure S2). Furthermore, we used CRISPR-repeat similarity information to type arrays that were not found linked to cas operons.
These distant arrays (>10 kb from the nearest cas operon) were considered associated with a cas operon if the direct repeat sequence was at least 85% identical to the direct repeat sequence of an array adjacent to that cas operon (Supplementary Figure S3). When possible, CRISPR-Cas systems annotated as ‘Ambiguous’ were manually subtyped. The identified CRISPR-Cas loci on plasmids, plasmid-associated host chromosomes and related information are found in Supplementary Datasheet S1.

Indicator analysis

Enrichment of certain CRISPR-Cas subtypes on either plasmids or host chromosomes was investigated with an indicator species analysis, using the indicspecies R package. For the comparison between all plasmids and chromosomes the IndVal.g statistic was used, which controls for differences in group sizes. For the direct comparison between plasmids and host chromosomes, where both carry CRISPR-Cas, the IndVal statistic was used. Statistical significance was determined by permutation (n = 9999) and a Bonferroni adjusted P-value threshold of 0.05 was used.

Plasmid conjugative transfer and incompatibility group prediction

The conjugative transfer functions and incompatibility (Inc) types of all plasmids in PLSDB were predicted with MOB-suite v3.0.1 using the mob_typer function (52) with default parameters.

Spacer-protospacer match analysis

The genomic regions where CRISPR arrays were identified on plasmids (including CRISPR arrays with two repeats, which were otherwise excluded from the analyses) were masked in order to avoid false positive matches to spacers in arrays. Furthermore, for matches to plasmids, only matches to high confidence ORFs were included, also to rule out any matches to possibly undetected CRISPR arrays.
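The masking and ORF-restriction steps described above can be illustrated with a minimal Python sketch. It is not taken from the authors' scripts: the function names, the use of 'N' as the masking character, the 0-based half-open coordinates and the rule that a match must lie entirely inside an ORF are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

Region = Tuple[int, int]  # hypothetical 0-based, half-open [start, end) interval


def mask_regions(sequence: str, regions: Iterable[Region], mask_char: str = "N") -> str:
    """Return a copy of `sequence` with every region replaced by `mask_char`.

    Assumption: masking is done by overwriting bases with 'N'; the source
    only states that the CRISPR array regions were masked.
    """
    masked = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range coordinates are harmless.
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


def hit_within_orf(hit: Region, orfs: Iterable[Region]) -> bool:
    """True if the match lies entirely inside at least one ORF.

    Assumption: 'match to an ORF' is interpreted as full containment.
    """
    hit_start, hit_end = hit
    return any(orf_start <= hit_start and hit_end <= orf_end
               for orf_start, orf_end in orfs)


# Toy usage: mask a 3 bp 'array' and test two hypothetical matches.
plasmid = "ACGTACGTAC"
print(mask_regions(plasmid, [(2, 5)]))
print(hit_within_orf((6, 9), [(5, 10)]))
print(hit_within_orf((3, 9), [(5, 10)]))
```

Masking before alignment, rather than discarding hits afterwards, keeps the plasmid coordinates unchanged, so ORF annotations can still be applied to the surviving matches.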
Spacers from orphan arrays whose consensus repeat could not be typed by repeatTyper from CRISPRCasTyper (model version 2021_03 (51)) were excluded from the spacer analysis to avoid any bias stemming from possible false positive arrays in this group.

Viral genomes were obtained from IMG/VR v3 (2020-10-12_5.1, (53)), only including those annotated as ‘Reference’, which comprise 39 296 viral genomes. Spacer sequences from plasmids and plasmid-associated host chromosomes were aligned against the masked dereplicated plasmid database and the virus database using FASTA 36.3.8e (54). Alignments were filtered using an E-value cutoff of 0.05. To reduce redundancy bias, spacers were only counted once, no matter the absolute number of matches.

Networks were visualized in Gephi with the layout generated by a combination of the OpenOrd and Noverlap algorithms. For calculating taxonomic confinement of spacer-protospacer matches between plasmids, each pair of plasmids connected by at least one spacer-protospacer match was counted as one matching pair. Cross-targeting plasmids were included as two separate plasmid pairs. Confinement was calculated as the number of matches found exclusively within a specific taxonomic rank, such that each plasmid-plasmid pair was only counted once. For estimating confinement of random spacer-protospacer matching, the taxonomic annotations were permuted among the plasmid-plasmid pairs with observed spacer-protospacer matches. This was repeated 100 times and the median number of matches was used as an estimate of confinement for hypothetically random matches. For estimating targeting bias towards conjugative versus non-conjugative plasmids, each unique spacer was counted with a total weight of 1, split in proportion to its number of matches to conjugative and non-conjugative plasmids, respectively.
For example, a spacer matching four conjugative plasmids and one non-conjugative plasmid is counted as 0.8 for conjugative matches and 0.2 for non-conjugative matches. The spacer-protospacer matches identified for plasmid and associated host chromosome-derived CRISPR array contents are found in Supplementary Datasheet S2.
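The spacer weighting used for the targeting-bias estimate can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the authors' code: the function name, the input layout (one boolean per matched plasmid, True for conjugative) and the decision to skip spacers without matches are assumptions.

```python
from typing import Dict, List


def targeting_weights(matches: Dict[str, List[bool]]) -> Dict[str, float]:
    """Sum fractional spacer weights for conjugative and non-conjugative targets.

    `matches` maps a hypothetical spacer ID to one boolean per matched
    plasmid (True = conjugative, False = non-conjugative). Each spacer
    contributes a total weight of 1, split by its share of matches.
    """
    totals = {"conjugative": 0.0, "non_conjugative": 0.0}
    for targets in matches.values():
        if not targets:
            # Assumption: spacers without any plasmid match carry no weight.
            continue
        n_conjugative = sum(1 for is_conjugative in targets if is_conjugative)
        n_non_conjugative = len(targets) - n_conjugative
        totals["conjugative"] += n_conjugative / len(targets)
        totals["non_conjugative"] += n_non_conjugative / len(targets)
    return totals


# The worked example from the text: four conjugative and one non-conjugative match.
example = {"spacer_1": [True, True, True, True, False]}
print(targeting_weights(example))
```

For the example spacer this yields a weight of 0.8 for conjugative and 0.2 for non-conjugative targets, so a spacer with many matches influences the totals no more than a spacer with a single match.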
