MATERIALS AND METHODSDetection, clustering and classification of type IV modulesBacterial and archaeal complete and draft genomes were obtained from genebank and scanned with the TIGR03115 Csf2 model (11) using HMMER3 (12). Protein sequences from two genes upstream and downstream of the detected csf2 gene along with the Csf2 sequence itself were pooled and subjected to an all-against-all sequence comparison using FASTA (13). A neighbour-joining tree was constructed using distances derived from the aggregate similarities between each module pair using a previously described method (14). The tree was used to pick diverse representative type IV systems, which were then annotated manually using PSI-BLAST (15) searches. Hits on type III CRISPR–Cas systems were purged from the Csf2 tree. Following manual annotation, the protein sequences from the refined representative modules were pooled for another all-against-all sequence comparison. Protein sequences were clustered using the method previously described (16) and another aggregate module similarity tree was built for the refined representative type IV module set. The tree was overlaid with gene maps of the type IV modules marked with the obtained protein clustering information (Supplementary Figure S2). This was used for devising the subtypes/variants (Figure (Figure1),1), which were then searched for in all downloaded genomes using the HMMs corresponding to all defined protein clusters. Subsequently, all resulting final type IV and non-type IV system proteins were subjected to another all-against-all sequence comparison using FASTA in order to build the final aggregate similarity subtype tree (Supplementary Figure S3) in addition to building the final Csf2 maximum-likelihood tree (Supplementary Figure S1).Open in a separate windowFigure 1.A proposed classification of type IV CRISPR–Cas systems based on their genome loci architectures and evolutionary relationships. Phylogenetic tree depicting the typical operon organization of the identified subtype IV loci. A selected representative locus is shown for each clade wherein genes are colour-coded and labelled according to the protein families they encode, using both the cas (upper) and csf (lower) nomenclatures. Genes or CRISPR arrays that are not invariably present are represented with dashed lines on the gene maps. The number of loci identified for each clade is given on the right. Hypothesized gene gain/loss events over the course of evolution are shown on the left.Spacer-protospacer match analysisCRISPR arrays were detected with CRISPRCasFinder (4.2.17, (17)) and matched to a Type IV module if any predicted operon was within a 10 kb radius (distance to first gene in the operon, this cut-off was based on Supplementary Figure S13). Non-type IV CRISPR–Cas systems were also detected with CRISPRCasFinder in the same genome assemblies where type IV systems were found, and typing was manually corrected when necessary. Arrays were matched to an operon if it was within 10 kb (distance to first gene in the operon, see Supplementary Figure S13). Phage genomes were obtained from the April 2019 version of the millardlab.org phage database (http://millardlab.org/bioinformatics/bacteriophage-genomes/), and plasmid sequences were obtained from the PLSDB database (2019_03_05, (18)). In order to rule out false positive matches to conserved spacers within undetected arrays on plasmids, the putative arrays in the plasmid database were masked when detected via CRISPRCasFinder (17,19), CRISPRdetect (2.2) (17,19) and CRT (1.2) (20). Furthermore, all unique repeats pertaining to arrays from the initial CRISPRCasFinder search were aligned (blastn -task blastn-short, (21)) against the masked PLSDB database, and putative arrays were defined if two or more matches (E-value < 0.1) were found within 100 bp, and these regions were masked as well. Spacers from the initial CRISPRCasFinder search yielded a total of 6021 type IV and 11 230 non-type IV spacers. These were collapsed into a unique spacer set using cd-hit-est (22) in order to avoid overrepresented (redundant) spacers from sequencing biases. The unique spacer set, consisting of 1051 type IV and 6778 non-type IV spacers (Supplementary Table S4), was aligned against the masked plasmid and phage databases using FASTA (13); a spacer match was considered significant when the E-value was <0.05.Targeted gene enrichment analysisEnrichment in spacer targeting of certain functions was done by first predicting ORFs in all plasmid and phage genomes using Prodigal (23), and then clustering genes using the protein clustering algorithm previously described (16). The observed number of matches to each gene cluster was compared to 105 simulations of random draws from a binomial distribution with size n equal to the number of genes in the gene cluster and the probability, where the relative gene length was defined by dividing the length of each gene by the median gene length, and then finding the average length for each gene cluster. The above simulation only counted each spacer once, however, spacers usually match multiple genes in the same gene cluster. Therefore, each simulated match was multiplied by a random draw of the observed number of genes matched by a spacer matching that gene cluster.PAM identificationFrom the 1016 CRISPR arrays (481 type IV and 535 non-type IV) detected in type IV containing complete and draft genomes, the consensus repeat for each array was aligned against corresponding consensus repeats for all other arrays using needleall (24). Consensus repeats that differed from each other by more than two mismatches were assigned to separate repeat clusters, resulting in 171 repeat clusters in total. The previous unique spacer matching output from FASTA was surveyed for protospacers pertaining to each of the 171 repeat clusters separately. Spacers matching several phages and plasmids in the database were only counted once to circumvent sequencing bias in the database. Logo plots were drawn from the ten nucleotides immediately flanking each side of each unique protospacer. Protospacers with alignment lengths smaller than the total spacer length had their coordinates adjusted so all flanks within a repeat cluster were properly aligned.CRISPR–Cas subtype co-occurrence analysisCo-occurrence between type IV and non-type IV subtypes was analysed with phylogenetic logistic regression (phyloglm, maximum penalized likelihood estimation (13,25), with the non-type IV occurrence as the response and the type IV occurrence as the predictor. Besides the genomes with type IV systems, we supplemented the analysis with all complete genomes with at least one non-type IV operon (as defined by CRISPRCasFinder). The phylogenetic tree was based on 16S rRNA gene sequences detected by Barrnap (26, 27), aligned with mafft 7.307 (28), and tree made with FastTree2 (26), and was rooted by the archaeal clade. Edges of length zero were rescaled to the shortest non-zero branch length. Furthermore, outlier branches were pruned by removing tips for which the maximum phylogenetic variance-covariance was >2. Only non-type IV subtypes found in at least 100 genomes were included, and only the four most prevalent type IV subtypes were included. P-values were fdr-adjusted with the Benjamini–Hochberg method (29).CRISPR repeat heatmapAll unique CRISPR consensus repeats were aligned with the pairwise2 module from Biopython 1.73 (30). Repeats were globally aligned with globalxs with both open gap and extend gap penalties of 3, and no end gap penalties. Alignments were done on both strands, and the highest identity was used.Leader sequence analysisMultiple sequence alignments were performed with the upstream regions of a series of representatives of co-occurring IV-A3 and I-E CRISPR arrays using MUSCLE (31). Alignments were analysed and visually displayed using Jalview (30,32). The corresponding leader sequence conservation profiles were generated using WebLogo 3 (33).Plasmid mobility predictionThe mobility of all plasmids (conjugative, mobilizable or non-mobilizable) in PLSDB was predicted with mobtyper (34) with an E-value cut-off of 1e−10. For calculating whether certain mobility types were enriched in targeted plasmids, the number of matches were scaled such that the sum for each spacer was 1, which ensured that each spacer only counted once, no matter how many matches it had.Plasmid/prophage predictionTo predict whether the CRISPR–Cas operons were located on chromosomes or MGEs, we used an iterative heuristic; first, contigs from complete genomes were annotated as plasmids or chromosomes as described in the NCBI name. Second, for draft genomes PlasFlow (35) was used to detect contigs that were putatively part of plasmids. Third, VirSorter (35, 36) was used to predict the presence of prophages, and all operons within a category 1, 2, 4 or 5 region were classified as putative prophages.Protein structure predictionProtein homology models were generated with the Phyre2 protein structure prediction server (intensive mode) (37). Superimposition of protein structures were generated by the PyMol molecular visualization software (PyMOL Molecular Graphics System (C) Schrödinger, LLC).
Article TitleType IV CRISPR–Cas systems are highly diverse and involved in competition between plasmids
CRISPR–Cas systems provide prokaryotes with adaptive immune functions against viruses and other genetic parasites. In contrast to all other types of CRISPR–Cas systems, type IV has remained largely overlooked. Here, we describe a previously uncharted diversity of type IV gene cassettes, primarily encoded by plasmid-like elements from diverse prokaryotic taxa. Remarkably, via a comprehensive analysis of their CRISPR spacer content, these systems were found to exhibit a strong bias towards the targeting of other plasmids. Our data indicate that the functions of type IV systems have diverged from those of other host-related CRISPR–Cas immune systems to adopt a role in mediating conflicts between plasmids. Furthermore, we find evidence for cross-talk between certain type IV and type I CRISPR–Cas systems that co-exist intracellularly, thus providing a simple answer to the enigmatic absence of type IV adaptation modules. Collectively, our results lead to the expansion and reclassification of type IV systems and provide novel insights into the biological function and evolution of these elusive systems.