Database acquisition and contig assembly
NCBI genomes were downloaded using NCBI Genome Downloading Scripts (https://github.com/kblin/ncbi-genome-download) on May 5, 2021, with the commands:
ncbi-genome-download --formats fasta bacteria
ncbi-genome-download --formats fasta archaea
Raw FASTQ files were downloaded from the EMBL-EBI repository (17) of metagenomic sequencing at ftp://ftp.ebi.ac.uk/vol1/ between January and February 2020. For each sample, the quality of the raw data was assessed with FastQC (56) using the command:
Low quality reads were trimmed with Sickle (57) using the command:
We then used Megahit (58) to assemble the trimmed data with:
We developed a Python library, Opfi (short for Operon Finder), to search genomic or metagenomic sequence data for putative CRISPR transposons. This library consists of two modules, Gene Finder and Operon Analyzer. The Gene Finder module enables the user to use BLAST to identify genomic neighborhoods that contain specific sets of genes, such as Cas9 or TnsA. It can also identify CRISPR repeats. The Operon Analyzer module further filters the output from Gene Finder by imposing additional user-defined constraints on the initial hits. For example, Operon Analyzer can be used to find all genomic regions that contain a transposase and at least two Cas genes but no Cas3.
We used Gene Finder to locate genomic regions of interest using the following logic. First, we located all regions containing at least one transposase gene. Within those regions, we next searched for Cas genes located no more than 25 kilobase pairs away from a transposase. Transposase-containing regions without at least one nearby Cas gene were discarded from further analysis. Finally, the remaining regions were further annotated for Tn7 accessory genes (TnsC-TnsE and TniQ), and CRISPR arrays.
We further processed and categorized the Gene Finder hits using Operon Analyzer. To identify Tn7-like CRISPR-transposons, we required each putative operon to contain TnsA, TnsB, TnsC, and at least one Cas gene from Cas5-13; the distance between TnsA, TnsB, and TnsC needed to be less than 500 bp; the Cas proteins need to be downstream of TnsA/B/C and the distance between any Cas protein and TnsB needed to be less than 15 kbp. We classify this dataset into putative Class 1 systems and Class 2 systems based on their Cas signature proteins. Class 1 systems were manually reviewed to confirm the loss of adaptation (Cas1, Cas2) and interference (Cas3) proteins.
To identify non-Tn7 CRISPR-transposons, we required each putative operon to contain a CRISPR array, a transposase, and at least one Cas gene from Cas5-13. We also excluded systems containing Tn7 proteins, Cas1, or Cas2. We then partitioned this dataset into putative Class 1 systems (defined as loci with any three of Cas5/6/7/8) or Class 2 systems (Cas9, Cas12, or Cas13). For Class 1 systems, we further excluded those containing Cas3 or Cas10. To eliminate systems with fragments of effector proteins or poor matches to unrelated proteins, we required that Cas9 have a size of 2-6 kbp, Cas12 a size of 3-6 kbp, and Cas13 a size of 2.5-6 kbp. We eliminated Class 2 systems that were nucleolytically active (see below), and finally clustered all systems using mmseqs easy-cluster with a minimum sequence ID of 0.95 (59) to simplify manual curation.
BLAST database construction
To find as many systems as possible, we assembled separate databases for Cas proteins, Tn7-family proteins, and non-Tn7 transposases. We also developed databases for common Tn7 attachment sites following a separate effort (21).
We downloaded all available bacterial and archaeal transposase sequences from UniRef50, excluding partial sequences and sequences annotated with the word “zinc” (which tended to be false positives) as well as Tn7-related proteins. All transposases associated with transposons listed in the Transposon Registry (60) were downloaded from NCBI. Finally, 100 transposases associated with each of the major families of insertion sequences were downloaded from NCBI, again excluding partial sequences, and using the ‘relevance’ sort parameter.
Amino acid sequences for Cas1–Cas13 and Tn7 family proteins (TnsA–TnsE, TniQ) were downloaded from UniRef50 (https://www.uniprot.org/uniref/). Additional Cas12 and Cas13 sequences, representing recently identified variants (e.g. Cas12k), were downloaded from the NCBI protein database (https://www.ncbi.nlm.nih.gov/protein/) and from primary literature sources (61–63).
To eliminate redundant sequences, each database was clustered using CD-HIT (http://weizhongli-lab.org/cd-hit/ (64) with a 50% sequence identity threshold and 80% alignment overlap. The clustered datasets were converted to the BLAST database format using makeblastdb (version 2.6.0 of NCBI BLAST+) with the following arguments:
-in <sequence fasta file>
-title <database name>
-out <database name>
The full-length sequences of GuaC (PF00478), RsmJ (PF04378), YciA (PF03061) were downloaded from (http://pfam.xfam.org/). The attachment site SRP-RNA gene (ffs) (RF00169) was downloaded from RFAM (https://rfam.xfam.org/).
To assign putative Cas5-Cas8 proteins to specific Type I CRISPR-Cas subtypes, we manually collected Cas proteins and their assignments from reviews by Koonin and colleagues (62, 65, 66). All Cas protein sequences were converted into BLAST databases using makeblastdb (version 2.6.0) with default parameters.
De-duplication of putative operons
Approximately 57% of the metagenomic systems that passed our initial filter were nearly identical at the nucleotide sequence level. However, exact nucleotide comparisons were too slow to de-duplicate this large dataset. Instead, we considered two systems to be identical if they met the following properties: (1) they had the same protein-coding genes and CRISPR arrays in the same order; (2) the genes had the same relative distances to each other; and (3) the translated sequences of all proteins were identical. This de-duplication was applied to all systems before the downstream analysis.
Self-targeting spacer identification
Spacer sequences that were identified with PILER-CR were pairwise aligned with the contig sequence that contained them, using the Smith-Waterman local alignment function from the parasail library (67), with gap open and gap extension penalties of 8, and using the NUC44 substitution matrix. Spacers with at least 80% homology to a location in the contig were classified as self-targeting.
For Type V systems, we augmented the CRISPR array search with minCED 0.4.2 (68) after noticing transposons that were otherwise intact but seemingly lacked CRISPR arrays. The region between Cas12k and 200 bp after the end of the nearest CRISPR array was used to search for spacers (both atypical and canonical). Targets were searched for in the 500 bp region immediately downstream of the spacer search region, using the method described in the previous paragraph. For Type V systems with multiple Cas12k genes, each spacer region was aligned to each target region in order to discover systems where multiple transposons had inserted at the same attachment site.
Alignments of protein sequences were constructed with MAFFT, version v7.310 (69). Phylogenetic analysis was performed on the aligned sequences using the IQ-TREE, version 1.6.12 (70), with automatic model selection. Models used were as follows: Figure 3B: JTT+F+R3, Figure 4B cas6: PMB+G4, Figure 4B cas7: PMB+G4, Figure 4C: PMB+G4, Supplemental Figure 1 tnsB: LG+R5, Supplemental Figure 1 tnsC: LG+G4. Trees were visualized using the Figtree program version 1.4.4.
Classification of nuclease-dead systems
To identify catalytically inactive Class 2 nucleases, we aligned each nuclease to a reference protein with MAFFT (version v7.310, with the FFT-NS-2 strategy for Cas9 and Cas12, and FFT-NS-i for Cas13). Cas9 homologs were aligned to SpCas9 (UniProtKB Q99ZW2.1, residues D10 and H840), Cas12 homologs to AsCas12a (UniProtKB U2UMQ6, residues D908 and E993), and Cas13 homologs to LbuCas13a (UniProtKB C7NBY4.1, residues R472, H477, R1048, and H1053). Mutations of D/E to anything other than D/E, or H/R to anything other than H/R/K were considered nuclease dead. As a control, we employed this technique on all 279 Cas12k proteins from NCBI as well as LbCas12a and FnCas12a (two known nuclease-active Cas12a proteins) and found that we correctly categorized them all.
Identification of inverted repeats and target site duplications
To identify inverted repeats, we used Generic Repeat Finder (commit hash: 35b1c4d6b3f6182df02315b98851cd2a30bd6201) (71) with default parameters except as follows:
--min_space <operon length>
--max_space <buffered operon length>
where <operon length> is the length of the putative operon and <buffered operon length> is the length of the putative operon, plus up to 1000 bp to allow a 500 bp buffer on either side of the operon. This detected inverted repeats that were at least 15 bp long. In cases where one inverted repeat fell within the bounds of the putative operon, it was discarded.
CRISPR-associated transposons (CASTs) co-opt Cas genes for RNA-guided transposition. CASTs are exceedingly rare in genomic databases; recent surveys have reported Tn7-like transposons that co-opt Type I-F, I-B, and V-K CRISPR effectors. Here, we expand the diversity of reported CAST systems via a bioinformatic search of metagenomic databases. We discover new architectures for all known CASTs, including novel arrangements of the Cascade effectors, new self-targeting modalities, and minimal V-K systems. We also describe new families of CASTs that have co-opted the Type I-C and Type IV CRISPR-Cas systems. Our search for non-Tn7 CASTs identifies putative candidates that co-opt Cas12a for horizontal gene transfer. These new systems shed light on how CRISPR systems have co-evolved with transposases and expand the programmable gene editing toolkit.