MATERIALS AND METHODS

Prokaryotic genome database and open reading frame annotation. Archaeal and bacterial complete and draft genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/) in March 2016. For incompletely annotated genomes (coding density of less than 0.6 coding DNA sequences [CDS] per kbp), the existing annotation was discarded and replaced with a Meta-GeneMark 1 (44) annotation with the standard model MetaGeneMark_v1.mod (heuristic model for genetic code 11 and GC 30). Altogether, the database includes 4,961 completely sequenced and assembled genomes and 43,599 partially sequenced genomes. Profiles for RT families (cd01651, pfam00078, and COG3344) that are included in the NCBI CDD database (45) were used as queries for a PSI-BLAST search (e-value = 1e−4) to identify RT homologs. The RT genes were used as a seed to identify defense islands as described previously (46). All ORFs within loci were annotated using RPS-BLAST searches with 30,953 profiles (COG, pfam, cd) from the NCBI CDD database and 217 custom Cas protein profiles (7). The CRISPR-Cas system (sub)type identification for all loci was performed using previously described procedures (7).

Sequence clustering, alignment, and phylogenetic analyses. To construct a nonredundant, representative RT sequence set, sequences were clustered using the NCBI BLASTCLUST program (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html) with a sequence identity threshold of 90% and length coverage threshold of 0.9. Short fragments or disrupted sequences were discarded. Multiple alignments of protein sequences were constructed using MUSCLE (47). Sites with gap character fraction values of >0.5 and homogeneity values of <0.1 were removed from the alignment. Phylogenetic analysis was performed using the FastTree program (48), with the WAG evolutionary model and the discrete gamma model with 20 rate categories.
The same program was used to compute SH (Shimodaira-Hasegawa)-like node support values.

High-throughput sequencing of CRISPR spacers. This method is a slight modification of a previously published protocol for CRISPR spacer sequencing (16); we have provided the protocol in its entirety for completeness, retaining relevant text from the original protocol. CRISPR spacers were amplified by PCR from 1 to 2 ng genomic DNA per μl PCR mix using primers anchored in the various CRISPR repeat sequences. The primers used for type III-D CRISPR arrays were as follows: SS-4F, CGACGCTCTTCCGATCTNNNNNCTTGCGGGGAATTGGTAGGG; SS-4R, ACTGACGCTAGTGCATCAAATTCCCCGCAAGGGGACGG; SS-5F, CGACGCTCTTCCGATCTNNNNNCCAATTCCCCGCAAGGGGAC; SS-5R, ACTGACGCTAGTGCATCATGCGGGGAATTGGTAGGGTC; SS-8F, CGACGCTCTTCCGATCTNNNNNCCAATTCCCCGTCAGGGGAC; and SS-8R, ACTGACGCTAGTGCATCAGACGGGGAATTGGTAGGGTT. The primers used for type III-B CRISPR arrays were as follows: SS-19F, CGACGCTCTTCCGATCTNNNNNTAACTTTCARAGAAGTYTAA; SS-19R, ACTGACGCTAGTGCATCATGAAAGTTAAACGTATGGAA; SS-20F, CGACGCTCTTCCGATCTNNNNNCGACTTTCAAAGAAGTCTCA; SS-20R, ACTGACGCTAGTGCATCATGAAAGTCGAACGTATGGCA; SS-51F, CGACGCTCTTCCGATCTNNNNNTTCTYTGAAAGTTAAACGTA; and SS-51R, ACTGACGCTAGTGCATCATTTAACTTTCARAGAAGTTT. F and R denote forward and reverse primers. Primers with the same numeric code were used together. Letters correspond to IUPAC nucleic acid notation. CRISPR repeat matching regions are underlined.

Sequencing adaptors were then attached in a second round of PCR with 0.01 volumes of the previous reaction mixture as the template, using AF-SS-44:55 (CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCACTGACGCTAGTGCATCA) and AF-KLA-67:74 (AATGATACGGCGACCACCGAGATCTACACNNNNNNNNACACTCTTTCCCTACACGACGCTCTTCCGATCT), where the (N)8 barcodes correspond to Illumina TruSeq HT indexes D701 to D712 (reverse complemented) and D501 to D508, respectively. Template-matching regions in primers are underlined.
Phusion High-Fidelity PCR master mix with HF buffer (Fisher Scientific) was used for all reactions. Cycling conditions for round 1 were as follows: 98°C for 1 min; 2 cycles of 98°C for 10 s (60°C for 20 s for primer pairs AF-SS-4, AF-SS-5, and AF-SS-8; 44°C for 20 s for primer pair AF-SS-19; 50°C for 20 s for primer pair AF-SS-20) and 72°C for 30 s; 18 cycles of 98°C for 15 s (70°C for 15 s for primer pairs AF-SS-4, AF-SS-5, and AF-SS-8; 63°C for 15 s for primer pair AF-SS-19; 66°C for 15 s for primer pair AF-SS-20) and 72°C for 30 s; and 72°C for 9 min. Cycling conditions for round 2 were as follows: 98°C for 1 min; 2 cycles of 98°C for 10 s, 54°C for 20 s, and 72°C for 30 s; 4 cycles of 98°C for 15 s, 70°C for 15 s, and 72°C for 30 s; and 72°C for 9 min. The dominant amplicons (250 to 275 bp) containing a mixture of spacer sequences were excised following agarose electrophoresis (3%, 4.2 V/cm, 2 h) of round 2 PCR products. Libraries were quantified by Qubit and sequenced with Illumina MiSeq v3 kits (150 cycles for read 1; 8 cycles for index 1; 8 cycles for index 2).

Spacers were trimmed from reads using a custom python script and were considered identical if they differed by only 1 nucleotide. Protospacers were mapped using Bowtie 2.0 (--very-sensitive-local alignments). These methods preserve strand information.

Preparation of cellular and extracellular fractions from Spirulina samples. Prepackaged Spirulina samples were purchased from various vendors as described for Fig. 1. A similar probiotic blend containing a variety of plant matter but lacking A. platensis was also tested as a negative control for CRISPR spacer amplification.

Spirulina metagenomic samples were collected in 50-ml polypropylene centrifugal tubes (Corning) from open-air raceway ponds operated by Earthrise LLC, Calipatria, CA, and were transported on ice to our laboratory for processing the same day (without freezing).
Spirulina at these farms is grown in an interconnected network of open-air ponds, containing approximately 1 million liters of water seeded with inorganic nutrients and injected with carbon dioxide to support the high growth rate of the enriched cyanobacterial culture. The culture is kept in continuous circulation between ponds using paddle wheels and is maintained continuously from April through October. Growth of unwanted “weed algae” is prevented by raising the pH of the culture to leverage the rare ability of A. platensis to grow in alkaline environments.

Cyanobacteria were pelleted from ~120 ml of pond water by centrifugation at 4,000 × g for 1 h (Beckman Allegra X-15R) (4°C). The supernatant was then divided into four 30-ml polypropylene high-speed centrifugal tubes (Nalgene Oak Ridge) and subjected to a preclearing spin at 12,000 × g for 1 h (Avanti J-25I centrifuge with JA-17 rotor; Beckman Coulter, Inc.) (4°C). The cleared sample was further subdivided into 15-ml open-top Polyallomer tubes (Seton), and extracellular material was collected by ultracentrifugation at 200,000 × g for 16 h (Optima XE-90 Ultracentrifuge with SW41 Ti rotor; Beckman Coulter, Inc.) (4°C) with and without prior filtration through a 0.45-µm-pore-size Polysulfone membrane (Pall Corp.).

Nucleic acid extraction from Spirulina. The cyanobacteria marketed as Spirulina are typically subjected to cold compression into pellets sold as food supplements, or flash-dried and sold as a powder intended to be mixed into kitchen recipes. Genomic DNA would be expected to remain intact through the packaging process, which eschews heat and mechanical granulation. Genomic DNA was extracted from Spirulina grocery store samples and metagenomic cellular fractions as previously described (49).
RNA was extracted from Spirulina grocery store samples and metagenomic cellular fractions using a combined TRIzol/RNeasy method (16).

DNA extractions from metagenomic extracellular fractions were performed using a modified SDS/protease K method. Briefly, pellets were resuspended in 100 μl of lysis buffer (10 mM Tris, 20 mM EDTA, 50 μg/ml protease K, 0.5% SDS) and incubated at 56°C for 1 h. DNA was precipitated with the addition of isopropanol at up to 50% of the total volume. DNA pellets were washed with 70% ethanol and resuspended in 10 mM Tris (Qiagen) (pH 8.5). Metagenomic extracellular DNA samples were prepared by two methods: with and without RQ1 RNase-Free DNase (Promega) pretreatment of the ultracentrifuged pellet (per the manufacturer’s instructions). The data obtained through the two methods were similar.

RNA samples from metagenomic extracellular fractions were prepared by two methods: with and without RQ1 DNase pretreatment of the ultracentrifuged pellet. RNA was extracted from the pellets using TRIzol (Life Technologies, Inc.) per the manufacturer’s instructions. Purified RNA was treated with RQ1 RNase-Free DNase, which was subsequently removed by extraction performed with a 1:1 mixture of acidified phenol (Ambion) and chloroform (Fisher Scientific), followed by an extraction performed with chloroform and precipitation of RNA from the aqueous phase through the addition of ethanol at up to 70% of the total volume. RNA pellets were washed with 70% ethanol and resuspended in RNase-free water (Qiagen).

DNA sequencing of Spirulina samples. Genomic DNA extracted from Spirulina grocery store samples and metagenomic cellular fractions was prepared for high-throughput sequencing using a Nextera DNA Library Prep kit (Illumina) according to the manufacturer’s instructions.

RNA sequencing of Spirulina samples. Three different methods were employed for RNA sequencing.
The first was described previously (16); this method provides unbiased sequencing, especially of shorter RNA fragments that may be missed by other protocols. The second was carried out according to the instructions provided with a SMARTer Stranded RNA-Seq kit (Clontech); this method was especially useful in generating libraries from low-concentration RNA samples (e.g., Spirulina metagenome extracellular RNA fractions). The third method was developed previously in our laboratory for detecting low-abundance RNA species. Up to 100 ng of RNA was diluted to 6 μl in RNase-free water and incubated at 90°C for 1 min and then at 70°C for 4 min and was transferred to ice for 2 min. A 13.5-μl volume of reverse transcription master mix (4 μl 5x First Strand buffer, 2 μl 0.1 M dithiothreitol, 2 μl 10 mM deoxynucleoside triphosphate [dNTP] mix, 1 μl RNase-Out, 1.5 μl Superscript II reverse transcriptase, 3 μl RNase-free water [all components from Life Technologies, Inc.]) was added to each sample, and the mixture was incubated at 42°C for 30 min. A 0.5-μl volume of 100 ng/μl exonuclease-resistant random hexamers (Fisher Scientific) was then added, and the reaction mixtures were incubated at 25°C for 2 min and at 42°C for an additional 60 min. The reaction was terminated by heating to 95°C for 5 min. Subsequently, 127.5 μl of multiple-displacement amplification (MDA) master mix (6 μl of 25 mM dNTP mix [Roche], 15 μl of 10x Phi29 DNA polymerase reaction buffer [NEB], 7.5 μl of 1 μg/μl exonuclease-resistant random hexamers, 7.5 μl of 0.1 M dithiothreitol, 93 μl of RNase-free water) was added to each reaction, and the mixture was split into 3 tubes at 47.5 μl each and incubated at 95°C for 5 min and then at 4°C during the addition of 2.5 μl of Phi29 DNA polymerase (NEB) to each tube. The reaction mixtures were incubated at 30°C for 6 to 8 h.
The MDA product was purified using Zymo DNA clean-and-concentrator columns and prepared for sequencing using a Nextera DNA Library Prep kit.

RNA from cellular fractions of the Spirulina metagenome was sequenced using all three methods. RNA samples from extracellular fractions (with and without filtration using 0.45-µm-pore-size filters and with and without DNase treatment) were sequenced using only the SMARTer Stranded and MDA methods as there was not enough input material for the small-RNA sequencing method described in reference 16. The samples processed via the SMARTer Stranded protocol were prepared with and without the built-in RNA fragmentation step in an effort to capture shorter RNA fragments.

Computational analyses of Spirulina datasets. CRISPR spacers were trimmed from high-throughput sequencing reads and clustered to account for sequencing errors, with 1 allowed mismatch on the Illumina MiSeq platform and 2 allowed mismatches on the Illumina HiSeq platform. All searches for sources of spacer sequences were carried out using NCBI blast package 2.2.25, with a culling limit of 1 and an empirically determined e-value cutoff for each dataset to minimize false negatives as reported in the text. Blast databases were formatted using formatdb. Preformatted nucleic acid datasets (NT; nucleotide collection; posted 9 February 2013) and protein datasets (NR; all nonredundant GenBank CDS translations plus PDB plus SwissProt plus PIR plus PRF, excluding environmental samples from whole-genome sequencing [WGS] projects; posted 13 March 2015) were obtained from NCBI. HMM searches for RdRP-related ORFs were carried out using HMMER 3.1 (50), and phage-like contigs were identified using the PHASTER online interface (51). Contigs were assembled using both Velvet 1.1.07 (velveth run with a maximum Kmer length of 31 and velvetg with a minimum contig size of 200) (52) and SPAdes 3.7.1 (53) in metagenomic mode.
Alignments of metagenomic sequencing reads and CRISPR spacers to the reference genome(s) were carried out using bowtie 2.2.6 with the --very-sensitive-local option. The bedtools 2.25.0 merge program was used to collapse redundant alignments on the basis of the location. Custom python scripts were written for “greedy” spacer assembly, clustering of spacer sequences, translation of putative ORFs in metagenomic contigs, and curation of BLAST result files.

Metagenomic protospacer analysis. For broader metagenomic searches, the CRISPRfinder (54) and PILER-CR (55) programs were used with default parameters to identify CRISPR arrays found in Cas7f and TnsA/TnsD loci. The MEGABLAST program (56) (word size, 18; otherwise, default parameters) was used to search for protospacers in the virus subset of NR database and the prokaryotic genome database. We considered only those matches with 95% identity and 95% length coverage or better with respect to the NR database.

Accession number(s). Sequencing data have been deposited at SRA (SRP107814).
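
The spacer clustering step described under "Computational analyses of Spirulina datasets" (1 allowed mismatch for MiSeq reads, 2 for HiSeq reads) can be illustrated with a minimal Python sketch. This is not the custom script used in the study, which is not reproduced here. The function names `within_mismatches` and `cluster_spacers` are hypothetical, and the sketch assumes that only substitutions between equal-length spacers count as mismatches and that clusters are seeded greedily from the most abundant sequences.

```python
from collections import Counter


def within_mismatches(a: str, b: str, max_mismatches: int) -> bool:
    """True if a and b have equal length and differ at <= max_mismatches positions.

    Assumption for this sketch: insertions and deletions are not considered.
    """
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def cluster_spacers(spacers, max_mismatches=1):
    """Greedily cluster trimmed spacer sequences.

    Unique spacers are visited from most to least abundant. Each one joins the
    first existing cluster whose representative is within max_mismatches, or
    founds a new cluster. Returns {representative: total read count}.
    """
    counts = Counter(spacers)
    clusters = {}
    # Sort by descending abundance, then by sequence, so the result is deterministic.
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for rep in clusters:
            if within_mismatches(seq, rep, max_mismatches):
                clusters[rep] += n
                break
        else:
            clusters[seq] = n
    return clusters


# Toy input: one abundant spacer plus a likely sequencing-error variant.
reads = ["ACGTACGT"] * 5 + ["ACGTACGA"] + ["TTTTCCCC"] * 2 + ["TTTTCCGG"]
miseq_like = cluster_spacers(reads, max_mismatches=1)
hiseq_like = cluster_spacers(reads, max_mismatches=2)
```

With the toy reads above, the 1-mismatch setting keeps `TTTTCCGG` as its own cluster (it differs from `TTTTCCCC` at two positions), whereas the 2-mismatch setting merges it. Seeding from the most abundant sequence is a design choice of this sketch: it treats rare near-identical variants as errors of a common spacer rather than the reverse.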
Article Title: On the Origin of Reverse Transcriptase-Using CRISPR-Cas Systems and Their Hyperdiverse, Enigmatic Spacer Repertoires
Cas1 integrase is the key enzyme of the clustered regularly interspaced short palindromic repeat (CRISPR)-Cas adaptation module that mediates acquisition of spacers derived from foreign DNA by CRISPR arrays. In diverse bacteria, the cas1 gene is fused (or adjacent) to a gene encoding a reverse transcriptase (RT) related to group II intron RTs. An RT-Cas1 fusion protein has been recently shown to enable acquisition of CRISPR spacers from RNA. Phylogenetic analysis of the CRISPR-associated RTs demonstrates monophyly of the RT-Cas1 fusion, and coevolution of the RT and Cas1 domains. Nearly all such RTs are present within type III CRISPR-Cas loci, but their phylogeny does not parallel the CRISPR-Cas type classification, indicating that RT-Cas1 is an autonomous functional module that is disseminated by horizontal gene transfer and can function with diverse type III systems. To compare the sequence pools sampled by RT-Cas1-associated and RT-lacking CRISPR-Cas systems, we obtained samples of a commercially grown cyanobacterium, Arthrospira platensis. Sequencing of the CRISPR arrays uncovered a highly diverse population of spacers. Spacer diversity was particularly striking for the RT-Cas1-containing type III-B system, where no saturation was evident even with millions of sequences analyzed. In contrast, analysis of the RT-lacking type III-D system yielded a highly diverse pool but reached a point where fewer novel spacers were recovered as sequencing depth was increased. Matches could be identified for a small fraction of the non-RT-Cas1-associated spacers, and for only a single RT-Cas1-associated spacer. Thus, the principal source(s) of the spacers, particularly the hypervariable spacer repertoire of the RT-associated arrays, remains unknown.
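
The notion of saturation used above (whether deeper sequencing keeps yielding novel spacers) can be made concrete with a rarefaction-style count of unique spacers at increasing read depths. The sketch below is only an illustration under stated assumptions and is not the analysis performed in the study. The function name `rarefaction_curve`, its parameters, and the single-shuffle subsampling scheme are all hypothetical choices.

```python
import random


def rarefaction_curve(spacers, depths, seed=0):
    """Count unique spacers seen after each requested number of reads.

    Assumption for this sketch: reads are shuffled once with a fixed seed and
    then consumed in order, so each depth is a prefix of the same random order.
    Depths that are non-positive or exceed the number of reads are ignored.
    Returns a list of (depth, unique_spacer_count) tuples in increasing depth.
    """
    order = list(spacers)
    random.Random(seed).shuffle(order)
    targets = sorted({d for d in depths if 0 < d <= len(order)})
    seen = set()
    curve = []
    next_target = 0
    for i, spacer in enumerate(order, start=1):
        seen.add(spacer)
        if next_target < len(targets) and i == targets[next_target]:
            curve.append((i, len(seen)))
            next_target += 1
    return curve


# A pool in which every read is a new spacer never flattens ...
unsaturated = rarefaction_curve([str(i) for i in range(100)], [10, 50, 100])
# ... whereas a pool of two spacers sampled many times plateaus at 2.
saturated = rarefaction_curve(["a", "b"] * 50, [100])
```

A curve whose unique count keeps rising in step with depth corresponds to the "no saturation" behavior described for the type III-B arrays, and a curve that flattens corresponds to the diminishing recovery of novel spacers described for the type III-D arrays.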