Materials and Methods

Metapopulation Structure of CRISPR-Cas Immunity in Pseudomonas aeruginosa and Its Viruses

Host data set selection. The set of P. aeruginosa strains analyzed in this paper includes data from several sources. Reads associated with 458 P. aeruginosa strains cultured from patient samples collected from the Copenhagen Cystic Fibrosis Center at the University Hospital, Rigshospitalet, Denmark (24), were retrieved from the NCBI Sequence Read Archive (accession no. ERP004853). Assembled genomes of 24 P. aeruginosa strains described in reference 43 were kindly provided by the authors (GenBank accession no. AWYJ00000000 to AWZG00000000). Assembled genomes of 388 strains described in reference 44 were obtained from GenBank (BioProject accession no. PRJNA264310). All other complete and draft-stage P. aeruginosa genomes were retrieved from the NCBI Nucleotide database in September 2014 (310 genomes; accession numbers in Table S1). CRISPR arrays from reference 19 were downloaded from NCBI (45 sequences). Three additional sets of CRISPR arrays were obtained from metagenomic sequence data from three CF sputum samples kindly provided by Katrine Whiteson and Yan Wei Lim. Metadata including isolation location, sampling date, environment, and epidemic strain status were collected where possible (Table S1).

Quality filtering and genome assembly. For all samples with sequencing reads available, reads were trimmed and quality filtered using Prinseq 0.20.4 (45). Reads were trimmed from both ends using a 5-nt sliding window with a minimum quality score of 30. Reads were retained if they had a mean quality score of at least 30 and <1% ambiguous bases. The minimum read length was set to two-thirds the anticipated read length, or 66 nt.
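As an illustration of the filtering criteria above, the following Python sketch applies the same thresholds to a single read. It is not Prinseq's implementation: the function names, the one-base window step, and the assumption that base qualities are already decoded to integer Phred scores are illustrative choices made here.

```python
from statistics import mean

MIN_QUAL = 30          # threshold for both the trimming window and the read mean
WINDOW = 5             # sliding-window width in nt
MAX_N_FRACTION = 0.01  # reads must contain <1% ambiguous bases
MIN_LENGTH = 66        # two-thirds of the anticipated read length


def trim_ends(seq, quals, window=WINDOW, min_qual=MIN_QUAL):
    """Trim both ends until the terminal window has mean quality >= min_qual.

    Assumption: the window advances one base at a time.
    """
    start, end = 0, len(seq)
    # Move inward from the 5' end while the leading window is below threshold.
    while end - start >= window and mean(quals[start:start + window]) < min_qual:
        start += 1
    # Move inward from the 3' end while the trailing window is below threshold.
    while end - start >= window and mean(quals[end - window:end]) < min_qual:
        end -= 1
    return seq[start:end], quals[start:end]


def keep_read(seq, quals):
    """Return True if the trimmed read passes the length, quality, and N filters."""
    trimmed_seq, trimmed_quals = trim_ends(seq, quals)
    if len(trimmed_seq) < MIN_LENGTH:
        return False
    if mean(trimmed_quals) < MIN_QUAL:
        return False
    n_fraction = trimmed_seq.upper().count("N") / len(trimmed_seq)
    return n_fraction < MAX_N_FRACTION
```

In this sketch a 100-nt read with uniformly high qualities is retained, whereas a 60-nt read or a read with 2% ambiguous bases is discarded.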
Draft assemblies were generated with MIRA 4.0 (46) using genome, de novo, and accurate parameters.

CRISPR identification and spacer extraction. CRISPR arrays were identified via BLASTn of known P. aeruginosa CRISPR repeats (19). Parameters were adjusted for the short search sequence and to maximize hits covering the entire repeat length as follows: “-word_size 7 -gapopen 3 -gapextend 2 -reward 1 -penalty -1.” The minimum percent identity was set to 80 to allow for degenerate repeat sequences. Hits <24 bp were filtered from the results. Sequences with a repeat of the same type both up- and downstream in the same orientation and <40 bp away from other hits were considered spacers and extracted. A spacer rarefaction curve was computed in QIIME (47).

CRISPR array ranges were declared as all consecutive repeats and spacers in the same orientation <500 bp away from one another. Groups of repeats and spacers on different contigs, on the same contig/genome in different orientations, or on the same contig/genome but separated by >500 bp were considered separate arrays.

For samples with reads available, CRISPR arrays were further verified for accuracy and completeness using a technique called nonassembled repeat boundary linkage, or NARBL (http://github.com/englandwe/NARBL). To establish spacer order, repeats were identified on sequence reads, and 12-nucleotide “chunks” of DNA flanking each repeat were identified using fuzznuc (48); up to 8 mismatches to the repeat sequence were permitted. When the repeat was matched in both orientations due to palindromic repeats, the match with fewer mismatches was kept. Chunks that were a perfect match to the repeat sequence (i.e., from adjacent or partial repeats) were also discarded.
Finally, singleton chunks that perfectly overlapped nonsingleton chunks by at least 8 bp were removed, to account for rare chunks generated by sequencing error.

Occurrences of two or more chunks on the same read were recorded as links, which represent either two ends of the same spacer or opposite ends of two spacers linked across a repeat. The first type was used to identify spacer sequences; the second, to order spacers. Linkage networks were analyzed using Cytoscape (49). Based on average repeat and spacer lengths of species with previously sequenced CRISPR arrays, links spanning a single repeat-spacer unit were considered short links, spanning only a single spacer or pair of adjacent spacers; longer links were considered to span multiple spacers and were not counted when determining coverage of links. All spacer sequences used in this study can be found in Table S1.

Multilocus sequence typing. An established panel of seven markers (25) was used for MLST analysis. MLST loci were identified by BLASTn (50) of a representative known allele obtained from the Pseudomonas aeruginosa PubMLST website (http://pubmlst.org/paeruginosa/) (51) against genomes or contigs. The best BLAST hit for each MLST locus was then compared by BLAST against a database of all known alleles for that locus, also from the PubMLST website. Exact matches to a known allele were assigned that allele’s ID number; hits with lower identity or incomplete coverage of the locus were investigated manually, and any identified as novel alleles were assigned new ID numbers of >10,000. Strains with inconclusive MLST alleles were removed from further MLST analysis. A maximum-likelihood tree of concatenated MLST markers was constructed with RAxML (52) using the rapid bootstrapping algorithm plus maximum likelihood and the GTRgamma nucleotide substitution model with 100 bootstrap replicates.

Virus data set selection and protospacer identification. Genomes of all viruses identified as infecting P.
aeruginosa were downloaded from the NCBI Nucleotide database on June 23, 2015, totaling 92 unique viruses. Six previously identified proviruses from P. aeruginosa LESB58 were added using genomic coordinates from reference 12. All viruses were classified according to lifestyle (lytic, temperate, nonlytic, or unknown) based on literature descriptions. These 98 viruses and proviruses were used for all virus-related analyses.

Protospacers in virus genomes were identified via BLASTn of spacer sequences. The parameter “-task blastn-short” was used due to the short query length. An E-value cutoff of 0.01 was used to capture incomplete and imperfect matches, allowing up to four mismatches over a full-length match. PAMs were identified and partial-length matches were extended to cover the full spacer length using clDB (53), and the Hamming distance between protospacer and spacer was calculated. Protospacer matches were kept if a correct PAM sequence was present. Acceptable PAM sequences were GG or TTC, indicative of type I-F and type I-E PAMs, respectively. Any matches with a Hamming distance of >3 were filtered out of the analysis. Spacers matching protospacers on more than one distinct cluster were designated “superspacers.”

Assignment of viruses to genome clusters. To assign viruses to clusters, all virus genomes were compared using BLASTn (E < 0.001). For each pair of genomes, the proportional length alignment (PLA), or total length aligned by BLAST over the length of the query, was calculated and used as our measure of viral similarity. MCL (54) was used to cluster viruses into networks with edges weighted by PLA with a minimum PLA cutoff of 0.2.

Distributed immunity and susceptibility index. Population distributed immunity (PDI) was calculated on a per-virus basis using all possible pairs of hosts. For each host-host pair, if each host has a spacer matching the virus which is not present in the other host, PDI is 1; otherwise, PDI is 0.
At the population level, PDI is then averaged across all host-host pairs. Criteria for matching spacers and protospacers are as described above. Individual distributed immunity (IDI) is measured as a count of spacers in a host matching a virus. At the population level, IDI is averaged across all hosts. The susceptibility index (SI) is the number of host-virus pairs where the host is not immune to the virus divided by the total number of host-virus pairs. Immunity is defined as a spacer-protospacer match as described above.

Statistics. All statistical tests were performed in R versions 3.2.2 to 3.2.4 (55). Games-Howell tests were performed using the userfriendlyscience package (56). Plots were generated in R using the ggplot2 package (57).
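The distributed immunity and susceptibility measures defined above can be illustrated with a short Python sketch. This is a minimal example under stated assumptions rather than the code used in the study: the `matches` data structure (host name mapped to a dictionary of virus name to the set of spacer identifiers matching that virus), the function names, and the toy host and virus names are all hypothetical, and spacer-protospacer matching is assumed to have been done beforehand.

```python
from itertools import combinations


def pdi(matches, virus):
    """Population distributed immunity for one virus, averaged over host pairs."""
    pairs = list(combinations(sorted(matches), 2))
    if not pairs:
        return 0.0
    score = 0
    for a, b in pairs:
        spacers_a = matches[a].get(virus, set())
        spacers_b = matches[b].get(virus, set())
        # A pair scores 1 only if each host has a matching spacer the other lacks.
        if spacers_a - spacers_b and spacers_b - spacers_a:
            score += 1
    return score / len(pairs)


def idi(matches, virus):
    """Individual distributed immunity: mean count of matching spacers per host."""
    counts = [len(per_virus.get(virus, set())) for per_virus in matches.values()]
    return sum(counts) / len(counts)


def susceptibility_index(matches, viruses):
    """Fraction of host-virus pairs in which the host has no matching spacer."""
    total = len(matches) * len(viruses)
    susceptible = sum(
        1
        for per_virus in matches.values()
        for virus in viruses
        if not per_virus.get(virus)
    )
    return susceptible / total


# Hypothetical toy data: three hosts, two viruses.
example_matches = {
    "hostA": {"phi1": {"s1", "s2"}},
    "hostB": {"phi1": {"s2", "s3"}},
    "hostC": {},
}
example_viruses = ["phi1", "phi2"]
```

With this toy data, only the hostA-hostB pair contributes to PDI for phi1 (one of three pairs), IDI for phi1 is 4/3, and four of the six host-virus pairs are susceptible.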


