MATERIALS AND METHODSSample collection. Five individuals were asked to participate in this study, and swab samples were collected for spacer sequencing. The participants provided information regarding their gender, age range, systemic antibiotics usage in the past 6 months, steroid usage in the past 6 months, skincare product usage in the past 24 h, and handwashing in the past 2 h. Data Set S1, sheet 9, in the supplemental material summarizes the detailed information of samples. Skin surfaces (approximately 5 by 5 cm) were swabbed with cotton-tipped swabs soaked in water. The Hp, Ac, and Ra, representing dry, moist, and sebaceous sites, respectively, were selected for swabbing (2). For the initial sampling (time point 1), only an Hp sample was obtained, except for one individual (P05). Two to 3 months later (time point 2), three skin sites (Hp, Ac, and Ra) were sampled for all individuals. After swabbing, the cotton tips were separated and placed in 2-ml tubes. A premoistened swab tip placed directly in the 2-ml tube was used for a blank for DNA extraction (EC). Samples were stored at −20°C before DNA extraction. All procedures involving human participants were approved by the Institutional Ethics Committee of the National Research Institute of Police Science.Metagenomic data sets. Metagenomic data sets from the skin of the palm were downloaded from the European Nucleotide Archive in fastq format. The samples used in this study have been registered under BioProject PRJNA46333. Data collection methods were performed in accordance with the method previously reported by Oh et al. (2). Samples containing multiple runs were merged before assembly.Identification of CRISPR arrays using metagenomic data sets and isolation of CRISPRs conserved across multiple individuals. The scheme for CRISPR reconstruction from the metagenomic reads and isolation has been summarized in Fig. S6. Reads were filtered using Trimmomatic 0.39 to remove low-quality reads (66), bases with low-quality scores (Q < 20) were trimmed from both ends, and short reads (<50 bp) were removed. As the file was too large to assemble, 8 × 107 reads (22%) were subsampled from the largest data set (SRX743731) using seqtk (https://github.com/lh3/seqtk). De novo assembly of the metagenome was performed using MEGAHIT v1.1.4 (39). The data sets were assembled with default parameters, except that the k-mer sizes were fixed as –k-list 29,39,59,79,99. Output contigs were analyzed using the PILER-CR v1.06 with default parameters to detect the CRISPR arrays (40). Each repeat in a CRISPR array detected by the PILER-CR was converted to a single read. All reads were then duplicated, and one of the read pairs was reverse complemented, as the orientation of repeat sequence was unclear.FIG S6Schematic of CRISPR reconstruction from the metagenomic reads and identification of shared CRISPR repeats. De novo assembly of the metagenome was performed with MEGAHIT v1.1.4. Output contigs were analyzed using PILER-CR v1.06 with default parameters to detect the CRISPR arrays. All reads were then duplicated, and one of the read pairs was reverse complemented. Repeat reads were then dereplicated and clustered into OTUs by QIIME 2 v2019.7. The duplicated reverse complement repeat reads were discarded after OTU clustering. Download FIG S6, PDF file, 0.2 MB.Copyright © 2021 Toyomane et al.This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.Repeat reads were then analyzed using QIIME 2 v2019.7 (45). First, dereplication and de novo OTU clustering at 90% were performed using the q2-vsearch plug-in (67). The repeat reads were detected in fewer than two individuals, and the duplicated reverse complement reads were discarded. The representative CRISPR repeats of each OTU were annotated by BLAST using default settings (68). The subject with the highest max score was defined as the host species. Since CRISPRs are mobile genetic elements, multiple species were often found to share the same repeat (69); in such cases, one representative species was recorded. Eventually, heat maps of the repeat reads on the OTU table were constructed using the heatmap visualizer of the q2-feature-table plug-in (70). The type and orientation of CRISPRs were determined by a search for the same repeats in CRISPRCasdb (44).We also utilized CRISPRCasFinder to identify cas genes associated with the CRISPR locus found in metagenomic contents (42). The contigs assembled by MEGAHIT were used as initial inputs. We initially extracted the contigs that contained CRISPRs with evidence level 4 and were >1,000 bp in length using CRISPRCasFinder with default settings. The evidence level was determined by CRISPRCasFinder. Subsequently, the extracted contigs were subjected to CRISPRCasFinder again to determine if the contigs contained cas genes and to subtype the identified cas genes based on the nomenclature proposed by Makarova et al. (41) using CRISPRCasFinder with the -cas parameter. Three representative contigs containing S. equinus CRISPR arrays were compared with the genomes of Streptococcus strains (Data Set S1, sheet 10) using CRISPR Visualizer (71).Measurement of DRs in the metagenomic data set by BLAST search. The number of reads containing each repeat in a data set was measured using BLAST. The sequences of repeats were used as queries, and the skin metagenome data sets in the Sequence Read Archive (SRA) of NCBI were used as subjects. The data sets used as subjects were identical to those used in the CRISPR reconstruction, except for samples from nares or cheek. Two samples (SRX740760 and SRX740761) from nares or cheek, representing a moist or oily site, respectively, of the same individual (HV07) were also included to evaluate repeat variation at different skin sites from an individual. Default parameters were used for the short-read search except that the maximum number of target sequences was set as 5,000. Reads with >90% identity and >90% query cover were included in the downstream analysis. A hierarchically clustered gradient heat map of repeat frequency was plotted using the clustermap method in the Seaborn package 0.9.0 (https://seaborn.pydata.org/generated/seaborn.clustermap.html) in Python 3.6.8.DNA extraction. Genomic DNA was extracted from swabs using the DNeasy PowerSoil kit (Qiagen, Aarhus, Denmark) with some minor modifications (1, 72). Briefly, frozen cotton tips were transferred to the bead tube provided with the DNeasy PowerSoil kit, followed by incubation at 70°C for 15 min after the addition of the C1 solution. Tubes were then horizontally shaken at 3,200 rpm for 2 min on a bead beater-type homogenizer μT-12 (TAITEC, Saitama, Japan). Remaining steps were performed according to the manufacturer’s instructions. Genomic DNA was eluted in 100 μl of the C6 solution.Amplification and sequencing of spacers. To amplify the spacer regions of the putative CRISPRs, three pairs of primers were designed based on the specificity of the CRISPR repeat, namely, S. equinus, S. thermophilus, and M. luteus (Data Set S1, sheet 11). These primer pairs are specific to the 3′ region of each CRISPR repeat so that the primer pair flanks a spacer. The length of primers was determined to meet the melting temperature requirement on the Illumina 16S Metagenomic Sequencing Library Preparation protocol (https://www.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/16s/16s-metagenomic-library-prep-guide-15044223-b.pdf). PCR was performed in a 50-μl reaction volume using the KOD FX (Toyobo, Osaka, Japan). Each reaction mixture consisted of 1× PCR buffer for KOD, 0.4 mM deoxynucleoside triphosphate (dNTPs), 0.3 μM each primer, 1.0 U of KOD FX, and 2 μl of the DNA template. EC and a nontemplate control (NTC) for PCR were included as negative controls. The following conditions were used for amplification: 2 min at 94°C, 35 cycles of denaturation at 95°C for 30 s, annealing at 60°C for 30 s, and extension at 72°C for 30 s, followed by final extension at 72°C for 5 min. Next, 25 μl of the products was purified using 1.8× volume of the Agencourt AMPure XP kit (Beckman Coulter, Brea, CA) twice and washed using freshly prepared 80% ethanol. DNA was eluted in 52.5 μl of 10 mM Tris (pH 8.5) and 0.1% Tween 20 elution buffer. If low-molecular-weight bands were observed using agarose gel electrophoresis, additional purification was performed using 1.8× volume of the Agencourt AMPure XP kit. Purified DNA was indexed using the Nextera XT DNA index kit (Illumina, San Diego, CA) and KOD FX in a reaction volume of 50 μl. Each reaction mixture consisted of 1× PCR Buffer for KOD, 0.4 mM dNTPs, 4.5 μl of each index primer, 1.0 U of KOD FX, and 5 μl of the purified DNA as the template. The index PCR product was purified using 1.8× volume of the Agencourt AMPure XP kit and eluted in 27.5 μl of 10 mM Tris (pH 8.5) and 0.1% Tween 20 elution buffer. The indexed libraries were quantified using GenNext NGS library quantification kit (Toyobo) and then diluted to 4 nM with an elution buffer. If the library concentration was less than 4 nM, the library was pooled with other libraries without dilution. Denaturing and final dilution were performed according to the manufacturer’s instructions. Each run included >15% of the PhiX control v3 (Illumina) to improve sequencing quality. Paired-end sequencing of 300 cycles (2 × 151) was performed on a MiSeq platform (Illumina) with the MiSeq v2 chemistry, according to the manufacturer’s instructions.Amplification and sequencing of 16S rRNA. Sequence libraries of the 16S rRNA were prepared according to the Illumina 16S metagenomic sequencing library preparation protocol with some modifications. The primer pair targeting the V1-V3 region (73, 74) was used in this study (Data Set S1, sheet 11). PCR was performed in a 25-μl reaction volume using the KAPA HiFi HotStart ReadyMix kit (KAPA Biosystems, MA). Each reaction consisted of 1× KAPA HiFi HotStart ReadyMix, 0.3 μM each primer, and 2 μl of the DNA template. The following conditions were used for amplification: 5 min at 95°C, 25 cycles of denaturation at 95°C for 30 s, annealing at 62.3°C for 30 s, and extension at 72°C for 30 s, followed by final extension at 72°C for 5 min. Then, 25 μl of the products was purified using 0.8× volume of the Agencourt AMPure XP kit and washed using freshly prepared 80% ethanol. DNA was eluted in 52.5 μl of 10 mM Tris-HCl (pH 8.5) and 0.1% Tween 20 elution buffer. Purified DNA was indexed using the Nextera XT DNA index kit (Illumina) and the KAPA HiFi HotStart ReadyMix kit in a reaction volume of 50 μl. Each reaction mixture consisted of 1× KAPA HiFi HotStart ReadyMix, 5 μl of each index primer, and 5 μl of the purified DNA as the template. The index PCR product was purified using 1.1× volume of the Agencourt AMPure XP kit and eluted in 27.5 μl of 10 mM Tris (pH 8.5) and 0.1% Tween 20 elution buffer. The indexed libraries were quantified by the GenNext NGS library quantification kit and diluted to 4 nM with elution buffer. If the library concentration was less than 4 nM, the library was pooled with other libraries without dilution. Denaturing and final dilution were performed according to the manufacturer’s instructions. Each run included >15% of the PhiX control v3 to improve sequencing quality. Paired-end sequencing of 600 cycles (2 × 301) was performed on a MiSeq platform (Illumina) with MiSeq v3 chemistry, according to the manufacturer’s instructions.Data processing for spacer analysis. Initial quality controls were carried out using the CLC Genomics Workbench 12.0.2 (Qiagen). After low-quality sequences (Q < 20) and primer sequences were removed, reads shorter than 20 bp or longer than 40 bp were filtered. The remaining spacer sequences were then analyzed using QIIME 2 v2019.7 (45). Spacer reads were denoised and dereplicated with the denoise-paired method of the q2-dada2 plug-in (75) with min_fold_parent_over_abundance = 32; the resulting ASV tables were further analyzed. Rarefaction curves were created with the q2-diversity plug-in with 100 iterations and 40 steps at a maximum depth of 10,000 reads. Based on rarefaction curve analysis, each sample was rarefied to 4,829 reads for normalization in further analysis, and samples with fewer than 4,829 spacer ASVs were omitted from the calculation (76). The EC and NTC were removed at this step; however, EC was included in S. equinus spacer analysis. Shannon indices were calculated in bits with the q2-diversity plug-in, and Kruskal-Wallis tests were performed to compare Shannon indices between samples (77). For calculation of the beta-diversity index, the Bray-Curtis dissimilarity index was calculated by the q2-diversity plug-in and used for PCoA plots. Mann-Whitney U test was calculated using the “stats.mannwhitneyu” function of the python library “SciPy” (version 1.3.0) to compare the Bray-Curtis dissimilarity indices. Hill numbers were calculated using the library “vegan” (version 2.5-4) in R (version 3.4.4) (78).To compare the CRISPR arrays in the samples, we reconstructed the arrays from spacer amplicons. The trimming and assembling described below were performed using the CLC Genomics Workbench 12.0.2. First, reads that were longer than 100 bp were extracted after low-quality sequences (Q < 20) were removed so that reads contained two spacers per read. Then, these reads were de novo assembled by “Map reads back to contigs” mode with a length fraction of 0.8 and similarity fraction of 0.9. These assembled contigs were then compared and visualized using CRISPR Visualizer (71).The spacer sequences were subjected to BLAST analysis with default settings. If the reads had E values of less than 0.01, they were further classified. If the description included “phage” or “virus,” the query was classified as “virus.” If the description included name of host genus (“streptococcus” for S. equinus or S. thermophilus and “micrococcus” for M. luteus), the query was classified as “host.” Remaining queries with significant hits were classified as “other.”To identify a subset of predictive spacers and to improve interpretability of the classification, we performed feature selection using support vector machine based on recursive feature elimination (SVM-RFE). SVM-RFE is a machine-learning technique used to select a subset of features (e.g., genes associated with a specific disease) which are relevant to sample types from a large set of features, by recursively ranking the features and eliminating irrelevant features (79). In this study, SVM-RFE was implemented using the python library “Scikit-learn” (version 0.22.1) (80). For SVMs, the linear kernel was used with default parameters. Using SVM-RFE, we selected 20 ASVs from the data set, including all samples from an individual. Then, using the selected 20 features, we predicted the individuals by linear-kernel SVM using leave-one-out cross-validation to measure predictive accuracy. However, we did not perform nested cross-validation to optimize the selection of ASVs used in the linear-kernel SVM. This is because it is difficult to split the data twice for inner and outer cross validation due to the small sample size, which may cause overfitting of the data.Data processing for 16S rRNA analysis. Reads were analyzed using QIIME 2 v2019.7 (45). Initially, 16S rRNA reads were denoised and dereplicated with the denoise-paired method of the q2-dada2 plug-in (75). The parameters of trim_left_f equals 20 and trim_left_r equals 20 were set to trim primer sequences. The resulting ASV tables and representative reads were further analyzed. Representing reads were annotated by classify_sklearn methods of the q2-feature-classifier plug-in with the SILVA 16S rRNA database v132 (80, 81). We observed contaminants derived from Sphingomonadaceae and Pseudomonadaceae, which are known as common contaminants (82), in both swab samples and NTC due to low biomass. Therefore, we removed reads annotated as Sphingomonadaceae or Pseudomonadaceae from the tables and representative reads using filter-table and filter-seqs methods of the q2-taxa plug-in. The genus-level compositions of the top 11 abundant ASVs were visualized as a bar plot using the q2-taxa plug-in. For the calculation of the beta-diversity index, the Bray-Curtis dissimilarity index was calculated using the q2-diversity plug-in and then used for PCoA plots. Mann-Whitney U test was performed using the “stats.mannwhitneyu” function of the python library “SciPy” (version 1.3.0). Feature selection and individual prediction were performed using SVM-RFE with linear kernel, as described above.Data availability. Spacer sequences are available in the SRA database under accession numbers DRA009650 and DRA010353. Meta 16S rRNA sequences are available in the SRA database under accession number DRA010342.
Article TitleEvaluation of CRISPR Diversity in the Human Skin Microbiome for Personal Identification
Spacer sequences are available in the SRA database under accession numbersDRA009650andDRA010353. Meta 16S rRNA sequences are available in the SRA database under accession numberDRA010342.