Genomic data processing and assembly
Reads from 601 B. fragilis isolates from the Zhao et al. study 40 were downloaded from the NCBI BioProject Accession PRJNA524913, henceforth referred to as the ‘Zhao2019 dataset’. Raw shotgun sequencing reads were trimmed using Trimmomatic v0.39 41 (parameters used: LEADING:5 TRAILING:5 SLIDINGWINDOW:4:10 MINLEN:60). Trimmed reads were then assembled using SPAdes v3.12 42 with default settings. FragGeneScan 43 was then used to predict protein coding genes of metagenome assemblies.
A total of 222 B. fragilis reference genomes, 16 complete and 202 draft genomes, were downloaded from the NCBI ftp website as of Jan 18, 2021. A list of genomes included in this analysis can be found at the CRISPRone website http://omics.informatics.indiana.edu/CRISPRone/Bfragilis.
Characterization of CRISPR-Cas systems
To identify CRISPR-Cas systems in B. fragilis genomes, we utilized CRISPRone 44 which predicts both CRISPR arrays and cas genes within a given input genome sequence. Predicted CRISPR-Cas systems were then further refined through a reference based approach. Repeat sequences of CRISPRone predicted spacers were extracted and clustered to obtain consensus reference repeats using CD-HIT-EST 45 with 85% sequence identity. Consensus reference repeats were then used as input for CRISPRAlign 46, a reference based approach to identify CRISPR arrays. As the exact boundaries of CRISPR arrays predicted by de novo approaches may sometimes be blurred due to small CRISPR arrays, sequencing errors, and mutations in repeat sequences, we utilize a reference based approach to redefine the repeat-spacer boundaries of CRISPR arrays predicted by CRISPRone.
To compare spacer sequences across different arrays, reduce spacer redundancy, and the eventual computation of spacer content heterogeneity, spacers were clustered with CD-HIT-EST 45 at 85% sequence identity. An 85% sequence identity was used to provide greater flexibility in spacer sequences, and allow for a small amount of sequence variation either due to sequencing error or real mutations found between individual spacers. Spacer sequences that clustered together were considered identical spacer sequences. Spacer clusters were reserved for downstream computation of spacer content heterogeneity (Figure 1A) and construction of compressed spacer graphs (Figure 1B).
- Download figure
- Open in new tab
Approaches used for the identification and refinement of the CRISPR arrays and construction of spacer graphs. (A) CRISPR arrays are analyzed in groups such that each group shares identical or very similar repeats (repeats are shown as diamonds and spacers are shown as boxes). CRISPR arrays containing no spacer variance or no CRISPR spacer hetergeneity are discarded. (B) Example of spacer sharing CRISPR arrays can be represented as a simplified graphical structure (spacer graph), in which the edges record the ordering of the spacers in arrays.
In some cases, it may be difficult to differentiate between true CRISPR-Cas systems and false positive CRISPR-Cas systems (e.g., false CRISPR-arrays, inactive CRISPR-Cas systems). While manual curation can help filter out some of these issues, it becomes difficult to screen out hundreds to thousands of genomes. To help filter out false positive arrays and inactive CRISPR-Cas systems, we propose a metric of heterogeneity to measure the rate of change (i.e., growth and turnover of spacers) in CRISPR arrays with the assumption that CRISPR arrays of active CRISPR-Cas systems undergo active expansion and turnover of spacers. Here we define spacer content heterogeneity score as: Where n is defined as the number CRISPR arrays, with each CRISPR array containing c_1, _c_2,…, _cn unique spacers (in some rare cases, CRISPR arrays may contain multiple copies of the same spacer, which will be considered as one spacer) and m denotes the number of unique spacers found from all n arrays combined. Spacer heterogeneity scores range from 0 to 1, where 0 indicates no spacer heterogeneity (i.e., two CRISPR arrays share all spacers), and 1 indicates the greatest possible extent of spacer content heterogeneity (i.e. two CRISPR arrays share no spacers).
Because spacer content heterogeneity alone is not enough to rule out false positive or inactive CRISPR-Cas systems, predicted CRISPR-Cas systems were further filtered out by coupling spacer content heterogeneity with gene content information. CRISPR groups that lack spacer content heterogeneity and had no adjacent cas genes were considered inactive or false positive, and thus discarded from further analysis; all filtered arrays were also manually inspected prior to their removal.
Compressed spacer graph for summarizing the sharing of spacers among a group of CRISPR arrays
Compressed spacer graphs 47 were constructed for each CRISPR-Cas type to summarize and illustrate CRISPR array dynamics. For every spacer in a given array, where each spacer was represented by a node of its representative spacer cluster, a directed edge was built between nodes of neighboring spacers in sequential order. Once all CRISPR arrays were represented in the graph structure, the spacer graph was then simplified by collapsing neighboring nodes if two neighboring nodes shared an “in-degree” and “out-degree” equal to or less than one (Figure 1B). Compressed spacer graphs highlight CRISPR array structure and dynamics (e.g. branching structures representing spacer gain and loss). Arrays that share no spacers result in disconnected components in the compressed spacer graph.
Mobile Genetic Element Databases
A collection of mobile genetic element (MGE) databases were gathered, including phage and plasmid databases. The phage databases included the Gut Phage Database 48, MicrobeVersusPhage 49 database, and the reference viral database 50. The plasmid databases included the Comprehensive and Complete Plasmid Database 51, and PLSDB 52. The phage and plasmid databases included sequences from the NCBI reference database, NCBI nucleotide database, MGEs identified from metagenomic assemblies, and prophages identified in prokaryotic genomes. We collectively refer to these databases as the ‘MGE database’ for simplicity.
Identification of CRISPR Targets
All unique spacer sequences extracted from B. fragilis’ CRISPR arrays were queried against the MGE database using BLASTN 53 to search for putative invaders that were targeted by B. fragilis. For this analysis, we used all unique spacers instead of 85%-similarity nonredundant set to increase the search sensitivity. Results were filtered to retain hits with a greater than 90% sequence identity, query coverage per hsp greater than 80%, and an e-value of less than 0.001. We noticed that even after dereplication by dRep 54 (with default parameters), there was still a large redundancy in the identified MGEs. Instead, we devised a greedy algorithm to select the minimum number of MGEs that collectively contain all protospacers matching the spacers. Similarly, we selected the minimum number of B. fragilis isolates that contained all identified spacers and only included them in the network. Selected MGEs and isolates are then used for building spacer-MGE and host-MGE networks. In the spacer-MGE network, spacer sequence clusters (called spacers for simplicity) and MGEs are represented as nodes and an edge is added between a spacer node and MGE node if the MGE contains a segment that matches the spacer (i.e., protospacer). In the host-MGE network, an edge is added to a host and a MGE if the host and MGE pair contain at least one matching protospacer and spacer. For MGEs that are phages (or prophages), we applied PhaGCN 55 to assign their taxonomic groups (ICTV 56 families). All visualizations and manual inspection of the networks were performed using Cytoscape 57.
Article TitleDiversity and dynamics of the CRISPR-Cas systems associated with Bacteroides fragilis in human population
CRISPR-Cas systems are adaptive immune systems commonly found in prokaryotes that provide sequence-specific defense against invading mobile genetic elements (MGEs). The memory of these immunological encounters are stored in CRISPR arrays, where spacer sequences record the identity and history of past invaders. Analyzing such CRISPR arrays provide insights into the dynamics of CRISPR-Cas systems and the adaptation of their host bacteria to rapidly changing environments such as the human gut. In this study, we utilized 601 Bacteroides fragilis genome isolates from 12 healthy individuals, 6 of which include longitudinal observations, and 222 available B. fragilis reference genomes to update the understanding of B. fragilis CRISPR-Cas dynamics and their differential activities. Analysis of longitudinal genomic data showed that some CRISPR array structures remained relatively stable over time whereas others involved radical spacer acquisition during some periods, and diverse CRISPR arrays (associated with multiple isolates) co-existed in the same individuals with some persisted over time. Furthermore, features of CRISPR adaptation, evolution, and microdynamics were highlighted through an analysis of host-MGE network, such as modules of multiple MGEs and hosts, reflecting complex interactions between B. fragilis and its invaders mediated through the CRISPR-Cas systems. This work demonstrates the power of using culture-based population genomics to reveal the activities and evolution of the CRISPR-Cas systems associated with gut bacteria in human population. We made available of all annotated CRISPR-Cas systems and their target MGEs, and their interaction network as a web resource at https://omics.informatics.indiana.edu/CRISPRone/Bfragilis.