Materials and Methods

Comparative Genomic Analysis ofMycobacteriaceaeReveals Horizontal Gene Transfer-Mediated Evolution of the CRISPR-Cas System in theMycobacterium tuberculosisComplex

MATERIALS AND METHODSGenome sequences and CRISPR-Cas classification. All available (141) genome sequences of Mycobacteriaceae covering five genera were downloaded from NCBI-RefSeq website on 24 August 2019. The genomic data and annotations were obtained from NCBI-FTP ( CRISPR loci were predicted using CRISPRCasFinder ( with default parameters (22). CRISPRCasFinder comprises of a rating system based on several features. Short candidate arrays made up of one to three spacers often do not correspond to real CRISPRs and are therefore given the lowest evidence level, level 1. Evidence levels 2 to 4 are attributed on the basis of the combined degrees of similarity of repeats and spacers. Arrays with evidence level 1 or 2 indicate potentially false-positive results and were not considered for our analysis. Additionally, all predicted loci were manually checked, and those located in coding regions were discarded. CRISPRmap (v1.3.0-2013) (23) was used to provide conserved motifs, family, and superclass based on structural and sequence similarities.Cas proteins were identified using CRISPRCasFinder and HMMER 3 (46) against a collection of 395 Cas protein profiles obtained from a previous study (15). cas genes were annotated and naming of cas genes, and their classification into types and subtypes was carried out as described by Makarova et al. (15). Cas proteins were also cross-verified from the respective NCBI genome annotations.16S rRNA gene sequence-based phylogenetic tree construction. 16S rRNA gene-based comparative phylogenetic analysis was performed for all downloaded genomes. 16S rRNA sequences were obtained from the NCBI genome annotation files of the downloaded genomes (downloaded on 24 August 2019; see Table S1a in the supplemental material). To create the tree, multiple sequence alignment of the 16S rRNA gene sequences corresponding to the 141 gene copies were performed by MAFFT v7 using default parameters (47). The alignment was used to compute a maximum likelihood phylogenetic tree using the GTR+G model in RAxML-NG v1.0.1 (48), and branch support was computed with 1,000 bootstrap replicates. The tree was midpoint rooted and visualized by iTOL software ( CRISPR spacer target identification and relatedness. All available complete genomes of phages and plasmids were downloaded from the NCBI ftp server on 7 July 2020. Redundant genomes were removed, and a database was constructed using the NCBI-BLAST+ 2.9.0 command-line tool. BLASTN was performed for all CRISPRCasFinder-identified spacers against NCBI phage and plasmid databases, with 90% identity and query coverage. Significant matches were summarized in bipartite networks with edges between spacers and their targets and visualized using the Cytoscape software (49). Edges between network nodes were assigned when a protospacer matching a spacer in a given host was identified in a phage.CRISPRStudio (50) was used to visualize the CRISPR locus using default parameters; it compares spacer sequences present in a CRISPR array and then clusters them based on sequence similarities. To identify unique spacers at default settings, it considers spacer pairs with ≤2 mismatches as identical. CRISPRStudio requires gff3 file format as an input generated by CRISPRDetect. CRISPR loci common to both CRISPRCasFinder and CRISPRDetect (51) were used for visualization by CRISPRStudio.STB phylogeny and sequence analysis. All available STB strains were downloaded from NCBI-RefSeq website on 24 August 2019. Core genome alignment was done using Roary v3.13.0 for all the genomes (52). Alignment also included genomes of M. tuberculosis H37RV (GenBank accession number {"type":"entrez-nucleotide","attrs":{"text":"NC_000962","term_id":"448814763","term_text":"NC_000962"}}NC_000962), M. tuberculosis RW-TB008 (accession number {"type":"entrez-nucleotide","attrs":{"text":"NZ_CP048071","term_id":"1836255873","term_text":"NZ_CP048071"}}NZ_CP048071), M. tuberculosis variant bovis (accession number {"type":"entrez-nucleotide","attrs":{"text":"NC_016804.1","term_id":"378769743","term_text":"NC_016804.1"}}NC_016804.1), and NTMs (M. kansasii accession number {"type":"entrez-nucleotide","attrs":{"text":"NZ_CP019888.1","term_id":"1178595760","term_text":"NZ_CP019888.1"}}NZ_CP019888.1 and M. marinum accession number {"type":"entrez-nucleotide","attrs":{"text":"NZ_HG917972","term_id":"914684479","term_text":"NZ_HG917972"}}NZ_HG917972). A maximum likelihood phylogenetic tree was constructed using the GTR+G model in RAxML-NG v1.0.1, and branch support was computed with 1,000 bootstrap replicates, using M. marinum as an outgroup. The tree was visualized using iTOL.Genomic island prediction. Genomic islands (GIs) were identified and analyzed by IslandViewer 4 (31), which integrates four different and accurate GI predictor tools: IslandPath-DIMOB, SIGI-HMM, IslandPick, and Islander. IslandViewer was used to analyze GIs in STB-A (GenBank accession number {"type":"entrez-nucleotide","attrs":{"text":"NC_015848.1","term_id":"340625033","term_text":"NC_015848.1"}}NC_015848.1) and M. tuberculosis genome (accession number {"type":"entrez-nucleotide","attrs":{"text":"NC_018143.2","term_id":"561108321","term_text":"NC_018143.2"}}NC_018143.2).Phylogenetic analysis of CRISPR repeats and Cas10 families. Consensus CRISPR repeat sequence of MTBC and M. canettii genomes was used as a query sequence against CRISPRmap repeat database (v1.3.0-2013). CRISPRmap program constructs a hierarchical cluster tree based on multiple sequence-structure alignment of repeat sequences and the minimum free energy (MFE) structures generated by LocARNA to find relatedness (53) and provide a consensus MFE structure. A separate alignment of CRISPR DR RNA sequence of M. tuberculosis, M. canettii, and S. thermophilus was performed by T-COFFEE at default parameters ( The sequence logo of CRISPR repeats of aligned family members was obtained using WebLogo (54).To find the closest relative of M. tuberculosis Cas10, the global collection of Cas10 sequence data described in a study by Makarova et al. (15), was downloaded from the Batch Entrez website ( Multiple sequence alignments were performed to align closely related sequences by MAFFT v7, and it was also used to merge these alignments. The phylogenetic tree was reconstructed using LG+G model in the IQTREE v1.6.2 (55). The same program was used for 1,000 bootstrap calculation. Phylum level classification was done manually with the help of the NCBI-Taxonomy browser ( for tree annotation. The tree was midpoint rooted and visualized using iTOL.Whole-genome core SNP phylogeny of M. tuberculosis lineages and CRISPR-Cas loci prediction. We downloaded M. tuberculosis lineages (lineages 1 to 8 and animal-adapted lineages) as reference data sets comprising of 48 genomes from three different studies, namely, 24 genomes from the study of Coll et al. (34), 18 genome sequences from Phelan et al. (35), five genome sequences from Nebenzahl-Guimaraes et al. (36) and one genome sequnece from Ngabonziza et al. (7) (Table S5a). M. canettii (GenBank accession number {"type":"entrez-nucleotide","attrs":{"text":"NC_015848.1","term_id":"340625033","term_text":"NC_015848.1"}}NC_015848.1) was used as an outgroup. Core genome SNP calling was done using Snippy v4.4 ( using M. tuberculosis H37Rv as the reference genome. SNP-IT v1.1 (37) program was also used to predict and confirm M. tuberculosis lineages. Core genome SNP alignment file was used to compute a maximum likelihood phylogenetic tree using the GTR+G+ASC_LEWIS model in RAxML-NG v1.0.1 (48). Branch support was computed with 1,000 bootstrap replicates. The tree was visualized using the iTOL server. The CRISPR-Cas locus was predicted using CRISPRCasFinder. The presence of Cas10 HD and cyclase domains was predicted using Scanprosite ( and Cas10 HD and cyclase domains were aligned using MAFFT v7.IS element annotation within the CRISPR-Cas region. CRISPR-Cas loci were predicted using the CRISPRCasFinder, as described earlier. The loci were extracted from the genomes using in-house perl script. Two sequences ({"type":"entrez-nucleotide","attrs":{"text":"NC_020089","term_id":"479053932","term_text":"NC_020089"}}NC_020089 and {"type":"entrez-nucleotide","attrs":{"text":"NC_016934","term_id":"392384721","term_text":"NC_016934"}}NC_016934) were removed due to assembly gaps in the CRISPR repeat region. IS6110 transposase was annotated using Prokka v1.14 (56) in the CRISPR-Cas locus genome segment and aligned using NCBI-BLASTN using default parameters. Easyfig v2.2.3 was used to map and compare the CRISPR-Cas loci among M. tuberculosis lineage genomes. M. tuberculosis genome (accession {"type":"entrez-nucleotide","attrs":{"text":"NC_000962","term_id":"448814763","term_text":"NC_000962"}}NC_000962) was considered the reference genome, and M. canettii (accession {"type":"entrez-nucleotide","attrs":{"text":"NC_015848","term_id":"340625033","term_text":"NC_015848"}}NC_015848) was considered the ancestral genome. For identification of IS6110 insertion at CRISPR loci, M. tuberculosis (accession number {"type":"entrez-nucleotide","attrs":{"text":"NC_000962","term_id":"448814763","term_text":"NC_000962"}}NC_000962) complete CRISPR-Cas locus sequence was extracted using in-house perl script. Overlapping region between IS6110 and CRISPR repeats was analyzed and represented with CRISPR-Cas locus visualization generated using Easyfig v2.2.3 (57). We manually extracted genome sequence present between CRISPR2 and IS6110 to identify a potential spacer sequence, as CRISPRCasFinder and CRISPRDetect were unable to detect it due to the split in the flanking CRISPR repeat. Further, BLASTN was performed with the retrieved sequence as a query against all MTBC spacer sequences using default parameters.

Article TitleComparative Genomic Analysis ofMycobacteriaceaeReveals Horizontal Gene Transfer-Mediated Evolution of the CRISPR-Cas System in theMycobacterium tuberculosisComplex


Genomic comparisons ofM. tuberculosiswith related bacteria offer valuable insights into the evolutionary history and emergence of pathogenic strains. The present study, a comprehensive comparative genomic analysis of 141 mycobacterial genomes, showed the exclusive presence of the CRISPR-Cas type III-A system in MTBC. Further analysis revealed that CRISPR-Cas type III-A system was likely acquired in the last common ancestor of STB-A and MTBC by a HGT-driven acquisition. The plausible source seems to be aStreptococcus-like environmental bacterium. Our work reveals that although the genomic organization of CRISPR-Cas locus is conserved inM. tuberculosislineages, certain specific strains show considerable deletions. These deletions, best exemplified in Beijing sublineage members, are driven by active transposition of IS6110, which utilize the DR of CRISPR region for insertion. This work delineates the evolutionary events such as HGT and IS6110-driven genomic variations in mycobacteria to better comprehend the epidemiology ofM. tuberculosislineages.

Login or Signup to leave a comment
Find your community. Ask questions. Science is better when we troubleshoot together.
Find your community. Ask questions. Science is better when we troubleshoot together.

Have a question?

Contact or check out our support page.