MATERIALS AND METHODS

Growth conditions. The E. coli strains cultured in this study comprised a set of 72 natural isolates known as the ECOR collection (30). LB medium was used for growth, and cultures were incubated at 37°C for 12 h.

PCR and sequencing. DNA templates were extracted from cells grown with shaking in liquid medium. After growth, cultures were centrifuged, the supernatant was removed, and the cell pellet was resuspended in 1 ml of ultrapure (Milli-Q) water. This washing step was performed a total of three times. Lysis was achieved by heating at 98°C for 10 min, and cell debris was removed by centrifugation. Finally, the supernatant containing the DNA was stored in aliquots at −20°C.

PCRs were conducted under standard conditions (annealing temperature, 55°C) with Taq polymerase (Roche) on a TC-3000 thermal cycler (Techne). Primer cysH-F (5′ CGTTTTTATTTTGCGAGCAGC 3′), hybridizing at the conserved intergenic region closest to the cysH flanking gene, was used in combination with either primer cas3E1-R (5′ TCGTCGCCCCCGTCTTTCTC 3′) or primer cas3E2-R (5′ CAGATGAATATCATTTCCTTTCG 3′), both hybridizing at equivalent positions close to the 5′ end of the cas3 gene of their respective variants. PCR products were purified with a QIAquick PCR purification kit (Qiagen). Sequencing was performed with a BigDye Terminator cycle sequencing kit on an ABI Prism 310 DNA sequencer, following the manufacturer's instructions (Applied Biosystems).

Source of sequence data. Genomic sequences were retrieved from public nucleotide databases (http://www.xbase.ac.uk/main/browse/; http://www.ncbi.nlm.nih.gov/genomes/).
In the case of ECOR strains, partial sequences of genes used for multilocus sequence typing (MLST) were downloaded from the websites of the Environmental Research Institute, University of Cork (http://MLST.ucc.ie; dinB, icdA, pabB, polB, putP, trpA, trpB, and uidA genes), and the Institut Pasteur (http://www.pasteur.fr/recherche/genopole/PF8/mlst/; adk, fumC, gyrB, icdA, mdh, purA, and recA genes). Data on repeat number and on cas and spacer content, as well as on the presence of insertion elements in the CRISPR loci of ECOR strains, were taken from a previous study (19).

Sequence analyses. Phylogenetic analyses of nucleotide sequences (for cas and MLST genes) were carried out with the program MEGA version 4 (54) from alignments conducted with CLUSTALW (http://genome.jp/tools/clustalw/) and manually edited to correct mismatches. Sequence trees were constructed using the unweighted-pair group method using average linkages (UPGMA), with distances calculated by the Jukes-Cantor model on a pairwise-deletion comparison that allowed the inclusion of partial sequences. However, the partially deleted cas3 gene of Shigella flexneri 2a strain 301 (Sf301) could not be properly aligned and was therefore excluded from the analyses.

For the construction of trees based on codon usage and on spacer absence or presence, binary clustering analyses were performed with NTSYSpc 2.0 (Exeter Software). As with the sequence data, trees were built using UPGMA. Distances were calculated by the average taxonomic distance model. For the generation of the matrix based on the combined binary data of the spacers from the 5 CRISPR arrays analyzed, an iterative procedure was used to select the characters to be considered. First, for each CRISPR array, only those spacers present in the highest number of strains (defining the spacer groups [SGs]) were considered. Next, strains not included in an SG but sharing at least one spacer with any of its members were recruited to it.
Then, each remaining ungrouped strain as well as each SG was considered a distinct character (i.e., all strains within each group were assigned the same character value). Furthermore, the same procedure was subsequently applied within each SG to define potential subgroups as new distinct characters, although this was done only if the results obtained differed from those obtained with the original SG. A CRISPR2.1 spacer present in the vast majority of strains, and thus not discriminative, and spacers that were identical but located in different loci (presumably acquired in separate events) were not considered for the generation of SGs. For the construction of trees based on codon usage, codon usage frequencies were determined with the Countcodon application (http://www.kazusa.or.jp/codon/countcodon.html) and converted into a binary matrix of characters. Either 1 or 0 was assigned to the codons of each amino acid depending on whether the score was above or below the cutoff value of 80% with respect to the particular maximum.

Analyses of recombination in the cas-E1 sequence variants were performed with the program GENECONV 1.81 (Department of Mathematics, Washington University, St. Louis, MO; http://www.math.wustl.edu/~sawyer/geneconv). Only strains bearing the complete set of cas genes were considered for the analysis. Two strains (if available) were selected for each main MLST cluster. In the case of the more abundant strains from the B1 group, at least two strains were taken from those subclades diverging by more than 0.2%. For each strain included in the analysis, the concatenated cas sequences were aligned, and the nucleotide differences for each pair were statistically tested by the program to search for recombinational events.
Pairwise comparisons rendering a Bonferroni-corrected Karlin-Altschul P value of less than 0.05 were deemed significant for recombination between the two sequences.

IS-Finder (https://www-is.biotoul.fr/) was used for the identification of insertion elements. Consensus leader sequences were obtained with WebLogo (http://weblogo.berkeley.edu/logo.cgi). CAI values were calculated using the application at http://genomes.urv.es/CAIcal/ (30), with the codon usage frequencies of the entire genome of E. coli K-12 MG1655 (http://www.kazusa.or.jp/codon/) as a reference. Three independent sets of E. coli K-12 sequences with their estimated mutation rates (μ) were selected: (i) MLST analysis (this work), (ii) lacI and his operons (55), and (iii) a collection of randomly distributed genes (56). The plot of log μ against CAI for these genes allowed us to infer a linear regression (r² > 0.99), which was used to extrapolate μ from the CAI of the different sets of cas genes (cas-E1, cas-E2, and cas-F).

Statistical analyses. Analysis of variance (ANOVA) tests were performed using SPSS software version 17.0 (SPSS Inc., Chicago, IL). A P value of less than 0.05 was considered significant.
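
As an illustration of the distance calculation described under Sequence analyses, the following Python sketch computes a Jukes-Cantor distance with pairwise deletion for two aligned sequences. It is not the MEGA implementation. The function name, the set of characters treated as missing data, and the handling of saturated comparisons are assumptions made for this example only.

```python
import math

# Assumption: gaps, '?' and 'N' are treated as missing data.
MISSING = set("-?N")


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned sequences.

    Pairwise deletion: a site is skipped only when it is missing in
    at least one of the two sequences being compared, so partial
    sequences still contribute the sites they do cover.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue  # site dropped for this pair only
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no sites shared by the two sequences")

    p = mismatches / compared  # proportion of differing sites
    if p >= 0.75:
        return math.inf  # correction undefined at saturation (assumed handling)
    # d = -(3/4) * ln(1 - (4/3) * p)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Because sites are discarded per pair rather than across the whole alignment, a partial sequence is compared with each other sequence over the positions they share, which is what allows partial sequences to be included in the distance matrix used for UPGMA clustering.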
Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (cas) genes constitute the CRISPR-Cas systems found in the Bacteria and Archaea domains. At least in some strains they provide an efficient barrier against transmissible genetic elements such as plasmids and viruses. Two CRISPR-Cas systems have been identified in Escherichia coli, pertaining to subtypes I-E (cas-E genes) and I-F (cas-F genes), respectively. In order to unveil the evolutionary dynamics of such systems, we analyzed the sequence variations in the CRISPR-Cas loci of a collection of 131 E. coli strains. Our results show that the strain grouping inferred from these CRISPR data slightly differs from the phylogeny of the species, suggesting the occurrence of recombinational events between CRISPR arrays. Moreover, we determined that the primary cas-E genes of E. coli were altogether replaced with a substantially different variant in a minor group of strains that include K-12. Insertion elements play an important role in this variability. This result underlines the interchange capacity of CRISPR-Cas constituents and hints that at least some functional aspects documented for the K-12 system may not apply to the vast majority of E. coli strains.