RefSeq genomes for P. aeruginosa, A. baumannii, Neisseria meningitidis, Staphylococcus epidermidis, Streptococcus pyogenes, Francisella tularensis, Mycobacterium tuberculosis, Neisseria gonorrhoeae, E. faecium and E. faecalis were retrieved from NCBI between December 2020 and January 2021 in nucleotide fasta format using ncbi-genome-download v0.3.0 (https://github.com/kblin/ncbi-genome-download). The strains were selected to include a variety of CRISPR-Cas system types, pathogenic lifestyles (i.e. obligate pathogens such as F. tularensis, M. tuberculosis and N. gonorrhoeae as well as opportunistic pathogens like P. aeruginosa, S. aureus and S. pyogenes), and species of significance in the spread of ABR such as A. baumannii and E. faecium. The species used are summarised in Table 1.
- View inline
- View popup
Table 1 –
Summary of species included
Identification of CRISPR-Cas systems, ABR genes, plasmids, ICEs, and intI1
CRISPR-Cas systems were detected using CRISPRCasTyper 20, and orphan CRISPR arrays were not included in any spacer analysis. CRISPR-Cas systems with “Unknown” predictions were also removed from the dataset. Acquired ABR genes and plasmid replicons were identified using abricate (https://github.com/tseemann/abricate) with the NCBI AMRFinderPlus 21 and PlasmidFinder 22 databases, both from 19/04/2019 and containing 5386 and 460 sequences, respectively. To detect ICEs and intI1, custom BLAST databases were created based on the ICEberg database (retrieved January 2021; Liu et al. 2019), and the NCBI gene entry for class 1 integron integrase intI1 NC_019069.1 (99852.100865, complement), and used with abricate.
Identification of spacer targets and MGEs carrying ABR genes
Spacer sequences were searched against BLAST databases constructed from the aforementioned ABR, ICE and prophage databases as well as a curated plasmid database 24, and the complete RefSeq viral release (retrieved February 2021). Only hits with 100% identity were used for downstream analyses. The total number of spacers per genome was approximated by subtracting 1 from the number of repeats for each array and summing these values, and any spacers with no hits in any database were assigned as “unknown”.
MGEs carrying ABR genes were identified by running a blastn search with default parameters, querying all sequences in search databases used for ICE, plasmid and virus detection against a custom BLAST database built from the aforementioned NCBI AMRFinderPlus database. Subsequently, a percent identity cutoff of 90% was applied to hits. The presence of ABR genes on MGEs was classified in a binomial way, in which ≥ 1 ABR gene was counted as the presence of ABR on the element. In cases where multiple MGEs were targeted by a single spacer, the spacer was classified as targeting an MGE with ABR if at least one of these MGEs carried at least one ABR gene.
Data analysis and statistical modelling
The University of Exeter’s Advanced Research Computing facilities were used to carry out this work. Data analyses were conducted in R v4.0.2 using the tidyverse suite of packages 25. Unless otherwise specified, linear or generalized linear models were fitted using the lm or glm functions in base R. Linear models allow the inclusion of multiple predictors in one model, termed fixed effects, to estimate the relationship between these predictors and a response variable. These models generate a formula that calculates the mean of a response variable based on the value of each fixed effect (predictor). Regression coefficients are fitted, which give the slope and the intercept (value of the response variable when the predictor = 0) for each fixed effect. For categorical predictors, a different intercept is fitted for each level of the predictor (e.g. each species in a multispecies model). To fit different slopes for levels of categorical predictor, interaction terms must be fitted between them and a continuous predictor. For example, an interaction between species and the count of ABR genes would allow varying slopes and intercepts for each individual species.
For these analyses, a maximal model was generated and all possible candidate models were compared using the AIC method using dredge from the MuMIn package 26. AIC values assess the fit of a model by looking at the likelihood of a model given the data, penalising for increased number of parameters (as increased complexity of the model increases parameter uncertainty). The model with the lowest AIC value was selected, and no alternative models were within 2 AICs of the best fitting model.
Prediction data frames with 95% confidence intervals were generated using ggpredict from the package ggeffects 27, model dispersion was tested and scaled residuals were examined using DHARMa residual diagnostics 28, and the final predictions were visualised with ggplot2, ghibli 29 and cowplot (https://wilkelab.org/cowplot/index.html). Prediction plots were produced only for biologically realistic combinations of fixed effects and response, i.e. those where sufficient data was present in the data set.
Associations between genome length and CRISPR-Cas presence/absence were tested using a linear model with genome length as the response variable and CRISPR-Cas presence/absence and species as fixed effects, with CRISPR-Cas presence/absence × species as an interaction term.
The effect of various MGEs on the distribution of CRISPR-Cas was modelled using a binomial GLM with CRISPR-Cas presence/absence as the response variable and counts of ICEs, plasmid replicons, ABR genes and intI1 copies, as well as species, as fixed effects. In addition, we fitted an interaction between species and every other fixed effect. Prediction data frames were created up to the maximum number of each MGE detected in each species (unless this value was an outlier, in which case a more biologically informative upper limit was set).
The relationship between spacers targeting MGEs with and without ABR and the count of ABR genes was modelled using a Bayesian Poisson GLM in MCMCglmm. The default weakly informative priors were used, and models were run for 20,000 iterations with a burn in period of 10,000 iterations and a thinning interval of 5. Two models were run, with the number of ABR genes as the response variable, species as a fixed effect, and either the counts of spacers targeting MGEs with ABR or targeting MGEs without ABR as an additional fixed effect. Prediction data frames were created up to the maximum number of spacers detected targeting MGEs with and without ABR.
Construction of trees and genetic distance-controlled single-species models
The package mashtree 30 with mash v2.0.0 31 was used to create neighbour-joining trees for each species based on whole genomes. Genomes were sketched using mash with default parameters prior to using mashtree. Mash distance was calculated for all isolates relative to the first genome entry for the group, in order to remove outliers with a distance > 0.1, which are likely to have had their species misclassified. Trees were subsequently rooted to outgroups and ultrametric trees were produced with the ape package in R 32 using the chronos function with smoothing parameters (λ) of 1 and 10. The “correlated”, “discrete” and “relaxed” models were used for each value of λ, which vary in the extent to which they permit adjacent parts of the tree to evolve at different rates. The tree with the highest log-likelihood for each species was used. Due to the large number of genomes and low prevalence of CRISPR-Cas, we did not perform this analysis for S. aureus.
The MCMCglmm package 33 was used to run Bayesian GLMs incorporating trees. The default weakly informative priors were used, and all models were run for 30,000 iterations with a burn in period of 20,000 iterations and a thinning interval of 5. Poisson GLMs were run for each species, with count of ABR genes as the response variable and CRISPR-Cas type or spacer count (in which case genomes with no CRISPR-Cas were removed) as a fixed effect. In cases where more than one CRISPR-Cas type was present, CRISPR-Cas type × total spacers was fit as an interaction. The small sample size for S. epidermidis with CRISPR-Cas systems (n = 54), meant we chose not to model the association between spacer count and ABR genes for this species. Prediction data frames for spacer counts were created up to the maximum count of spacers detected for that species, with the exception of S. pyogenes, for which the top value was an outlier at 33 so the next highest value of 17 was taken.
Code availability and reproducibility
The snakemake pipeline used to calculate genome lengths and detect CRISPR-Cas systems, ABR genes, intI1, ICEs and plasmid replicons, as well as R scripts used for producing trees and statistical analysis for this work are available at https://github.com/elliekpursey/crispr-pathogens. Snakemake v5.18.1 34 and Python v3.8.3 with Biopython v1.78 35 were used for this work.
The acquisition of antibiotic resistance genes via horizontal gene transfer is a key driver of the rise in multidrug resistance amongst bacterial pathogens. Bacterial defence systems per definition restrict the influx of foreign genetic material, and may therefore limit the acquisition of antibiotic resistance. CRISPR-Cas adaptive immune systems are one of the most prevalent defences in bacteria, found in roughly half of bacterial genomes, but it has remained unclear if and how much they contribute to restricting the spread of antibiotic resistance. We analysed ~40,000 whole genomes comprising the full RefSeq dataset for 11 species of clinically important genera of human pathogens including Enterococcus, Staphylococcus, Acinetobacter and Pseudomonas. We modelled the association between CRISPR-Cas and indicators of horizontal gene transfer, and found that pathogens with a CRISPR-Cas system were less likely to carry antibiotic resistance genes than those lacking this defence system. Analysis of the mobile genetic elements targeted by CRISPR-Cas supports a model where this host defence system blocks important vectors of antibiotic resistance. These results suggest a potential “immunocompromised” state for multidrug-resistant strains that may be exploited in tailored interventions that rely on mobile genetic elements, such as phage or phagemids, to treat infections caused by bacterial pathogens.