Legionella pneumophila CRISPR-Cas suggests recurrent encounters with Gokushovirinae

Legionella genomes used in this study

Legionella pneumophila genomes (draft and completed) were downloaded from the European Nucleotide Archive (at: (38) and from NCBI (at: (39). A complete list of accession numbers can be found in Table S1.

Some target sequence data were produced by the US Department of Energy Joint Genome Institute (at: in collaboration with the user community (Table S7).

Prophage identification

The downloaded L. pneumophila genomes were annotated with Prokka (v.1.14.6) (40) and the resulting GenBank formatted files were analyzed using PhiSpy (41) with the default settings. The original nucleotide FASTA files were analyzed using VirSorter (42) with the default settings.

CsrT amino acid sequences from mobile elements (43) identified by PhiSpy and VirSorter were extracted and aligned with MUSCLE (44) and a tree was generated using FastTree (v 2.1.11) with the default settings (45). The tree was rooted to group I, Trb (43), and visualized using the Interactive Tree of Life (iTOL) web server (46). Reference sequences were annotated as per Abbott and colleagues (43).

The 29 nucleotide LME-1 attachment site (GTCTGATTATCAAAATAATCAGACTTAAT) (21) was queried by BLASTn against L. pneumophila genomes using the following parameters: -gapopen 10 -gapextend 2 -reward 1 -penalty -1 -evalue 1 -word_size 7 (47).

Core genome identification across LME-1 variants

The FASTA sequences for unique LME-1 variants were subjected to an OrthoMCL analysis to look at the overlap in gene content. Briefly, the sequences were annotated using Prodigal (48), then were analyzed using OrthoMCL (v.2.0.9) with the default settings (49). The resulting clusters were queried by BLASTp using the NCBI viral database (taxid:10239) (47) and using HHpred (at: (50) against “UniProt-SwissProt-viral70_23_Aug_2020” using default settings to determine if any had sequence similarity to known phage genes.

Core genomes for each LME-1 variant, containing only genes present within all the variants, were aligned using MUSCLE (44) and a tree was generated using FastTree (v 2.1.11) using the default settings and the GTR model (45). The tree was visualized with ggtree (v. 2.2.4) (51), and the orthologous group presence/absence map was plotted with ggplot2 (v. 3.3.3) (52).

CRISPR-Cas identification and spacer cataloguing

CRISPRCasFinder (53) was used to detect putative CRISPR-Cas systems in L. pneumophila genomes (Table S1) and CRISPRDetect (54) was used to determine the transcriptional direction of each CRISPR array. Spacers were extracted from each array and compiled to form spacer libraries. Redundant spacers were removed from the libraries to create a unique spacer library for downstream analyses. Rarefaction curves for the spacer library of each CRISPR-Cas sub-type were generated using Analytic Rarefaction (v. 2.2.1) using the default parameters (available from The curves were plotted using ggplot2 (v. 3.3.2) (52).

CRISPR-Cas target search

A pipeline to search for targets of L. pneumophila CRISPR-Cas was written as a Unix shell script. L. pneumophila genomes were first searched for putative type I-C, I-F and II-B CRISPR arrays using the repeat structure as a guide. If a CRISPR array was detected, it was subsequently masked to minimize false positive hits to the input genomes. A BLAST database was then constructed containing the Legionella genomes with masked CRISPR arrays to look for integrated genetic elements (see Table 1). The unique spacer library (see above) was queried against the BLAST database (BLASTn parameters: -gapopen 10 -gapextend 2 -reward 1 –penalty -1 -evalue 0.01 -word_size 7) (47). The BLAST search was done twice, once for each strand. The BLAST output was then processed to score each hit. The scoring metric subtracted the length of the alignment between a spacer and a hit from the length of the spacer, then added the number of mismatches between the two sequences. For example, an alignment length of 30, a spacer length of 32 and 0 mismatches would generate a score of 2. This was done to take both alignment length and number of mismatches into account when ordering the hits for downstream analyses (for example, so a hit with 0 mismatches and a short alignment length would not be ranked higher than a hit with 2 mismatches but an alignment length across the entire spacer).

  • View inline
  • View popup

Table 1.Target databases.

The databases used for spacer target searches for integrated elements and exogenous elements.

After scoring, the BLAST output was converted into a BED format. During this step, the putative target sequences were extended to account for alignment length, followed by the addition of a 3 nucleotide extension on either end of the sequence for downstream protospacer adjacent motif (PAM) filtering. SAMtools (55) and BEDTools (56) were used to extract the putative target sequences and their flanking regions from the input genomes. Finally, the pipeline searched for canonical PAMs for the type I-C, I-F and II-B systems in the flanking region of the putative target sequence. Although our pipeline searched for type I-C, I-F and II-B PAMs, we considered this secondary information and were intentionally PAM agnostic when assigning spacers to target groups to avoid missing potential targets that have since escaped through PAM mutations or those targeted by non-canonical PAMs. Analysis of the PAM sequences present in the target sequences showed that many target sequences did contain the canonical PAM (Figure S3).

The pipeline was also used to search for exogenous threats to the bacterium in viral, plasmid and metagenomic sequences and run as described above (data sets listed in Table 1). The output from the pipeline can be found in Table S6 for complete/draft genomes, and Table S7 for metagenomic data. CRISPR arrays containing spacers with ascribed targets were visualized with CRISPRStudio (57). PAM sequence logos for the hits were plotted using ggseqlogo (v 0.1) (58).

Major capsid protein phylogeny of targeted Microviridae genomes

A phylogeny of Microviridae major capsid proteins was generated using the amino acid sequences from 3014 genomes (Table S8). The amino acid sequences were aligned using MUSCLE (44) and the tree was generated with FastTree (v 2.1.11) using the default settings (45). The resulting tree was rooted to the Bullavirinae clade and visualized using the iTOL web server (46). The sub-family designations for each phage based on the literature were plotted as a colour-coded ring circling the tree (Table S8).

Sequence conservation and spacer mapping for repeat L. pneumophila CRISPR-Cas targets

Sequence conservation analysis and spacer mapping were performed on the Microviridae genomes that contained a top-hit target for a spacer that met the following stringency criteria: 5 or fewer mismatches with its protospacer, a canonical PAM for its CRISPR subtype, and no mismatches in the seed sequence region (positions 1-5, 7, 8). The VP1 (major capsid protein), VP2 (DNA pilot protein) and VP4 (replication initiation protein) genes were extracted from these genomes using Geneious Prime 2021 (available from:, and the sequences for each gene were aligned using the MUSCLE plug-in (44). The alignment, consensus sequence, sequence identity profiles and spacer location for each gene were plotted using GViz (v. 1.32.0) (59), binning the sequence conservation with a sliding window of 5.

Article TitleLegionella pneumophila CRISPR-Cas suggests recurrent encounters with Gokushovirinae


Legionella pneumophila is a ubiquitous freshwater pathogen and the causative agent of Legionnaires’ disease. This pathogen and its ability to cause disease is closely tied to its environmental encounters. From phagocytic protists, L. pneumophila has “learned” how to avoid predation and exploit conserved eukaryotic processes to establish an intracellular replicative niche. Legionnaires’ disease is a product of these evolutionary pressures as L. pneumophila uses the same molecular mechanisms to replicate in grazing protists and in macrophages of the human lung. L. pneumophila growth within protists also provides a refuge from desiccation, disinfection, and other remediation strategies. One outstanding question has been whether this protection extends to phages. L. pneumophila isolates are remarkably devoid of prophages and to date no Legionella phages have been identified. Nevertheless, many L. pneumophila isolates maintain active CRISPR-Cas defenses. So far, the only known target of these systems has been an episomal element that we previously named Legionella Mobile Element-1 (LME-1). In this study, we have identified over 150 CRISPR-Cas systems across 600 isolates, to establish the clearest picture yet of L. pneumophila’s adaptive defenses. By leveraging the sequence of 1,500 unique spacers, we can make two main conclusions: current data argue against CRISPR-Cas targeted integrative elements beyond LME-1 and the heretofore “missing” L. pneumophila phages are most likely lytic gokushoviruses.

IMPORTANCE The causative agent of Legionnaires’ disease, an often-fatal pneumonia, is an intracellular bacterium, Legionella pneumophila, that normally grows inside amoebae and other freshwater protists. Unfortunately for us, this has two major consequences: the bacterium can take what it has learned in amoebae and use similar strategies to grow inside our lungs; and these amoebae can protect Legionella from various forms of chemical and physical disinfection regimes. Legionella are ubiquitous in the environment and frequently found in man-made water systems. Understanding the challenges to Legionella survival before it reaches the human lung is critical to preventing disease.

We have leveraged our earlier discovery that L. pneumophila CRISPR-Cas systems are active and adaptive – meaning that they respond to contemporary threats encountered in the environment. In this way, CRISPR arrays can be considered genomic diaries of past encounters, with spacer sequences used to identify elements that may impinge on the pathogen’s survival. One outstanding question in the field is whether L. pneumophila is susceptible to phage, given the presumptive protection provided by intracellular replication within its eukaryotic hosts. In this work, we use CRISPR spacer sequences to suggest that the heretofore “missing” L. pneumophila phage are most likely lytic gokushoviruses. Such information is critical to the long-term goal of developing of new strategies for preventing colonization of our water systems by Legionella and subsequent human exposure to the pathogen.

Login or Signup to leave a comment
Find your community. Ask questions. Science is better when we troubleshoot together.
Find your community. Ask questions. Science is better when we troubleshoot together.

Have a question?

Contact or check out our support page.