MATERIALS AND METHODS

Data sets. The NCBI RefSeq database, which contained 75,599 bacterial genomes at that time, was downloaded on 27 August 2017. The IMG/VR database (28), which contained 760,453 assembled viral/proviral contigs at that time, was downloaded on 15 March 2019.

Methods. A bioinformatics pipeline (Fig. 2) was developed to process the RefSeq and IMG/VR genomic data using a list of filters to identify putative Acr-Aca loci. These filters exploited sequence features extracted from a list of published Acr-Aca loci (http://bcb.unl.edu/AcrDB/Download/knownAcrAca/known-loci.xlsx; see also Text S1 in the supplemental material). This list contains representative proteins of 45 characterized Acr families as well as their associated Aca proteins.

In the data processing pipeline, two files of each RefSeq genome were processed: (i) a gene location file with the position and strand information of the protein-coding genes in the DNA (i.e., the gff format file) and (ii) a protein sequence file (i.e., the faa format file). For the IMG/VR contigs, FragGeneScan (37) was run first on the nucleotide genome file (the fna format file) to generate the gene location file and protein sequence file.

The pipeline comprised the following steps.

(i) We used the published 45 Acr proteins (http://bcb.unl.edu/AcrDB/Download/knownAcrAca/Acrs/) as queries to search against the 75,599 RefSeq bacterial genomes and the 760,453 metagenome-assembled viral contigs (∼3% are from isolated phages or prophages) of the IMG/VR database. To qualify as Acr homologs, proteins had to meet the following criteria: (a) E value of <1e−2 to known Acr proteins, (b) protein length of <200 amino acids (aa), and (c) more importantly, location of the Acr gene in a genomic locus (or operon) in which all genes encode short proteins (<200 aa) on the same strand. Then, HTH domain-containing proteins (Acas) were searched for in the gene neighborhood of Acr homologs (Fig. 2 shows the criteria).

(ii) We then combined these new Aca proteins with the 39 previously published Aca proteins (Text S1), and in total 401 Aca proteins (http://bcb.unl.edu/AcrDB/Download/knownAcrAca/Acas/) were used as queries to search against the 75,599 RefSeq bacterial genomes and the 760,453 metagenome-assembled viral contigs for Aca homologs (Fig. 2 shows the criteria). We then retained Aca homologs located in genomic loci (or operons) that encode only short proteins (<200 aa) on the same strand, have short intergenic distances (<150 bp), and contain at least one gene encoding an Aca homolog. These are the strongest sequence features revealed in Text S1.

(iii) The genomic loci were then examined to see if they were located within or adjacent to (±5 kb) mobile genetic elements (MGEs) such as prophages and genomic islands (GIs). Specifically, the genomic positions of the genomic loci were compared to the genomic locations of prophages in the PHASTER database (29) and to the genomic locations of GIs in the IslandViewer database (30).

(iv) The last step was to inspect whether genomes with the genomic loci from the previous step also have complete CRISPR-Cas systems and self-targeting CRISPR arrays. Specifically, Watters et al. (12) identified 22,125 self-targeting cases in 9,155 bacterial genomes (available in Data S1 of the paper by Watters et al. [12]). Genomes from the previous step that were also included in the paper by Watters et al. (12) thus contain self-targeting spacers and their targets and were kept for further analysis. Some genomes have incomplete CRISPR-Cas loci, e.g., having only CRISPR arrays, having only Cas enzymes, or missing some key Cas enzymes. Genomic loci from these genomes were removed. Additionally, only genomic loci that are colocalized with CRISPR self-targeted protospacers on the same contig/chromosome were kept.
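
The locus filters in steps (ii) and (iii) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Gene` record and the functions `candidate_loci` and `near_mge` are hypothetical names introduced here. The sketch also assumes that a locus is a maximal run of adjacent genes satisfying the stated thresholds, and that gene and MGE coordinates have already been parsed from the gff files and from the PHASTER or IslandViewer tables.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple

# Thresholds quoted in the text; the constant names are illustrative only.
MAX_PROTEIN_LEN = 200  # aa, proteins must be shorter than this
MAX_INTERGENIC = 150   # bp, gaps between adjacent genes must be shorter than this
MGE_FLANK = 5000       # bp, "within or adjacent to (±5 kb)" an MGE


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record (1-based, inclusive coordinates)."""
    gene_id: str
    start: int
    end: int
    strand: str       # "+" or "-"
    protein_len: int  # length in amino acids


def candidate_loci(genes: Sequence[Gene], aca_ids: Set[str]) -> List[List[Gene]]:
    """Return runs of adjacent short, same-strand, closely spaced genes
    that contain at least one Aca homolog (assumed locus definition)."""
    runs: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in sorted(genes, key=lambda g: g.start):
        is_short = gene.protein_len < MAX_PROTEIN_LEN
        extends = (
            is_short
            and bool(current)
            and gene.strand == current[-1].strand
            and gene.start - current[-1].end - 1 < MAX_INTERGENIC
        )
        if extends:
            current.append(gene)
            continue
        # The run is broken: store it and start a new one if this gene is short.
        if current:
            runs.append(current)
        current = [gene] if is_short else []
    if current:
        runs.append(current)
    return [run for run in runs if any(g.gene_id in aca_ids for g in run)]


def near_mge(locus_start: int, locus_end: int,
             mge_regions: Sequence[Tuple[int, int]],
             flank: int = MGE_FLANK) -> bool:
    """True if the locus overlaps any MGE interval extended by `flank` bp."""
    return any(
        locus_start <= mge_end + flank and locus_end >= mge_start - flank
        for mge_start, mge_end in mge_regions
    )
```

In use, `candidate_loci` would be called once per contig with the IDs of the Aca homologs found by the homology search, and `near_mge` would then be applied to the first and last coordinates of each returned locus. Whether a gene overlap should count as a zero-length intergenic distance is a design choice that the text does not specify; the sketch treats overlaps as passing the distance filter.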
Article Title: Bioinformatics Identification of Anti-CRISPR Loci by Using Homology, Guilt-by-Association, and CRISPR Self-Targeting Spacer Approaches