Methods

SpacePHARER: Sensitive identification of phages from CRISPR spacers in prokaryotic hosts

Input

SpacePHARER accepts spacer sequences either as multiple FASTA files, each containing the spacers from a single prokaryotic genome, or as multiple output files from the CRISPR detection tools PILER-CR [7], CRT [5], MinCED [13] or CRISPRDetect [4]. Phage genomes are supplied as separate FASTA files or can be downloaded by SpacePHARER from NCBI GenBank [2]. Optionally, taxonomic labels can be provided for spacers or phages and are then included in the final report.

Algorithm

SpacePHARER is divided into a preprocessing step followed by five steps (Figure 1A, Supp. Materials). (0) Preprocess input: scan the phage genomes and CRISPR spacers in six reading frames, then extract and translate all putative coding fragments of at least 27 nt, using user-definable translation tables. Each query set Q consists of the translated ORFs q of the CRISPR spacers extracted from one prokaryotic genome, and each target set T comprises the putative protein sequences t from a single phage. We refer to a similar pair of q and t as a hit, and to an identified host-phage relationship QT as a match. (1) Search all q against all t using the fast, sensitive MMseqs2 protein search [14], with the VTML40 substitution matrix [10], a gap open cost of 16 and a gap extension cost of 2 (Figure S1). We optimized a short, spaced k-mer pattern for the prefilter stage (10111011) with six informative (‘1’) positions. In addition, align all qt hits reported by this search at the nucleotide level and prioritize near-perfect nucleotide hits (Supp. Materials). (2) For each qT pair, compute the P-value of the best hit, p_bh, from first-order statistics. (3) Compute a combined score S_comb from the best-hit P-values of multiple hits between Q and T using a modified truncated-product method (Supp. Materials). (4) Compute the false discovery rate (FDR = FP / (TP + FP)) and retain only matches with FDR < 0.05. For that purpose, SpacePHARER is run on a null model database, and the fraction of null matches with S_comb below a cut-off (empirical P-value) is used to estimate the FDR. (5) Scan 10 nt upstream and downstream of the phage’s protospacer for a possible PAM.
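Three of the steps above are simple enough to illustrate in a few lines of Python. The following is a minimal sketch, not SpacePHARER's implementation: the function names are hypothetical, the best-hit formula assumes that the n hit P-values of a qT pair are independent and uniformly distributed under the null model (one common reading of "first-order statistics"; the exact procedure is in the Supp. Materials), and the coordinates are assumed to be 0-based and end-exclusive.

```python
SPACED_PATTERN = "10111011"  # prefilter pattern from the text; '1' marks an informative position


def spaced_kmers(seq, pattern=SPACED_PATTERN):
    """Step (1): list the spaced k-mers of seq, keeping only residues under a '1'."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return [
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    ]


def best_hit_pvalue(hit_pvalues):
    """Step (2): P-value of the smallest of n hit P-values.

    Assumption: the n P-values are independent and uniform on [0, 1] under
    the null, so the minimum satisfies P(min <= p) = 1 - (1 - p)^n.
    """
    n = len(hit_pvalues)
    p_min = min(hit_pvalues)
    return 1.0 - (1.0 - p_min) ** n


def protospacer_flanks(genome, start, end, flank=10):
    """Step (5): return the (upstream, downstream) flanks of genome[start:end].

    Assumption: start/end are 0-based and end-exclusive; flanks are shorter
    than `flank` nt when the protospacer lies near a sequence end.
    """
    upstream = genome[max(0, start - flank):start]
    downstream = genome[end:end + flank]
    return upstream, downstream


# A 27-nt fragment translates to 9 residues, which gives only two windows of span 8.
kmers = spaced_kmers("ACDEFGHIK")
p_bh = best_hit_pvalue([0.01, 0.5, 0.2])
flanks = protospacer_flanks("AAAAACCCCCGGGGGTTTTT", 5, 10, flank=3)
```

The first function also shows why a short pattern matters here: with an 8-residue span, even the shortest accepted fragment (27 nt, 9 residues) still contributes two k-mers to the prefilter.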

FIG 1.

(A) SpacePHARER algorithm. A query set Q consists of 6-frame translated ORFs (q) from CRISPR spacers, and a target set T consists of 6-frame translated ORFs (t) of phage proteins. (1) Search all q against all t using MMseqs2. Align the qt hits at the nucleotide level and prioritize near-perfect nucleotide hits. (2) For each qT pair, compute the P-value of the best hit from first-order statistics. (3) Compute the score S_comb by combining the best-hit P-values from multiple hits between Q and T using a modified truncated-product method. (4) Estimate the FDR by searching a null database. (5) Scan for a possible protospacer adjacent motif (PAM). (B) Performance comparison between SpacePHARER (blue) and BLASTN (red) using inverted phage sequences (solid lines) or eukaryotic viral ORFs (dashed lines) as the null set, shown as the expected number of true positive (TP) predictions at different false discovery rates (FDRs). (C) Performance comparison between BLASTN (left) and SpacePHARER using the weighted lowest common ancestor procedure (LCA, right) at FDR = 0.02, evaluated by the number of correct (blue) and incorrect (red) predictions, for all host predictions made at each taxonomic rank or below.

Output

The output is a tab-separated text file. Each host-phage match spans two or more lines. The first line starts with ‘#’ and lists the prokaryote accession, phage accession, S_comb and number of hits in the match. Each following line describes an individual hit: spacer accession, phage accession, p_bh, spacer start and end, phage start and end, possible 5’ PAM|3’ PAM, and possible 5’ PAM|3’ PAM on the reverse strand. If requested, the spacer–phage alignments are included.
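A file with this layout can be read with a short parser. The sketch below is illustrative only: it assumes the column order given above, that the fields of the ‘#’ line are tab-separated directly after the ‘#’, and that any columns after the phage end position (the PAM fields and, if requested, the alignment) can be kept as raw strings. The class names, the function name and the sample accessions and values are invented for the example and are not SpacePHARER output.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    spacer: str
    phage: str
    p_bh: float
    spacer_start: int
    spacer_end: int
    phage_start: int
    phage_end: int
    extra: list = field(default_factory=list)  # PAM columns and optional alignment, unparsed


@dataclass
class Match:
    host: str
    phage: str
    s_comb: float
    n_hits: int
    hits: list = field(default_factory=list)


def parse_matches(lines):
    """Group hit lines under the preceding '#' match line (assumed layout)."""
    matches = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            host, phage, s_comb, n_hits = line[1:].strip().split("\t")[:4]
            matches.append(Match(host, phage, float(s_comb), int(n_hits)))
        elif matches:
            f = line.split("\t")
            matches[-1].hits.append(
                Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                    int(f[5]), int(f[6]), f[7:])
            )
    return matches


# Invented sample in the assumed layout: one match with two hits.
sample = [
    "#HOST_1\tPHAGE_1\t3.2e-08\t2",
    "HOST_1_spacer_1\tPHAGE_1\t1.5e-05\t1\t30\t1200\t1229\tAAG|TTC\tGAA|CTT",
    "HOST_1_spacer_2\tPHAGE_1\t4.0e-04\t1\t27\t5400\t5426\tCCA|GGT\tACC|TGG",
]
matches = parse_matches(sample)
```

Keeping the trailing columns unparsed avoids guessing at the exact PAM and alignment encoding; they can be split further once the real file has been inspected.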

If taxonomic labels are provided, taxonomic reports based on the weighted lowest common ancestor (LCA) procedure described in [9] are created as additional tab-separated text files, either for the host LCAs of each phage genome or for the phage LCAs of each spacer.


Abstract

SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes.

