Materials and Methods

PINCER: improved CRISPR/Cas9 screening by efficient cleavage at conserved residues

MATERIALS AND METHODSConstruction of a CRISPR/Cas9 sgRNA feature databaseWe extracted all NGG-associated guide sequences and coordinates from the human and mouse reference genomes (hg38, mm10), including unplaced contigs but excluding alternate haplotypes (56). Although early reports suggested that S. pyogenes Cas9 also recruits to NAG PAMs, more recent work has shown Cas9’s affinity is much stronger to NGG than all other PAM sequences, so we only considered NGG sgRNAs (13,57).Annotation of intrinsic guide featuresWe scored guide cleavage efficacy using the Microsoft Azimuth 2.0 implementation of the Broad Rule Set 2 score (Python v2.7.13, Numpy v1.12.1, training data V3_model_nopos.pickle) (13). Next, we aligned each guide back to the genome with BatMis v3.00 (a Burrows–Wheeler transform aligner similar to BWA or Bowtie but with more detailed alignment reporting), requiring the ‘GG’ PAM dinucleotide and allowing up to 5000 alignments with up to four mismatches and aggregated these into a single specificity score using our own implementation of (Hsu et al.) (12,58). We repeated this process using alignments to known protein-coding sequence to generate a proteome-wide specificity score, using Gencode-basic genes with transcript support level 1 (human v27, mouse vM15) (59). We tracked whether guides exceeded 5000 alignments, and the number of alignments with 0–4 mismatches. Considering complete sgRNA scaffolds (Supplementary Table S19), we also tracked maximum homopolymer lengths in each guide (polyA, polyC, polyG, polyT) and whether guides contained inadvertent Esp3I restriction sites (CGTCTC). Finally, we predicted each guide's melting temperature using the Bioconductor HELP library's implementation of (Allawi and SantaLucia), which in our hands was much faster than the popular Primer3 tool and while offset by about 10°C produced nearly identical rank-ordered results (Supplementary Figure S4) (60–62).Annotation of guide target featuresIn order to annotate features of guide targets we mapped feature tracks to the genome and intersected them with guide cut sites. We started by annotating principal isoforms (APPRIS, v108 human, v106 mouse) of Refseq protein-coding genes (v109) (63,64). We tracked and considered targeting consensus protein sequence across isoforms, but favored principal isoforms after observing that low-confidence short isoforms hinder domain targeting (data not shown). Next, we tracked each guide's position in the target's coding sequence (%CDS, alternatively referred to as 3’%), the percentage of frameshifts capable of inducing NMD by our own implementation of ‘the 50-nt rule,’ the ‘symmetry’ of each exon (whether removal of the exon will result in frame-shifting of downstream exons), and via amino-acid coordinates, conserved domain database (CDD) regions (typically domains—human: accessed 16 February 2018, mouse: accessed 23 February 2018), CDD sites (typically catalytic residues, interaction sites, etc.) and Uniprot secondary structure (alpha helices, beta sheets, turns and coiled-coil regions, v2017_09) (40,65–66). We annotated exonic splicing enhancers (‘ESEs’) by searching for 84 known enhancer sequences within principal isoform exons (the ‘INT3’ set) (67). Finally, we appended genome-anchored information including Pfam domains (generated using University of California, Santa Cruz (UCSC) genes, v2017–04-07), and common polymorphisms expected to disrupt any position in the 23 nt guide+PAM (≥10% variant allele frequency (VAF) in dbSNP v147, human only) (68–70). In this manuscript we use ‘domain’ to mean ‘a region of amino acids in a protein, identified by ortholog comparisons and potentially annotated with known or predicted function’ (e.g. CDD, PFam), ‘protein secondary structure’ to mean ‘a region of amino acids in a protein known or predicted to form a local biochemical structure’ (e.g. Uniprot alpha helices) and ‘protein conservation’ to mean ‘an amino-acid-resolution score which tracks invariability across known orthologs’ (e.g. our novel conservation score).Generation of a novel genome-wide conservation scoreWe downloaded data from a published experiment in which the homology-dependent repair (HDR) competency of a series of BRCA1 RING domain mutations was experimentally tested (71). These data included mutation effect predictions from public tools (SIFT, Polyphen, CADD and GERP), and a specific prediction combining experimentally measured binding affinity of BRCA1 to its heterodimer BARD1 and E3 ligase activity of the BRCA1–BARD1 complex (‘HDR_predictions’). We then predicted mutation effects using PROVEAN, a conservation-based variant effect predictor which compares alignment of the wild type protein to an ortholog set versus alignment of the mutant protein to the same ortholog set (72). We included predictions using PROVEAN’s web app, and using the command-line tool with either the 2011 BLAST non-redundant database (used in the PROVEAN publication), or the current 2017 version (Supplementary Table S7) (73).Next, we compared configurations of PROVEAN to nucleotide conservation metrics and their correlation with phenotypic readout of sgRNA activity. First, we aggregated public sgRNA tiling data for the essential gene PLK1 (from Munoz et al., average of three cell lines), the MAPK-pathway-inhibitor resistance gene MED12 (from Doench et al., average of two inhibitors) and the contextually essential mouse gene Smarca4 (from Shi et al.) (see section ‘Model Training Dataset’ for more details) (13,47,74). Next, we ran PROVEAN while modulating three parameters in 2 × 2 × 3 = 12 configurations. These parameters were: single-amino-acid deletions versus substitutions (averaging all 20), the position-weight matrix BLOSUM62 (default) versus a custom identity matrix (scored: +5 for identical AAs and −2 for different) and three ortholog sets: the 2011 BLAST non-redundant (‘NR’) database (used in the original PROVEAN publication), the 2011 BLAST non-redundant database with low-evolutionary-distance proteins ‘pruned’ (‘NRp,’ specifically removing synthetic, Human, Orangutan, Macaque and Chimpanzee proteins), or manually curated ortholog sets from Homologene (Supplementary Table S8) (73,75). Next, we used the UCSC table browser to download the nucleotide conservation tracks phyloP, phastCons and phastConsElements for human (hg38, 100way alignments) and mouse (mm10, 60way alignments), downloaded multiple sequence alignments for the manually curated ortholog sets from Homologene and computed percent identity at each reference residue, and computed moving averages for phyloP (3, 6, 9 nt windows), phastCons (3, 6, 9 nt), and homologene (0–4aa) (Supplementary Table S9) (76).Next, we compared simulated deletions of varying lengths using PROVEAN. Protein sequences for PLK1, MED12 and Smarca4 were acquired from NCBI, the effect of the deletion of every amino acid (1AA del), every pair of amino acids (2AA del) and every trio of amino acids (3AA del) was assessed using the PROVEAN web app, and multiplied by −1 such that larger positive numbers correspond to higher conservation (Supplementary Figure S23 and Table S26).Finally, to generate a genome-wide protein conservation score, we used PROVEAN and Blast NR v2011 to predict the detrimental impact of the deletion of every single amino acid in APPRIS principal isoforms of Refseq genes (human and mouse) and multiplied by −1 such that larger positive numbers correspond to higher conservation. We named this score AADelCons, for amino acid deletion conservation.Model training datasetWe assembled a comprehesive sgRNA activity training dataset, spanning seven individual datasets of sgRNA activity measured by log2-fold-changes (LFCs) from five publications, in which genes with clear experimental readouts were tiled by sgRNAs in an unbiased fashion. For datasets where increased sgRNA activity was measured as enrichment (either by drug resistance, or by flow-sorting marker-negative populations), we inverted these into negative LFCs ((-)LFCs) as indicated for internal consistency of the training dataset. Likewise, for datasets including sgRNAs targeting genes which were not expected or observed to show activity (e.g. negative controls), we filtered to genes either known to be essential (Achilles 17Q4) or achieving a mean-sgRNA LFC <−1 in all cell lines tested and list exclusions below (8). The sgRNA datasets were: (i) (-)LFCs in flow-sorted antigen-negative cells for cell-surface human genes (15); (ii) (-)LFCs in flow-sorted antigen-negative cells for cell-surface mouse genes (excluding Cd2, Cd3e and Cd53, which the original authors excluded, H2-K which was not targeted specifically and sgRNAs at near-zero abundance in controls) (15); (iii) (-)LFCs in inhibitor-resistant cells in three pooled synthetic lethality proliferation screens (excluding TOP2A, CDK6, MLH1, MSH2, MSH6 and PMS2, which the original authors excluded, CLDN10 which the original authors do not describe and shows no activity with any drug and sgRNAs the original authors QC-flagged) (13); (iv) (-)LFCs in inhibitor-resistant cells in a pooled synthetic lethality proliferation screen (excluding non-exonic sgRNAs) (77); (v) LFCs in a pooled proliferation screen (filtered for essential genes, and excluding non-exonic sgRNAs) (77); (vi) LFCs in a pooled proliferation screen (filtered for essential genes) (74); (vii) LFCs in a pooled proliferation screen (47); see Supplementary Table S10 for a full gene list. We further filtered these input sgRNA<>LFC measurements (n = 43 307) to sgRNAs targeting coding sequence in our database (n = 41 611), converted LFCs to z-scores on a per experiment and per gene basis (see Supplementary Tables S11 and 12 for feature descriptions and data) and filtered to sgRNAs which target specifically (specificity score > 0.50 and zero off-targets in the genome with ≤1 mismatch; n = 27 508).Gene set definitions for validation experimentIn order to compare the activity of multiple CRISPR libraries across three cell lines (described later), we defined lists of genes expected to be essential to all lines (‘essential genes’), essential to one or two lines (‘cell line-selective genes’), or essential to no lines (‘negative control genes’).For essential genes, we intersected lists identified by RNA interference (the ‘CCE’ set from Hart et al., n = 217) and CRISPR (the ‘CEG’ set from Hart et al., n = 684), reasoning that multiple evidence types would yield a robust and conservative essential gene set (n = 133) (37,78). We observed good agreement between the two lists, which bolstered our confidence in using the intersection between the two lists and the cutoff of −log10(p) > 3 for our validation experiment (Supplementary Figure S13). We labeled 33 of these for which we had prior screening experience suggesting their strong essentiality, 30 of which were significant in Hart et al. (37)’s Bayes factor essentiality scores (P < 0.001 in a t-test versus 0). Next, noting a discrepancy between apparent essential structural proteins versus enzymes, we annotated the domain architecture of each gene. We defined ‘broad domain’ genes as having a single domain covering >90% of the protein sequence (representing 12% of genes in the genome), and ‘narrow domain’ genes as having either multiple domains covering <85% of the protein sequence, or a single domain covering <50% of the protein sequence (representing 52% of genes in the genome). Finally, we ranked genes by our known significant essential list and the mean Hart ’17 Bayes factor, and selected the top 18 in each domain group for this experiment (see Supplementary Table S14 for scores related to this process).For cell line-selective genes, we downloaded and joined CRISPR and copy number variation (CNV) data from Achilles (18Q3, n = 17 520 genes) for DLD1, HS766T and HCC1428. We identified pairwise differential vulnerabilities by a CERES score difference greater than 0.5 and with copy-neutrality in both lines (2.5 > copy number > 1.5) (n = 177 total genes). Of these, we automatically included all kinases (n = 5) and transcription factors (n = 9) in the experiment. Using the remaining 163 genes, we identified the top two differential vulnerabilities in both directions for each line versus the other two as: CERES difference >0.50 for each cell line versus cell line comparison, CERES <−0.50 in sensitive lines, CERES > −0.30 in non-sensitive lines (except in one case which didn’t yield two genes) and taking the top two based on mean CERES difference (n = 12) (see Supplementary Table S15 for scores related to this process).For negative control genes, we used the Achilles data (18Q3) to identify copy-neutral genes in all three cell lines with low mean CERES scores (|CERES| < 0.2) and low standard deviations across cell lines (sd < 0.15). Of these, we arbitrarily selected fifteen solute carriers (SLCs) and 13 olfactory receptors (ORs) (see Supplementary Table S16 for scores related to this process).Plasmid construction and sgRNA cloningTo construct sgRNA expression vectors, 83-nt DNA sequences were assembled by flanking 20nt guide protospacers (Supplementary Table S19) with Esp3I restriction sites and polymerase chain reaction (PCR) primer templates (Supplementary Table S20). Lyophilized and cleaved 83mers were ordered from Agilent (5000 oligo ‘SureGuide Unamplified Custom CRISPR Library’, Part Number: G7555B), re-suspended in water and amplified by PCR. Next, guides were excised by Esp3I at 21°C for 3 h and ligated into a modified version of Tarumoto et al.’s LRG2.1 vector, in which green fluorescence protein was replaced by puromycin resistance (pac) using BamHI+BsrGI (54,55). Finally, vectors were electroporated into four replicates of STBL4 cells and pooled.A constitutive Cas9 expression vector was constructed as follows. First, an empty lentiviral vector with ampicillin resistance (pLV7-Empty) was constructed by restricting pLenti7.3/V5-DEST (Thermo Fisher Scientific, #V53406) with ClaI and Acc65I, then re-ligating by duplexed oligonucleotides (IDT, Supplementary Table S19). The EF-1 Alpha short (EFS) promoter was added (pLV7-EFS), by PCR amplifying EFS, restricting pLV7-Empty with XbaI and BamHI, and ligation. Next, the reverse-complement DNA sequence encoding human-optimized 3× FLAG-tagged Cas9 was extracted from its source publication (79). Cas9 DNA was synthesized in three fragments with flanking BamHI and NheI restriction sites (ATUM, DNA2.0), 2A self-cleaving peptide sequence (P2A) oligonucleotides were synthesized (IDT, Supplementary Table S19) and the hygromycin resistance gene (hph, Hygro) was synthesized with flanking Esp3I and MluI restriction sites (ATUM, DNA2.0). Finally, Cas9, P2A and Hygro were ligated together, pLV7-EFS was restricted by BamHI and MluI, and Cas9-P2A-Hygro was inserted by ligation to generate the final vector (pLV7-EFS-Cas9-P2A-Hygro). sgRNA and Cas9 cloning vectors were illustrated using AddGene's Giraffe software, (Supplementary Figure S16) (80).Cell cultureCell lines were purchased from the American Type Culture Collection (ATCC). HEK293T, DLD-1 and HCC1428 were maintained in Dulbecco's modified Eagle's medium (DMEM) + 10% fetal bovine serum (FBS) and Hs776T was maintained in Roswell Park Memorial Institute media (RPMI) + 10% FBS, with HEK293T also receiving high glucose, pyruvate and 1% penicillin/streptomycin (Thermo Fisher Scientific), and HCC1428 also receiving 1% HEPES. All cells were maintained in humidified incubators at 37°C and 5% CO2. Cell lines stably expressing Cas9 were generated by lentiviral transduction at a low multiplicity of infection (MOI) to introduce a single Cas9 copy per cell and selected using hygromycin at 250 μg/ml (DLD1) or 400 μg/ml (Hs766T and HCC1428). Cells expressing sgRNA were selected using puromycin at 2 μg/ml.Lentiviral transductionLentivirus was produced by transfecting cells with helper plasmids. Briefly, 10 mg of plasmid DNA, 5 mg of VSVG, 7.5 mg of psPAX2 and 40 mL of 1 mg/ml PEI were mixed, incubated and added to the 10 cm plate HEK293T cells. Media was replaced 6–8 h post-transfection. Virus was collected 48 and 72 h post-transfection and pooled.Cell lines were infected with virus-containing plain medium for 24 h. Medium was then replaced with puromycin containing medium to select for transduced cells, and incubated for 48 h. The MOI was determined at 72-h after infection by comparing the percent survival of infected cells to separate pools of cultured non-infected control cells. Optimal infection conditions were determined for each batch of virus prep in each cell line. Volumes of virus that yielded ∼10–30% infection efficiency were used for screening.Pooled screening and sequencing for sgRNA abundanceFor all screens, cells with stable Cas9 expression were infected in three biological replicates per cell line with lentiviral sgRNA pools at a representation of 230–1000 cells per sgRNA at a MOI of 0.06 for Hs766T, 0.15 for HCC1428 and 0.30 for DLD1. MOI values varied by the infectivity of each cell line, but all obtained MOIs were below the target of 0.30. Cells were selected in the presence of puromycin, and a sample was collected 3 days post-selection as a reference representation of the pooled sgRNA library (except HCC1428). Cells were propagated for a total of 14 or 21 (HCC1428 only) days with an average representation of >1000 cells/sgRNA maintained at each passage. Cells were harvested for genomic DNA extraction (DNeasy Blood and Tissue kit, Qiagen Cat#69506). sgRNA inserts were first PCR amplified using primary PCR primers (Supplementary Table S20) on the vectors and purified by QIAquick PCR Purification Kit (Qiagen, Cat#28106). The PCR products from the genomic DNA were then further amplified using secondary PCR primers (Supplementary Table S20) harboring Illumina TruSeq adapters i5 and i7 barcodes. The 300 bp PCR product were purified by gel extraction (Qiagen, Cat#28706). The resulting fragments were sequenced on a MiSeq™ (Illumina) with standard primers for dual indexing. The sequencing recipe we used included 33 ‘dark cycles’ of base incorporation without imaging, followed by 21 light cycles with two indices.Screen analysisFollowing sequencing, sgRNAs were counted with a custom pipeline based on Bowtie and Oculus (81,82). Reads per million mapped reads (RPMs) were computed for each sample, then normalized to the total abundance of guides targeting negative control genes (across libraries, such that the normalization factor for each cell line was the same) to account for variation between cell lines. sgRNA-level log2-fold-changes (LFCs), gene-level LFCs and gene-level coefficient of variation across sgRNAs (CV = standard deviation/mean) were computed with custom R code and gene-level beta values were computed with MAGeCK-VISPR’s maximum likelihood estimation model using negative control normalization (83). We plotted the distributions of counts in each gene category over time (Supplementary Figure S17). Next, we noted and plotted gene-level hit discrepancies between what we expected from Achilles versus what we observed. We defined an observed hit as reaching a mean dropout of more than 2-fold (LFC <−1) in at least one library. We included only genes both expected and observed to drop out in subsequent inter-library comparisons (‘hits’ in Figure ​Figure44 and Supplementary Figures S19 and S20). Finally, we assessed library performance using four different metrics: sgRNA LFCs, Gene LFCs, Gene-level beta values (from MAGeCK) and intra-gene CVs for all hits on a per-cell-line, per-library basis and plotted using custom R code (Supplementary Figure S19). T-tests were performed for each metric between each library and PINCER. We repeated this analysis with PINCER in a pseudo-4sgRNA/gene format, by analyzing only the top four sgRNAs selected by PINCER (picks i–iv) (Supplementary Figure S20). We also repeated this analysis for sgRNAs targeting negative control genes (olfactory receptors and solute carriers), omitting the negative control normalization step for both LFCs and MAGeCK, for PINCER in a six-guide format (Supplementary Figure S21) or four-guide format (Supplementary Figure S22).Open in a separate windowFigure 4.PINCER outperforms other libraries in effect size and robustness. CRISPR screens were conducted in the human cancer cell lines DLD1, HCC1428 and HS766T using pooled sgRNAs from four libraries (PINCER, TKOv3, Avana and Brunello) targeting pan-essential, selectively essential and negative control genes. Violin plots show gene-level log2-fold-changes, of reads per million mapped reads normalized to negative control reads (combining all libraries), between plasmid and endpoint for essential and cell-line-selective genes, and intra-gene coefficient of variation (across sgRNAs targeting the same genes), with paired and unpaired t-tests against the PINCER library.

Article TitlePINCER: improved CRISPR/Cas9 screening by efficient cleavage at conserved residues


The human and mouse genome-wide PINCER libraries, guide databases, conservation scores and all code are available on Github: (

Login or Signup to leave a comment
Find your community. Ask questions. Science is better when we troubleshoot together.
Find your community. Ask questions. Science is better when we troubleshoot together.

Have a question?

Contact or check out our support page.