Methods

A comprehensive evaluation of CRISPR lineage recorders using TraceQC

TraceQC pipeline

The TraceQC R package is available at https://github.com/LiuzLab/TraceQC. The tutorial of TraceQC is available at https://github.com/LiuzLab/TraceQC/wiki. The input of TraceQC pipeline are raw sequencing files and annotated barcode construct sequence. The workflow is shown in Fig. S13. The output is CRISPR-induced mutations and various plots of quality evaluation.

Sequencing alignment

The first step of TraceQC is to align reads to the reference sequence. The reference sequence must be annotated with a CRISPR targetable region because the raw reads typically contain adapter sequence. The Needleman-Wunch sequencing alignment with affine gap penalty is applied to each raw sequence. This alignment requires the user to provide four parameters: match, mismatch, gap open penalty, and gap extension penalty. The default parameters are: match = 2, mismatch = -2, gap open = -6, gap extension = -0.1. After the sequencing alignment, the CRISPR targetable region is selected for the next step.

Alignment quality evaluation

The alignment quality filter is used to remove reads that are contaminated. In every CRISPR lineage recorder, the barcode sequence is usually amplified and sequenced specifically. However, some samples could contain up to 20-40% reads that do not come from the CRISPR barcode. Using the alignment score difference, it is usually easy to separate the contaminated sequence from the mutated CRISPR barcode (Fig. S14C). To quantitively determine the contaminated sequence, a decoy sequence library is generated by randomly substitute ε percentage of the sequence. Next, TraceQC trains a local regression model between the substituted percentage and the alignment score. Finally, the sequence below the optimal substitute percentage can be selected.

Identify CRISPR-induced mutations

After the sequencing alignment, the CRISPR targetable region of each sequence is extracted. Next, TraceQC scans through the targetable region of each sequence and extracts every mutation. Basically, TraceQC extracts four properties for each mutation: mutation type (insertion, deletion, and substitution), starting position, indel length, altered sequence. Finally, the TraceQC summarizes every mutation into a count table.

Read merging

In single-cell RNA sequencing, a unique molecular identifier (UMI) is used to distinguish unique mRNA transcripts. The cell barcode is used to determine cell identity. Presumably, when the CRISPR barcode is transcribed, the mRNA transcript from each cell should be identical. Therefore, TraceQC provides merging functions for mutations in each cell. First, for each cell, reads with the same UMI are grouped together. The mutations that appears in more than 50% of reads are retained. During this step, UMI with a low read count should be filtered according to the guideline of the particular single-cell sequencing platform. Next, for each UMI in each cell, a second merger is applied to retain mutations that appear in more than 50% UMI.

QC plots

There are three main aspects that TraceQC evaluates: 1) The quality of sequencing alignment. Using alignment score, TraceQC removes the percentage of contaminated sequences. The method is described in the section: alignment quality evaluation. 2) Mutation patterns can be evaluated from the circular plot. Users can visualize the position of each mutation and identify the mutation hotspots. 3) For single-cell lineage recorders. TraceQC determines the read count distribution of UMI, which allows users to filter. Also, the mutation distribution across cells allows users to understand the lineage recording ability.

Processing of bulk DNA-seq datasets

In this study, we analyzed bulk DNA-seq datasets from Carlin, GESTALT, hgRNA-invitro, and hgRNA-invivo platforms. The bulk DNA-seq dataset of Carlin used paired-end sequencing. We merged read R1 and R2 using software FLASH with default parameters. Next, the dataset is processed as follows:

  1. The merged read is processed using the TraceQC alignment with the default parameter. We used ε = 0.4 to filter out contaminated sequences.
  2. We normalized each sequencing sample using count per million (CPM). Reads with CPM > 10 are retained.
  3. CRISPR-induced mutations are identified using TraceQC.

We processed a part of GESTALT’s in-vivo bulk DNA-seq dataset (GESTALT paper fig. 3). Although both GESTALT and Carlin used paired-end sequencing, the reads of GESTALT cover approximately 60% of the entire barcode, whereas Carlin covers 100%. Therefore, we processed the R1 and R2 of GESTALT separately using the same procedure. Finally, the mutation event of R1 and R2 are merged for each sample.

The hgRNA-invitro dataset used single-end sequencing. The complete dataset is processed using the same procedure as Carlin.

The hgRNA-invivo platform contains 60 independent CRISPR barcodes. A DNA identifier is assigned to each barcode which presumably cannot be edited by CRISPR. Therefore, we first used the 12 base pairs DNA sequence upstream to locate the DNA identifier. In this step, the Levenshtein distance of 1 is allowed for the 12-bp sequence. Next, the 10-bp directly downstream is extracted to match with the identifier. The sequences are grouped by each identifier and processed by the TraceQC pipeline using the same procedure as Carlin.

Processing of single-cell RNA-seq datasets

In this study, we analyzed single-cell RNA-seq datasets from Carlin, LINNAEUS, and scGESTALT. The single-cell dataset of Carlin used 10X Chromium sequencing. First, the raw sequences of CRISPR barcodes are processed with cell ranger V3.1, in which the error-corrected cell barcodes and UMI are identified. Next, the dataset is processed as follows:

  1. The merged read is processed using the TraceQC alignment with the default parameter. We used ε = 0.4 to filter out contaminated sequences.
  2. For each cell, reads are grouped by UMI. UMIs with read count < 10 are filtered out.
  3. CRISPR-induced mutations are identified for each read using TraceQC.
  4. To merge reads in each cell. First, for each UMI, mutations appear in more than 50% reads are retained. Second, for each cell, mutations that appear in more than 50% UMI is retained.

The LINNAEUS also uses 10X Chromium sequencing. First, the raw sequences of CRISPR barcodes are processed with cell ranger V2.0.2, in which the error-corrected cell barcodes and UMI are identified. However, the read length of 10X single-cell RNA-seq cannot cover the entire barcode region. Therefore, we performed semi-global sequencing alignment that does not penalize the end gap. The other processing pipeline is the same as single-cell Carlin.

The scGESTALT uses in-Drops single-cell RNA-seq. It applies the v7 barcode sequence of GESTALT. According to the annotation, we first used the 10 base pairs DNA sequence downstream to extract the UMI. The rest procedure is the same as single-cell Carlin.

Mutation properties analysis

Using the TraceQC identified mutations, we first calculated the number of unique mutations per sample. Each unique mutation is characterized by mutation type (insertion, deletion or substitution), starting index of the mutation according to the reference sequence, mutation length and altered sequence (only for insertion and substitution). The ternary plot shows the relative ratio of unique mutations in each type.

Time-series data analysis

Across all the datasets we analyzed, only Carlin and hgRNA-invitro have time-series data. Briefly, the Carlin platform used CRISPR to target the embryonic stem cells (ESCs) of mouse. In the one-time induction experiment, the ESCs were exposed with low (0.04 μg/ml), medium (0.20 μg/ml) and high (1.0 μg/ml) of Dox. Then, bulk DNA sequencing is performed at 0 hours (before Dox induction), 12-hours, 24-hours, 48-hours, 72-hours and 96-hours. In the pulsed-induction experiment of Carlin, the ESCs is exposed to three pulses of Dox (0.04 μg/ml) every 6 hours. After each exposure, cells were picked for sequencing and further outgrowth.

The experiment design of hgRNA-invitro is similar. In the one-time induction experiment, the 293T human cell line was exposed to Dox (2.0 μg/ml). Then, bulk DNA sequencing is performed at 0 days (before Dox induction), 2 days, and 14 days. In the pulsed-induction experiment of hgRNA, the 293T cells were first exposed to two hours of Dox to create the founder clone. Next, the Dox is removed while the founder clone grew into 6 clones. To further expand the cell populations, 100 cells are selected for further expansion by exposing them to Dox for two hours. In our study, we compared the mutation speed of one-time induction and pulsed induction of A21 barcode (Fig. 3B, 3C).

As for mutation speed analysis, we calculated the percentage of the unmutated sequence out of each sample. Next, we locate the PAM (NGG) sequence of both platforms. Next, we considered the PAM is mutated when either of the two guanines is mutated. When calculating the percentage of PAM for Carlin, all 10 PAMs are treated equally.

Mutation hotspot analysis

For GESTALT and Carlin target array, the construct sequence designs are similar in which each target (contains spacer and PAM) is separated by a short linker sequence. For both platforms, we defined each target as each consecutive sequence that contains spacer, PAM, and linker. To classify deletion into intra-target deletion and inter-target deletion, we simply determined if the starting position and ending position belong to the same target. As for the mutation position analysis, the heatmap and histogram show the average result of all samples.

Mutation dependency analysis

In the mutation dependency analysis, first, we selected an independent experiment of Carlin and hgRNA-invitro before induction because the CRISPR mutations in these samples are least confounded by cell development. Next, for Carlin, scCarlin, GESTALT, scGESTALT, we calculated the mutation dependency between samples that were not taken from the same animal/cell populations.

Article TitleA comprehensive evaluation of CRISPR lineage recorders using TraceQC

Abstract

The CRISPR-Cas9 genome editing-based lineage tracing system is emerging as a powerful tool to track cell lineages at unprecedented scale and resolution. However, the complexity of CRISPR-Cas9 induced mutations has raised challenges in lineage reconstruction, which requires a unique computational analysis framework. Meanwhile, multiple distinctive CRISPR-based high-throughput lineage recorders have been developed over the years in which the data analysis is incompatible across platforms. To address these challenges, first, we present the TraceQC, a cross-platform open-source package for data processing and quality evaluation of CRISPR lineage tracing data. Second, by using the TraceQC package, we performed a comprehensive analysis across multiple CRISPR lineage recorders to uncover the speed and distribution of CRISPR-induced mutations. Together, this work provides a computational framework for the CRISPR lineage tracing system that should broadly benefit the design and application of this promising technology.


Login or Signup to leave a comment
Find your community. Ask questions. Science is better when we troubleshoot together.
Find your community. Ask questions. Science is better when we troubleshoot together.

Have a question?

Contact support@scifind.net or check out our support page.