Materials and Methods

CRISPRCasdb a successor of CRISPRdb containing CRISPR arrays and cas genes from complete genome sequences, and tools to download and query lists of repeats and spacers

Database and software design and implementation

CRISPRCasdb and associated services are implemented in Microsoft .NET Core 2.2 (multiplatform web application framework), PostgreSQL 9.5 (RDBMS) and Python (database feeding and updates, BLAST job management). Both the database and the web server run on a single 4-core virtual machine, while a physical server with 64 cores and 128 GB of memory handles the computation for CRISPR and Cas detection and for BLAST jobs (34). Both machines run in a Linux environment (Ubuntu 16.04).

The core application consists of two main programs: CRISPRCasFinder, which detects CRISPRs and cas genes and extracts them from a genomic sequence, and ‘Database Tools’, which downloads prokaryotic genomes, metadata and taxonomy from the NCBI FTP site, runs the CRISPR and Cas detection scripts on the downloaded sequences, stores the results, and allows BLAST searches on the direct repeats (DRs) and spacers stored in the database. CRISPRCasFinder is a command-line tool written in-house in Perl. It is used to process published genome sequences and to feed the CRISPRCas database. It can also be run interactively through the web interface for submission and analysis of users' sequence data (28). ‘Database Tools’ is a set of Python and Perl scripts (the workflow is shown in Supplementary Figure S1). Downloading of genomic sequences, CRISPR and Cas detection, and motif extraction are fully automated.

The .NET Core framework, which provides a set of tools for object-oriented web programming and an integrated web server, is used to build a web resource on top of these programs. This preserves platform independence across multiple operating systems and allows the user to interact with the different CRISPR tools without computer programming or (shell) scripting skills.

The database (CRISPRCasdb)

CRISPRCasdb is a relational database implemented using PostgreSQL 9.5. The flowchart in Figure 1A summarizes the steps of database construction.
Supplementary Figure S2 shows the Unified Modeling Language (UML) class diagram, and Supplementary Figure S3 shows the relationships between tables. Currently, CRISPRCasdb is composed of 15 tables. A BLAST search against lists of repeats and spacers has been implemented (Figure 1B).

Figure 1. Workflow for the development of CRISPRCasdb. (A) Workflow for the recovery of genome sequences and associated data, CRISPRCasFinder calculation, storage and display of data. (B) Implementation of CRISPRCasdb-BLAST. Sequences provided in the output of CRISPRCasdb, CRISPRCasFinder, CRISPRCasMeta or directly submitted by users can be blasted against lists of repeats and spacers from the database.

The database is regularly updated by adding newly available genomes, and a version of the updater scripts allowing weekly updates is being developed. If a major revision of the CRISPRCasFinder program or of the associated HMM profiles is released, all the available genomes are downloaded and re-analysed when the database is updated. This allows the definition of structures to be improved regularly as new Cas types and subtypes are defined.

In June 2019, all ‘complete genome’ and ‘chromosome’ sequences publicly available in GenBank were recovered from NCBI (35) together with taxonomy information (36), and the database was built using the CRISPRCasFinder v4.2.19 program. The selected criteria require that the minimal structure of a putative CRISPR consist of at least two successive direct repeats with a maximum of one mismatch, separated by one spacer. Tests are performed to classify the putative CRISPR arrays with an evidence level from 1 to 4. CRISPRs of fewer than four spacers with three or more perfect repeats are assigned the lowest evidence level. The other CRISPRs are classified based on the conservation of repeats, which must be high in a real CRISPR array, and on the similarity between spacers, which must be low.
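
The minimal-structure criterion stated above (at least two successive direct repeats with at most one mismatch, separated by one spacer) can be illustrated with a short Python sketch. This is not the CRISPRCasFinder code, which is written in Perl; the function names, the use of a simple position-by-position mismatch count, and the assumption that compared repeats have equal length are all hypothetical choices made for illustration.

```python
from typing import Sequence


def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def has_minimal_crispr_structure(
    repeats: Sequence[str],
    spacers: Sequence[str],
    max_mismatches: int = 1,
) -> bool:
    """Return True if some pair of successive repeats satisfies the criterion.

    Assumptions (illustrative only): spacers[i] lies between repeats[i] and
    repeats[i + 1], and repeats of unequal length are treated as not matching.
    """
    for i in range(len(repeats) - 1):
        left, right = repeats[i], repeats[i + 1]
        # A non-empty spacer must separate the two repeats.
        if i >= len(spacers) or not spacers[i]:
            continue
        if len(left) != len(right):
            continue
        if count_mismatches(left, right) <= max_mismatches:
            return True
    return False
```

For example, `has_minimal_crispr_structure(["GTTTTAGA", "GTTTTAGC"], ["ACGTACGT"])` accepts the pair because the two repeats differ at a single position.
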
We measure CRISPR repeat conservation based on Shannon's entropy and produce an EBcons (entropy-based conservation) index (28). Level 4 CRISPRs are the most reliable ones, and levels 1, 2 and 3 must be considered with caution as they may correspond to false CRISPRs.

Putative Cas proteins are searched by sequence similarity using HMM protein profiles (15,23). The assignment of a protein to a given subtype is decided based on its compliance with the content and organization defined in each model (one per subtype) of CasFinder v2.0.3 (20,28). Subtypes of class 1 systems are detected from three genes, while class 2 systems require a single signature gene. Thus, if a class 1 cas gene cluster contains fewer than three genes, or if a cluster has an atypical content or organization, no subtype can be determined. In addition, if the content of the cluster is not informative enough to accurately determine the subtype, the system is called CAS. CRISPR arrays and cas clusters are detected independently of each other. Therefore, CRISPRs are reported whether or not cas genes are present, and vice versa.

A dump of the database content and lists of consensus repeats and spacers are provided on the website for download.
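
The idea of measuring repeat conservation with Shannon's entropy can be illustrated as follows. The text does not give the EBcons formula, so the sketch below is not EBcons: it computes the per-column Shannon entropy of equal-length repeats and rescales the mean to a 0 to 100 score, which is an assumed normalization chosen only for illustration. The function names and the restriction to the four-letter alphabet A, C, G, T are likewise assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def repeat_conservation(repeats: Sequence[str]) -> float:
    """Illustrative conservation score: 100 means all repeats are identical.

    Assumptions (not from the source): repeats are already aligned and of
    equal length, contain only A, C, G and T, and the mean column entropy is
    rescaled by the 2-bit maximum for a four-letter alphabet.
    """
    if not repeats:
        raise ValueError("at least one repeat is required")
    length = len(repeats[0])
    if length == 0 or any(len(r) != length for r in repeats):
        raise ValueError("repeats must be non-empty and of equal length")
    max_entropy = 2.0  # log2(4) for the alphabet A, C, G, T
    entropies = [
        column_entropy("".join(r[i] for r in repeats)) for i in range(length)
    ]
    mean_entropy = sum(entropies) / length
    return 100.0 * (1.0 - mean_entropy / max_entropy)
```

Under this normalization, a set of identical repeats scores 100, and the score decreases as columns become more variable.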


Abstract

The resource described here is accessible with no restrictions, except for the request to cite the site.

