CLARK
Fast, accurate and versatile sequence classification system

Computer Science & Engineering Department
University of California
Riverside, CA 92532

Overview


This webpage provides a summary of the CLARK software tool: its advantages, its installation process, how to run it, and how to analyze its results. For a complete overview of the classification system (e.g., k-mer-based method, sequence analysis, DNA sequences, speed, efficiency, etc.), we strongly recommend consulting the publication and/or the README file available in the zipped package.

Introduction


 CLARK is a software tool for classifying any type of DNA/RNA sequences in any format (reads, contigs, scaffolds, etc.), generically called "objects", against a set of reference sequences (reads, contigs, assemblies, etc.), called "targets". Advantages:
  • Interoperability: CLARK runs on recent Mac and Linux systems. Written in C++, CLARK is self-contained and needs only the GCC compiler (version 4.4 or higher) to be installed.
  • Versatility: CLARK can deal with DNA/RNA sequences in several contexts of classification, especially metagenomics and genomics. It supports single-end/paired-end reads and sequences in FASTA/FASTQ format (optionally gzipped).
  • High accuracy: Our experimental results (cf. publication) show that CLARK's precision is better than that of the best state-of-the-art methods when classifying metagenomic reads at the genus/species level. Similar results were observed in genomics, when classifying BACs/transcripts to chromosome-arms/centromeres in the case of the barley genome (cf. publication).
  • Ultra speed: Classification speed of CLARK in the context of metagenomics is unmatched. CLARK can take advantage of a multi-core architecture and scales better than its closest competitor.
CLARK can provide useful statistics for post-processing the results. It is user-friendly: running it requires neither external tools nor a strong background in programming/bioinformatics.
Finally, it is distributed under the GNU GPL and is fully supported.

Requirements & Installation


  First, make sure you have a 64-bit OS (Mac/Linux) and your GNU GCC compiler version is 4.4 or higher.

Second, download the zipped package of the latest CLARK release (i.e., v1.2.6.1), available from the CLARK webpage ("Download" tab). Alternatively, fetch it from the command line:
$ wget http://clark.cs.ucr.edu/Download/CLARKV1.2.6.1.tar.gz

Third, uncompress the package:
$ tar -xzvf CLARKV1.2.6.1.tar.gz

The manual and source code will be extracted into the sub-directory "CLARKSCV1.2.6.1". Change into this directory.
Finally, run the installer:
$ ./install.sh
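
Before running the installer, the two prerequisites stated above (64-bit OS, GCC 4.4 or higher) can be checked with a short helper. This is a minimal sketch and not part of the CLARK package: the function names are hypothetical, it assumes a `sort` that supports `-V`, and on Mac `gcc` may be an alias for another compiler whose reported version is not a GCC version.

```shell
# version_ge A B: succeeds when dotted version A is >= B.
# Assumes a `sort` that supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_prereqs: print a short report on the architecture and on gcc.
# The list of 64-bit architecture names is illustrative, not exhaustive.
check_prereqs() {
  case "$(uname -m)" in
    x86_64|amd64|arm64|aarch64) echo "64-bit architecture: ok" ;;
    *) echo "warning: $(uname -m) may not be a 64-bit architecture" ;;
  esac
  if command -v gcc >/dev/null 2>&1; then
    gcc_version="$(gcc -dumpversion)"
    if version_ge "$gcc_version" 4.4; then
      echo "gcc $gcc_version: ok"
    else
      echo "gcc $gcc_version is older than 4.4"
    fi
  else
    echo "gcc not found"
  fi
}
```

Calling `check_prereqs` from a terminal prints one line per check; it only reports and does not modify anything.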

Features available


  The installer builds the binaries (CLARK, CLARK-l and CLARK-S, in the folder "exe"). You can use these binaries for any "object"-to-"target" classification project. In genomics, an object can be a read/contig/scaffold/transcript/BAC and a target can be any genomic sequence (e.g., chromosome, full genome, plasmid, etc.). We detail in the README file how to define the targets (and the centromeres, in the case where reference sequences are chromosome-arms) and how to run the executables.
In metagenomics, an object can be a read/contig/sequence/scaffold/transcript and a target is a genome (e.g., a bacterial genome).

In addition, several scripts useful for the classification of metagenomic samples are available (presented in the next section). In particular:
  • set_targets.sh and classify_metagenome.sh: They let you classify metagenomic samples against one or several databases that you specify. You can choose genomes from bacteria/archaea or viruses (NCBI/RefSeq database), and human (latest GRCh38). They will be downloaded automatically if needed. In addition, you can use any database stored locally on your machine.

  • buildSpacedDB.sh: It creates the sets of discriminative spaced k-mers from the selected databases (cf. section "Running CLARK-S" below).

  • clean.sh: It permanently erases all data downloaded/generated inside the database directory defined at the last call of "set_targets.sh".

  • estimate_abundance.sh: It computes the abundance estimation (count and proportion for each target identified in CLARK results), and can produce input for the Krona executable ktImportTaxonomy, as well as results in the mpa format.

  • evaluate_density_confidence.sh, evaluate_density_gamma.sh: They take as input one or several CLARK results files (containing confidence scores) and output/plot the distribution of assignments per confidence score or gamma score.

  • resetCustomDB.sh: It resets the targets definition with the sequences (newly added/modified) of the customized database. Any call of this script must be followed by a run of set_targets.sh.

  • updateTaxonomy.sh: It updates your local copy of the taxonomy data (taxonomy id, accession numbers, etc.) with the latest version from the NCBI website.

  • makeSummaryTables.sh (new): It builds spreadsheets summarizing results from several CLARK report files. Given a rank r and a minimum abundance level a (i.e., the ratio between the number of reads classified to a taxon and the total number of reads in the report file), expressed in percent, this script produces the following report files:
    "TableSummary_per_Report.csv":
    For each report file, this table indicates the number of reads, the number of assigned or classified reads, the alpha-diversity (using the Shannon index) of the sample, the minimum abundance level used, the percentage of reads classified, and the r most dominant taxa/organisms with abundance higher than a (ordered by decreasing abundance). The taxa are reported with scientific name and NCBI taxonomy id.
    "TableSummary_per_Taxon.csv":
    For each taxon/organism identified across all the report files, this table indicates the taxonomy id of the taxon, the domain of the taxon, the minimum abundance level used, the number of reports in which the taxon was found (within the top r taxa with abundance higher than a), and some statistics about the abundance of the taxon (i.e., minimum, maximum, average and standard deviation, in percent).
    "TableSummary_HitCount.csv":
    This table provides the hit count distribution of the targets identified per report, as well as the total number of reads, the total number of reads classified and the alpha-diversity.

  • getTargetsKmers_distribution.sh: It prints out the distribution of the target-specific k-mers in a file entitled "targets.distribution.csv" in the working database. The parameters for this script are the key parameters used for creating the database, i.e., the k-mer length and the minimum k-mer frequency (default 0).

  • extractSequences.sh (new): It extracts/filters from the input data the sequences assigned to a specific taxon once the classification is done. This can be used for downstream analysis (analysis of contaminants, genome assembly, etc.).
    Parameters are: the taxonomy id (of the taxon to look up), the path of the file containing sequences (fasta/fastq format), the path of the CLARK results file (csv file), and the minimum thresholds for the gamma value and the confidence score (in the case where the CLARK results file contains these statistics).
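
To make the two quantities reported by makeSummaryTables.sh concrete (the abundance level of a taxon and the Shannon index of a sample), here is a minimal sketch that computes them from plain read counts. It is not the script's own code: the function names are hypothetical, and the use of the natural logarithm in the Shannon index is an assumption, since the base is not stated here.

```shell
# shannon_index: read one read count per line on stdin and print the
# Shannon index H = -sum(p_i * ln(p_i)), with p_i = count_i / total.
# The natural logarithm is an assumption made for this sketch.
shannon_index() {
  awk '$1 > 0 { n[NR] = $1; total += $1 }
       END {
         h = 0
         for (i in n) { p = n[i] / total; h -= p * log(p) }
         printf "%.4f\n", h
       }'
}

# abundance_percent COUNT TOTAL: ratio between the reads classified to a
# taxon (COUNT) and the total number of reads (TOTAL), in percent.
abundance_percent() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * c / t }'
}
```

For example, `printf '50\n50\n' | shannon_index` prints 0.6931 (two equally abundant taxa), and `abundance_percent 25 200` prints 12.50.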

Classification of metagenomic samples


  CLARK can classify metagenomic samples accurately and quickly against one or several databases (downloaded from NCBI or stored locally on your disk), and analyze the results for you thanks to the scripts we present in this section.
First, we briefly present "set_targets.sh" and "classify_metagenome.sh". Second, we present the script to get the abundance estimation (count and proportion), "estimate_abundance.sh", from one or several results file(s). For more details, please refer to the README file available in the CLARK package.

  Setting and classification
We assume that the previous step of installation was successfully done.

       Selecting the database(s)
 Choose/create a directory to store the database(s). Let's name it generically "DIR_DB", for clarity. This directory can be anywhere on your disk(s), independently of the CLARK source code.
$ mkdir DIR_DB

Then, indicate what database(s) to consider for the classification. You can select among 'bacteria', 'viruses', 'plasmid', 'plastid', 'protozoa', 'fungi', 'human' and/or 'custom'. Note that when choosing 'bacteria', the complete genomes from RefSeq of both bacteria and archaea will be integrated. For example, to classify against bacteria/archaea genomes only:
$ ./set_targets.sh DIR_DB bacteria
The bacteria (and archaea) genomes from NCBI/RefSeq will be downloaded if they are not present in DIR_DB.

To classify against bacteria/archaea, viruses, fungi and human:
$ ./set_targets.sh DIR_DB bacteria viruses fungi human

Similarly, bacteria, viruses, fungi or human will be downloaded if not present in DIR_DB.

       Creating/Updating the Custom database
 To work only with a custom database, you will need to copy/move the sequences of interest into the directory "Custom", inside DIR_DB. These files must be fasta files with the accession number in the header so that they can be identified. Each fasta file must contain sequences for only one reference sequence or taxon. In summary:
1) create the directory "Custom" inside DIR_DB (if it does not exist yet),
2) copy/move sequences of interest into the Custom folder,
3) run:
$ ./set_targets.sh DIR_DB custom
Later on, if you want to work with a different customized database (for example, after removing sequences from the Custom folder or adding more sequences of interest to it), then the targets definition must be reset. So, after the sequences in the Custom folder have been updated, just run:
$ ./resetCustomDB.sh
Then, run set_targets.sh with the desired settings (as explained in the previous paragraph).
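
Steps 1 and 2 above can be sketched as a small helper that creates the "Custom" directory and copies only files that start with a fasta header line. This is an illustrative sketch, not part of CLARK: the function name is hypothetical, and it does not verify that the header actually contains an accession number.

```shell
# prepare_custom_db DIR_DB FASTA...: copy fasta files into DIR_DB/Custom
# after checking that each file starts with a fasta header line ('>').
prepare_custom_db() {
  db_dir="$1"; shift
  mkdir -p "$db_dir/Custom" || return 1
  for fasta in "$@"; do
    first_char="$(head -c 1 "$fasta")"
    if [ "$first_char" != ">" ]; then
      echo "skipping $fasta: first line is not a fasta header" >&2
      continue
    fi
    cp "$fasta" "$db_dir/Custom/"
  done
}
```

After a call such as `prepare_custom_db DIR_DB genomeA.fa genomeB.fa` (hypothetical file names), step 3 is run as described above.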

       Setting the taxonomy rank
 CLARK defines the targets against a single taxonomy rank chosen by the user. Objects are classified against all taxa (from the database(s) you selected), which are all defined at the same taxonomy level. The default taxonomy rank is species. To change the taxonomy rank to genus, for example, the command line is (from the example above selecting only bacteria):
$ ./set_targets.sh DIR_DB bacteria --genus

To classify your objects at the phylum level instead:
$ ./set_targets.sh DIR_DB bacteria --phylum

There are six ranks available: --species (the default value), --genus, --family, --order, --class or --phylum.

       Running the classification
 To classify your metagenomic samples (say, "sampleA.fa") with default parameters of CLARK:
$ ./classify_metagenome.sh -O sampleA.fa -R resultA

where "resultA" is a filename to store the results.

The option "-O" in front of sampleA.fa indicates to the program that sampleA.fa is the file containing objects to classify, and, similarly, the option "-R" in front of resultA indicates where to store the results. This command will build the database in DIR_DB using default parameters, if the database has not been created yet. Building a database will take about 3 hours to complete, then the classification per se will run and produce the file resultA.csv (the extension '.csv' is appended to the filename since results are in CSV format).

       Processing multiple samples/datasets
The program can process multiple samples/datasets once the database is loaded. In other words, unlike other classifiers, you do not need to run the program N times if there are N samples/datasets to process. CLARK can load the database with your settings once and then classify as many datasets as needed.
For example, if you want to annotate six datasets (sample1.fa, sample2.fa, ..., sample6.fa), then you can store the paths (physical locations on your disk) of these files into one file called "samples.txt", such that:
$ cat samples.txt
sample1.fa
sample2.fa
sample3.fa
sample4.fa
sample5.fa
sample6.fa

and then simply run:
$ ./classify_metagenome.sh -O samples.txt -R samples.txt

Once the computations are done, the program will have created six results files (CSV format) associated with the samples: sample1.fa.csv, sample2.fa.csv, sample3.fa.csv, sample4.fa.csv, sample5.fa.csv, and sample6.fa.csv.

If you want the results files to have different names, say "result1.csv", "result2.csv", ..., "result6.csv", then you can store these names into a file "results.txt", such that:
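
A list file such as samples.txt does not have to be written by hand. The following is a minimal sketch (the helper name is hypothetical, and `find -maxdepth` is assumed to be available, as it is with GNU and BSD find) that collects the matching files of a directory, one path per line:

```shell
# make_list DIR PATTERN OUT: write the paths of the regular files in DIR
# whose names match PATTERN to OUT, one per line, in sorted order.
make_list() {
  find "$1" -maxdepth 1 -type f -name "$2" | sort > "$3"
}
```

For example, `make_list . 'sample*.fa' samples.txt` lists every file of the current directory whose name matches sample*.fa.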
$ cat results.txt
result1
result2
...
result6
and then run:
$ ./classify_metagenome.sh -O samples.txt -R results.txt

This scalable way of annotating multiple datasets works for single-end reads or paired-end reads. In the case of paired-end reads, you must provide two files (each containing the paths of the files for the right/left reads). For example, if you have three datasets (sample1, sample2, and sample3) of paired-end reads, then you can create "samples.R.txt" and "samples.L.txt" such that:
$ cat samples.R.txt
sample1.R1
sample2.R1
sample3.R1
$ cat samples.L.txt
sample1.R2
sample2.R2
sample3.R2

You can run CLARK on these datasets of paired-end reads (with option "-P"):
$ ./classify_metagenome.sh -P samples.R.txt samples.L.txt -R results.txt

where, results.txt is:
$ cat results.txt
result1
result2
result3

Results will be stored in the files "result1.csv", "result2.csv" and "result3.csv", following the names listed in results.txt.
Alternatively, to name the results after the input datasets ("sample1.R1.csv", "sample2.R1.csv" and "sample3.R1.csv"), simply run:
$ ./classify_metagenome.sh -P samples.R.txt samples.L.txt -R samples.R.txt
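
Since the two list files in the example above correspond line by line (presumably the i-th entry of one file is paired with the i-th entry of the other), a quick sanity check before launching a long run is to confirm that both lists have the same number of entries. A minimal sketch, with a hypothetical helper name:

```shell
# check_pairs LIST_A LIST_B: succeed only when both lists contain the
# same, non-zero number of lines, so that every read file has a mate.
check_pairs() {
  n_a="$(awk 'END { print NR }' "$1")"
  n_b="$(awk 'END { print NR }' "$2")"
  if [ "$n_a" -eq 0 ] || [ "$n_a" -ne "$n_b" ]; then
    echo "list mismatch: $1 has $n_a lines, $2 has $n_b" >&2
    return 1
  fi
  echo "$n_a paired datasets"
}
```

With the lists shown above, `check_pairs samples.R.txt samples.L.txt` prints "3 paired datasets". The same check can be applied to the list of results names.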


You can change the parameters (e.g., the k-mer length, the mode of execution, the variant, the number of parallel threads, etc.) or specify options for your data (e.g., compressed files).

To see the full list of options/parameters available, run:
$ ./classify_metagenome.sh

Please refer to the README file for details on the options and for the provided examples.

  Analyzing results
 We assume that the previous step of setting/classification was successfully done.

The script "estimate_abundance.sh" can analyze the raw CLARK results of a metagenomic sample and provide, for each target identified, the count and proportion of objects (reads/sequences) assigned to it.
In addition, this script can convert CLARK's assignments so that they can be processed by the Krona browser tool (see option "--krona").

To run the estimation, for resultA.csv:
$ ./estimate_abundance.sh -F resultA.csv -D DIR_DB

The "-F" indicates that resultA.csv is the file containing results and "-D" indicates where the database is (to find the taxonomy data and load scientific names of taxa).
You can pass several results files at a time. This script also offers options to filter out classified objects. To see all options available:
$ ./estimate_abundance.sh

Please refer to the README file for details about all options and for examples showing how to set them.

Running CLARK-S


  The current release of CLARK-S exploits discriminative spaced k-mers (see the peer-reviewed publication). As with CLARK or CLARK-l, the classification can be done at the phylum, genus or species level, for example. Before classifying your metagenomic sample, you must first create the database of discriminative 31-mers (e.g., from bacteria/archaea genomes) and then the databases of discriminative spaced 31-mers (using the script "buildSpacedDB.sh").

Step 0: Set the database, for example:
$ ./set_targets.sh DIR_DB bacteria --phylum
or
$ ./set_targets.sh DIR_DB bacteria viruses --species

where "DIR_DB" is the directory to store/copy reference sequences and the database. We recommend working with bacterial and viral genomes, at the species level.

Step 1: Create the discriminative 31-mers of the database you have defined in step 0 (if they do not exist already). This can be done by running the default variant CLARK on the sample of your choice, for example:
$ ./classify_metagenome.sh -O sample.fa -R result

Step 2: Create the databases of discriminative spaced k-mers:
$ ./buildSpacedDB.sh
This task will take several hours to complete (6 to 7 hours).

To classify your metagenome (e.g., sample.fq) with CLARK-S, you need to indicate "--spaced":
$ ./classify_metagenome.sh -O sample.fq -R result --spaced

CLARK-S produces results with statistics (in particular, confidence and gamma scores); see the README file.

Important notes:
  • Database change: If you want to work with a different database or taxonomy rank, then please restart the procedure described above from step 0.

  • Lower memory: CLARK-S uses about 108 Gbytes of RAM with default settings. If you want it to use less memory then you can use the sampling factor to set the proportion of data to load in memory. For example, you can set the program to load only half of the discriminative spaced k-mers with the option "-s 2":
    $ ./classify_metagenome.sh -O sample.fq -R result --spaced -s 2

  • Higher speed: In default settings, CLARK-S runs slower than CLARK. To increase the speed, you can run it in parallel (option "-n <numberOfThreads>") or you can run it in express mode (option "-m 2"):
    $ ./classify_metagenome.sh -O sample.fq -R result --spaced -m 2
    or when using 4 threads:
    $ ./classify_metagenome.sh -O sample.fq -R result --spaced -m 2 -n 4
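
The whole CLARK-S procedure above (steps 0 to 2, then the classification with "--spaced") can be summarized in one place. The sketch below only prints the commands, in order, for a given database directory, sample and result name, so that they can be reviewed before committing to a run of several hours. The function name is hypothetical, the choice of "bacteria viruses --species" follows the recommendation above, and arguments containing spaces are not handled.

```shell
# clark_s_commands DIR_DB SAMPLE RESULT: print, in order, the commands of
# steps 0-2 and the final CLARK-S classification. Nothing is executed.
clark_s_commands() {
  db_dir="$1"; sample="$2"; result="$3"
  echo "./set_targets.sh $db_dir bacteria viruses --species"
  echo "./classify_metagenome.sh -O $sample -R $result"
  echo "./buildSpacedDB.sh"
  echo "./classify_metagenome.sh -O $sample -R $result --spaced"
}
```

For example, `clark_s_commands DIR_DB sample.fq result` prints the four commands, which can then be run one by one from the CLARK directory.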

Other applications


  CLARK can be used for purposes other than the classification of metagenomic samples or BACs/transcripts. Here is a non-exhaustive list of tasks CLARK can perform:
  • Identification of antimicrobial resistance (AMR) using a database of AMR markers, and similarly identification of virulence factors.
  • Identification of chimeras and vector contamination in sequenced BACs.
  • Contaminants detection (e.g., bacterial genomes) in draft/finished reference genomes.
  • Definition of centromeric regions using sequences of flow sorted chromosome-arms.
  • Definition of genomic signatures within any taxa.
  • Assignments correction in BAC libraries.

Support


  CLARK is actively supported. Found an error or a bug? Please report it and help us improve the tool (see instructions in the "Download" tab).
Are you a CLARK user? Please join the Google group!
Feel free to share any comments and/or suggestions for additional features.