CLARK
Fast, accurate and versatile sequence classification system

Computer Science & Engineering Department
University of California
Riverside, CA 92532

New version available (V1.3.0.0, May 2024)!

F.A.Q.


Q1:  What are the requirements to install and run CLARK?
  Unlike most other metagenomic classifiers, CLARK does not require any tool from the BLAST family, Jellyfish, or any other external tool. The main requirements are a 64-bit operating system (Linux or Mac) and the GNU GCC compiler, version 4.4 or higher. Multithreading relies on memory-mapped file functions and the OpenMP libraries. If these libraries are not installed, CLARK can only run in single-threaded mode.

Q2:  How do I install CLARK?
  First, download the package of the latest release (from the "Download" tab) and uncompress it (use the command "tar -xvf ./CLARKV1.3.0.0.tar.gz"). All scripts, source code, and instructions to install and run CLARK are in the sub-directory "CLARKV1.3.0.0". Go into this directory ("cd ./CLARKV1.3.0.0/").
Second, follow the installation instructions in the README file: execute "./install.sh", and that's it! The installer will build the binaries (CLARK, CLARK-l and CLARK-S).

Q3:  Can I install and run CLARK on my Windows system?
  For the moment, we provide the source code so that CLARK can be installed and run on Linux or Mac OS. However, we are currently working on a version for recent versions of Windows (Vista/7/8/10). If you want to receive an email invitation to download the Windows version of CLARK as soon as it is ready, please send a request by email to clark.ucr.help at gmail.com.

Q4:  What are the differences between CLARK and CLARK-l?
  CLARK may use a significant amount of RAM, especially when building the set of discriminative k-mers. In 2016, we ran CLARK on a Dell PowerEdge T710 server (192 GB of RAM) for all our experiments (on all barley sequences and the metagenomic sequences described in the paper). Up to 156 GB are needed to build the discriminative k-mers for all the bacterial/archaeal and viral genomes from NCBI/RefSeq. In the classification stage, CLARK uses 59 GB of RAM in default mode.
If you want to analyze a small metagenome on a laptop with 4 GB of RAM, we recommend CLARK-l. Our experiments show that, on the same set of objects to classify, CLARK-l runs at about 50% of CLARK's speed and has a much lower sensitivity, because it uses only 4% of the memory needed by CLARK. Keep in mind that CLARK-l does not achieve the best accuracy or speed, and we do not recommend it as a replacement for CLARK or CLARK-S.
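
As a rough pre-flight check, the memory figures quoted in this F.A.Q. can be collected into a small lookup. This is only a sketch: the table and function names are hypothetical, the 4 GB entry is the laptop size mentioned above rather than a measured requirement, and the 59 GB and 108 GB entries are the classification-stage figures reported in this F.A.Q. for CLARK and CLARK-S (building the CLARK database needs up to 156 GB).

```python
# Classification-stage RAM figures quoted in this F.A.Q. (in GB).
# The 4 GB value for CLARK-l is an assumption based on the laptop example.
CLASSIFICATION_RAM_GB = {"CLARK-l": 4, "CLARK": 59, "CLARK-S": 108}


def variants_that_fit(available_gb):
    """Return the variants whose quoted RAM figure fits in `available_gb`."""
    return [
        name
        for name, needed_gb in CLASSIFICATION_RAM_GB.items()
        if available_gb >= needed_gb
    ]


if __name__ == "__main__":
    for ram in (8, 64, 192):
        print(ram, "GB ->", variants_that_fit(ram))
```

Actual memory use depends on the database and the k-mer length, so treat the result as a first guess only.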

Q5:  How do I choose the right k-mer length and other parameter values to run CLARK?
  This is a very important question. The k-mer length has a critical impact on classification performance. As detailed in the paper, if the k-mer length is too small or too large, the sensitivity becomes too low. Our experiments show that 19 to 22 is an optimal range for high sensitivity when working with a database of bacterial genomes. However, for high precision, you should use longer k-mers (k > 26).
Version v1.* supports k-mer lengths up to 32, and we recommend that users work with k=31, the default value, for high precision and speed.
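
To see one reason the length matters, recall that a read of length L contains L - k + 1 contiguous k-mers, so a larger k leaves fewer k-mers per read. The following minimal sketch (the helper name `kmers` and the toy read are illustrative, not part of CLARK) counts them for two values of k:

```python
def kmers(sequence, k=31):
    """Return every contiguous k-mer of `sequence`, from left to right."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    read = "ACGT" * 10  # toy 40 bp read
    for k in (20, 31):
        print("k =", k, "->", len(kmers(read, k)), "k-mers")
```

With this 40 bp toy read, k=20 gives 21 k-mers while k=31 gives only 10, so each mismatch removes a larger share of the usable k-mers at high k.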

Q6:  CLARK is a taxonomy-dependent binning tool. Does CLARK need (draft/finished) genome sequences as reference sequences?
  Not at all. The reference sequences can be reads, contigs, scaffolds, etc. as well as draft/complete reference sequences. In our paper, we worked with reads for the assignments of barley BACs/unigenes.

Q7:  How does CLARK break ties if a read/k-mer can be assigned to more than one organism at the species level? For example, if two species locally share the same genomic sequence and my read shares some k-mers with this genomic sequence, how does the algorithm decide which species to use for the assignment?
  This is an excellent question. This is precisely why CLARK considers only discriminative k-mer matches (i.e., k-mers shared by multiple species are ignored). Thus, if a read is a sequence common to multiple species, CLARK will not assign it (how could it?).
However, with chimeric contigs or naming issues in the reference sequences, a sequence S may share k-mers specific to species A (e.g., at the beginning of S) and also k-mers specific to species B (at the end of S). In that case, CLARK is unlikely to assign S at the species level, and if it does, the confidence score and/or the gamma score will be low. This is why we introduced the concept of "high confidence assignments" in the CLARK paper. To handle this problem, we recommend that you run CLARK (or CLARK-S) in full mode and set high thresholds for both the confidence score and the gamma score.
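
The idea can be illustrated with a toy sketch. It is not CLARK's implementation: the function names are hypothetical, and the two score formulas are assumptions made for illustration (confidence = h1 / (h1 + h2), where h1 and h2 are the hit counts of the best and second-best targets; gamma = total hits / number of k-mers in the read). Please check the README for the exact definitions.

```python
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def discriminative_index(references, k):
    """Map each k-mer found in exactly one target to that target."""
    owners = {}
    for target, seq in references.items():
        for kmer in set(kmers(seq, k)):
            owners.setdefault(kmer, set()).add(target)
    # k-mers shared by several targets are dropped.
    return {km: next(iter(t)) for km, t in owners.items() if len(t) == 1}


def classify(read, index, k):
    """Return (best target or None, confidence, gamma) for one read."""
    read_kmers = kmers(read, k)
    hits = Counter(index[km] for km in read_kmers if km in index)
    if not hits:
        return None, 0.0, 0.0  # only shared or unknown k-mers: unassigned
    ranked = hits.most_common(2)
    h1 = ranked[0][1]
    h2 = ranked[1][1] if len(ranked) > 1 else 0
    confidence = h1 / (h1 + h2)                   # assumed definition
    gamma = sum(hits.values()) / len(read_kmers)  # assumed definition
    return ranked[0][0], confidence, gamma


# Two toy "species" that share the region ACGTACGT.
references = {"A": "AAAAAAAA" + "ACGTACGT", "B": "CCCCCCCC" + "ACGTACGT"}
index = discriminative_index(references, 4)

if __name__ == "__main__":
    for read in ("ACGTACGT", "AAAAAAAC", "AAAAACCCCC"):
        print(read, classify(read, index, 4))
```

In this toy setting the read taken from the shared region stays unassigned, while the chimeric read "AAAAACCCCC" is assigned to A with a confidence of only 0.6, which a high threshold would reject.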

Q8:   Unlike other classifiers, CLARK works at one level of the taxonomy tree. What taxonomy level should I work at when running CLARK? Is CLARK better at the genus level than at the species level?
  It depends. The default taxonomy rank is species, as it is the most important taxonomy level. If you are interested in a genus-level classification, define the targets at the genus level (CLARK will then systematically assign at the genus level only). However, the genus level (or species level) may be such a deep classification level that CLARK cannot classify some reads (because they come from organisms that are novel/unknown or too distant from any target defined in the database). If CLARK does not classify your reads at the species or genus level, you can repeat the classification at higher levels (e.g., family or phylum).
However, we do not recommend that users run CLARK at several taxonomy ranks. Again, species is the default rank and we recommend that users work only with it. If CLARK cannot classify a read at the species level, keep in mind that (i) other tools such as Kraken are unlikely to classify it either, and (ii) using a larger database will likely be the best way to classify it.

Q9:  For each variant (i.e., CLARK, CLARK-l and CLARK-S), several modes of execution are available. What mode should I use?
  It depends on your data and your objectives. There are four modes: "full", "express", "spectrum" and "default". CLARK and CLARK-l support all of them, while CLARK-S supports only the full and express modes.
The full mode of CLARK (option "-m 0") loads all discriminative k-mers in RAM and provides useful statistics for post-processing the assignments computed during classification (e.g., confidence score and gamma score). It also offers high sensitivity: in metagenomics, the best sensitivity is obtained with k=20, 21 or 22, which is why we recommend these values to users interested in the full mode. In addition, users must filter results and consider only high confidence assignments (e.g., assignments with confidence score ≥ 0.75 and gamma score ≥ 0.03). This filtering is preset for you with the option "--highconfidence" in the script estimate_abundance.sh.
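
If you post-process assignments yourself, the filtering rule above amounts to two threshold tests. The sketch below assumes the assignments have already been parsed into dictionaries; the keys "confidence" and "gamma" and the function name are hypothetical and do not reflect the actual column names of the CLARK results file.

```python
def high_confidence(assignments, min_confidence=0.75, min_gamma=0.03):
    """Keep assignments that meet both thresholds quoted in this F.A.Q."""
    return [
        a for a in assignments
        if a["confidence"] >= min_confidence and a["gamma"] >= min_gamma
    ]


if __name__ == "__main__":
    # Made-up records, for illustration only.
    parsed = [
        {"read": "r1", "confidence": 0.98, "gamma": 0.40},
        {"read": "r2", "confidence": 0.60, "gamma": 0.40},  # low confidence
        {"read": "r3", "confidence": 0.90, "gamma": 0.01},  # low gamma
    ]
    print([a["read"] for a in high_confidence(parsed)])
```

Raising either threshold trades sensitivity for precision.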

If your only concern is speed (i.e., you do not need high sensitivity, detailed results, or confidence scores), then use the express mode ("-m 2"), the fastest mode. The sensitivity will be lower than that of other modes (e.g., full).

If you do not know what mode to use, then use the default mode ("-m 1"). It runs fast, is precise, and uses less RAM than other modes (because it does not load in memory all the discriminative k-mers stored on disk for the classification). However, the sensitivity offered by the default mode is lower than that of the full mode.

If your data are not in fasta/fastq files but in spectrum form (i.e., a two-column file listing each k-mer and its count), then you must use the spectrum mode ("-m 3") (see README file).
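
For readers unfamiliar with the term, a spectrum is simply a table of k-mers and their counts. The sketch below shows how such a table could be computed from sequences. The function names are hypothetical, and the exact layout expected by CLARK (separator, ordering, handling of reverse complements) is an assumption here; please follow the README for the real format.

```python
from collections import Counter


def spectrum(sequences, k):
    """Count every contiguous k-mer over all `sequences`."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def format_spectrum(counts):
    """Render counts as two columns: k-mer, then count (assumed layout)."""
    return "\n".join(f"{kmer} {n}" for kmer, n in sorted(counts.items()))


if __name__ == "__main__":
    print(format_spectrum(spectrum(["ACGTAC", "CGTACG"], 3)))
```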

Q10:   What is the difference between CLARK and CLARK-S?
  Unlike CLARK or CLARK-l, CLARK-S uses multiple spaced k-mers (instead of contiguous k-mers). Because it loads in memory the spaced k-mers derived from three spaced seeds, CLARK-S consumes more RAM than CLARK (108 GB instead of 59 GB). However, CLARK-S achieves both high precision and high sensitivity at the same time (see the peer-reviewed publication).
Finally, unlike CLARK, CLARK-S produces results that are already filtered (i.e., assignments with confidence score < 0.75 or gamma score < 0.06 are rejected). However, you can apply stricter filtering to get more precise results (see the options of the script estimate_abundance.sh).
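
The following toy sketch shows why spaced k-mers tolerate mismatches. A spaced seed is a mask in which "1" marks a position that is kept and "0" a position that is ignored. The short seed used here is hypothetical and chosen only for readability; it is not one of the three seeds used by CLARK-S.

```python
def spaced_kmers(sequence, seed):
    """Apply `seed` ('1' = keep, '0' = ignore) at every offset of `sequence`."""
    keep = [i for i, flag in enumerate(seed) if flag == "1"]
    span = len(seed)
    return [
        "".join(sequence[start + i] for i in keep)
        for start in range(len(sequence) - span + 1)
    ]


def shared(a, b, seed):
    """Count offsets where both sequences give the same (spaced) k-mer."""
    return sum(x == y for x, y in zip(spaced_kmers(a, seed), spaced_kmers(b, seed)))


SEED = "1101011"        # hypothetical spaced seed (span 7, weight 5)
CONTIGUOUS = "1111111"  # an all-ones mask gives ordinary 7-mers

reference = "ACGTACGTAC"
read = "ACTTACGTAC"     # one mismatch, at position 2

if __name__ == "__main__":
    print("contiguous matches:", shared(reference, read, CONTIGUOUS))
    print("spaced matches:    ", shared(reference, read, SEED))
```

Here the single mismatch leaves only one matching contiguous 7-mer out of four, whereas two spaced k-mers still match, because the mismatch falls on an ignored position in one more window.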

Q11:  I have a powerful server/workstation: When should I use CLARK-S instead of CLARK?
  First, CLARK-S is specifically designed for metagenomic samples, unlike CLARK, which remains versatile across various contexts of sequence analysis. Second, if you analyze a metagenomic sample from a poorly known microbial habitat such as seawater (i.e., the RefSeq database likely lacks genomes of organisms present in your sample), then only a limited fraction of your reads will be identified by CLARK (because it requires exact matching on long k-mers), while CLARK-S will classify more reads and detect more organisms (the related research work is under review). In that case, we recommend CLARK-S because it allows mismatches in k-mer queries. You can select it with the option "--spaced" in the classify_metagenome.sh script.

Q12:  Where can I download the simulated datasets you used in your experiments?
  We used the three simulated datasets from the Kraken project. We also used a fourth dataset, simHC.20.500, which is described in our paper and available here.
The zipped datasets used in our preliminary work presented at WABI'15 are available here: A1.10.1000 and B1.20.500.

Q13:  I have an error and can't run CLARK on my own sequences. Can you help?
  Yes, please send us an email at clark.ucr.help at gmail.com (see directions in the Download tab). You may also visit the GoogleGroup of CLARK Users.

Q14:  How can I get access to the synthetic datasets you built in the CLARK-S paper?
  The fourteen synthetic datasets can be downloaded from this Dropbox folder here.