This webpage provides a global summary of the CLARK software tool: its advantages,
its installation process, and how to run it and analyze results.
For a complete overview of the classification system (e.g., k-mer-based method, sequence analysis, DNA sequences, speed, efficiency, etc.),
we recommend consulting the publication and/or the README file available in the zipped package.
CLARK is a software tool for classifying any type of DNA/RNA sequences in any format (reads, contigs, scaffolds, etc.), generically called "objects",
against a set of reference sequences (reads, contigs, assemblies, etc.), called "targets".
Advantages:
- Post-processing: CLARK can provide useful statistics for post-processing the results.
- User-friendly: CLARK does not require any external tools or a strong background in programming/bioinformatics to run.
- Interoperability: CLARK runs on recent Mac and Linux operating systems. Written in C++, CLARK is self-contained and needs only the GCC compiler (version 4.4 or higher) to be installed.
- Versatility: CLARK can deal with DNA/RNA sequences in several contexts of classification, especially metagenomics and genomics. It supports single-end/paired-end reads and sequences in FASTA/FASTQ format (optionally gzipped).
- High accuracy: Our experimental results (cf. publication) show that CLARK's precision is better than that of the best state-of-the-art methods when classifying metagenomic reads at the genus/species level. Similar results were observed in genomics, when classifying BACs/transcripts to chromosome-arms/centromeres, in the case of the barley genome (cf. publication).
- Ultra speed: Classification speed of CLARK in the context of metagenomics is unmatched. CLARK can take advantage of a multi-core architecture and scales better than its closest competitor.
Finally, it is distributed under the GNU GPL license and is fully supported.
Requirements & Installation
First, make sure you have a 64-bit OS (Mac/Linux) and
your GNU GCC compiler version is 4.4 or higher.
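The two requirements above can be checked before downloading anything. The following is a minimal sketch in POSIX shell; the function names `version_ge` and `check_clark_prereqs` are hypothetical helpers, not part of the CLARK package, and the list of 64-bit architecture names is an assumption. Note that on a Mac the `gcc` command may be provided by another compiler, in which case the reported version is not a GCC version.

```shell
# Succeeds (exit 0) when dotted version $1 is at least $2.
# Only the major and minor fields are compared.
version_ge() {
    have_major=${1%%.*}
    want_major=${2%%.*}
    # Extract the minor field, defaulting to 0 when there is no dot.
    case $1 in
        *.*) have_rest=${1#*.}; have_minor=${have_rest%%.*} ;;
        *)   have_minor=0 ;;
    esac
    case $2 in
        *.*) want_rest=${2#*.}; want_minor=${want_rest%%.*} ;;
        *)   want_minor=0 ;;
    esac
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -gt "$want_major" ]
        return
    fi
    [ "$have_minor" -ge "$want_minor" ]
}

# Reports whether this machine appears to satisfy the stated requirements.
check_clark_prereqs() {
    arch=$(uname -m)
    case $arch in
        x86_64|amd64|arm64|aarch64) ;;  # assumed names of 64-bit architectures
        *) echo "warning: $arch may not be a 64-bit architecture" >&2 ;;
    esac
    if ! command -v gcc >/dev/null 2>&1; then
        echo "gcc not found" >&2
        return 1
    fi
    gcc_version=$(gcc -dumpversion)
    if version_ge "$gcc_version" 4.4; then
        echo "gcc $gcc_version is recent enough"
    else
        echo "gcc $gcc_version is older than 4.4" >&2
        return 1
    fi
}
```

Calling `check_clark_prereqs` prints a one-line verdict and returns a non-zero status when the compiler is missing or too old.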
Second, download the zipped package of the latest CLARK release (i.e., v22.214.171.124),
available from the CLARK webpage ("Download" tab). Alternatively, download it from the command line:
$ wget http://clark.cs.ucr.edu/Download/CLARKV126.96.36.199.tar.gz
Third, uncompress the package:
$ tar -xzvf CLARKV188.8.131.52.tar.gz
The manual and source code will be extracted into the sub-directory "CLARKSCV184.108.40.206".
Go into this directory and run the installation script (see the README file).
The installer builds the binaries (CLARK, CLARK-l and CLARK-S) in the folder "exe".
You can use these binaries for any "object"-to-"target" classification project.
In genomics, an object can be a read/contig/scaffold/transcript/BAC and a target can be any genomic sequence (e.g., chromosome, full genome, plasmid, etc.).
We detail in the README file how to define the targets (and the centromeres,
in the case reference sequences are chromosome-arms) and how to run the executables.
In metagenomics, an object can be a read/contig/sequence/scaffold/transcript and a target is a genome (e.g., a bacterial genome).
For this context, several scripts useful for the classification of metagenomic samples are available (presented in the next section), in particular:
- set_targets.sh and classify_metagenome.sh: They classify metagenomic
samples against one or several databases that you specify. You can choose genomes from bacteria/archaea or viruses (NCBI/RefSeq database), and human (latest GRCh38). They will be downloaded automatically if
needed. In addition, you can choose any database you have locally on your machine.
- buildSpacedDB.sh: It creates the sets of discriminative spaced k-mers from the selected databases
(cf. section "Running CLARK-S" below).
- clean.sh: It permanently erases all data downloaded/generated inside the database directory defined
at the last call of "set_targets.sh".
- estimate_abundance.sh: It computes the abundance estimation (count and proportion for each
target identified in CLARK results),
and can produce input for the Krona executable ktImportTaxonomy,
as well as results in the mpa format.
- evaluate_density_confidence.sh, evaluate_density_gamma.sh: They take as input one or several CLARK results files (containing
confidence scores) and output/plot the distribution of assignments per confidence score or gamma score.
- resetCustomDB.sh: It resets the targets definition with sequences (newly
added/modified) of the customized database. Any call of this script must be
followed by a run of set_targets.sh.
- updateTaxonomy.sh: To update your local copy with the latest taxonomy data (taxonomy id,
accession numbers, etc.) from the NCBI website.
- makeSummaryTables.sh (new): To build spreadsheets summarizing results from several
CLARK report files. Given a rank r and a minimum abundance level a (i.e., the ratio
between the number of reads classified to a taxon and the total number of reads in the report file)
expressed in percent, this script produces the following report files:
For each report file, this table indicates the number of reads, the number of
assigned or classified reads, the alpha-diversity (using the Shannon-index) of the sample, the minimum abundance level used,
the proportion of reads classified in percentage, and the r most dominant taxa/organisms
with abundance higher than a (they are ordered by
decreasing order of the abundance). The taxa are reported with scientific name and
NCBI taxonomy id.
For each taxon/organism identified across all the report files, this table indicates
the taxonomy id of the taxon, the domain of the taxon, the minimum abundance level used, the number of reports
the taxon was found (within the top r taxa with abundance higher than a),
and some statistics about the abundance of the taxon (i.e., minimum, maximum, average and
standard-deviation in percent).
This table provides the hit count distribution of the targets identified per report,
as well as the total number of reads, the total number of reads classified and the
- getTargetsKmers_distribution.sh: To print out the distribution of the target-specific
k-mers in a file entitled "targets.distribution.csv" in the working database.
The parameters for this script are the key parameters used for creating the database,
i.e., the k-mer length and the minimum k-mers frequency (default 0).
- extractSequences.sh (new): To extract/filter from the input data the sequences assigned to a specific taxon once the classification is done.
This can be used for downstream analysis (analysis of contaminants, genome assembly, etc.).
Parameters are: The taxonomy id (of the taxon to look up), the address of the file containing sequences (fasta/fastq format),
the address of the CLARK results file (csv file), and the minimum thresholds for the gamma value and the confidence score
(in the case the CLARK results file contains these statistics).
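The alpha-diversity mentioned for makeSummaryTables.sh is the Shannon index. As a rough illustration of what that number measures, the sketch below computes H = -sum(p_i * ln p_i) from a tab-separated list of taxa and read counts, where p_i is the share of reads assigned to taxon i. The input layout, the function name `shannon_index` and the use of the natural logarithm are assumptions made for this example; the README should be consulted for how CLARK computes it.

```shell
# Reads "taxon<TAB>read_count" lines on standard input and prints the
# Shannon index H = -sum(p_i * ln p_i) with four decimals.
# Lines with a zero or missing count are ignored.
shannon_index() {
    awk -F '\t' '
        $2 > 0 { count[$1] += $2; total += $2 }
        END {
            h = 0
            for (taxon in count) {
                p = count[taxon] / total   # proportion of reads for this taxon
                h -= p * log(p)            # log() is the natural logarithm in awk
            }
            printf "%.4f\n", h
        }'
}
```

A sample dominated by a single taxon gives a value close to 0, while reads spread evenly over many taxa give a larger value.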
Classification of metagenomic samples
CLARK can accurately and quickly classify metagenomic samples against
one or several databases (downloaded from NCBI or stored locally on your disk),
and analyze the results for you thanks to the scripts we present in this section.
First, we briefly present "set_targets.sh" and "classify_metagenome.sh".
Second, we present the script to get the abundance estimation (count and proportion),
"estimate_abundance.sh", from one or several results file(s).
For more details, please refer to the README file available in the CLARK package.
Setting and classification
We assume that the previous step of installation was successfully done.
Selecting the database(s)
Choose/create a directory to store the database(s).
Let's name it in a generic way, "DIR_DB", for clarity. This directory can be anywhere on your disk(s),
independently of the CLARK source code.
$ mkdir DIR_DB
Then, indicate what database(s) to consider for the classification. You can select
among 'bacteria', 'viruses', 'plasmid', 'plastid', 'protozoa', 'fungi', 'human' and/or 'custom'.
Note that when choosing 'bacteria', the complete genomes from RefSeq of both bacteria and archaea will be used.
For example, to classify against bacteria/archaea genomes only:
$ ./set_targets.sh DIR_DB bacteria
The bacteria (and archaea) genomes from NCBI/RefSeq will be downloaded if they are not present in DIR_DB.
To classify against bacteria/archaea, viruses, fungi and human:
$ ./set_targets.sh DIR_DB bacteria viruses fungi human
Similarly, the bacteria, viruses, fungi or human genomes will be downloaded if not present in DIR_DB.
Creating/Updating the Custom database
To work only with a custom database, you will need to copy/move the sequences
of interest into the directory "Custom", inside DIR_DB.
These files must be fasta files with the accession number in the header to be identified.
Each fasta file must contain sequences for only one reference sequence or taxon.
In short:
1) create the directory "Custom" inside DIR_DB (if it does not exist yet),
2) copy/move the sequences of interest into the Custom folder,
3) run:
$ ./set_targets.sh DIR_DB custom
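The first two steps of the custom database setup can be scripted. The sketch below is a hypothetical helper (`stage_custom_sequences` is not part of CLARK) that creates the "Custom" folder and copies only files whose first line looks like a fasta header; it does not verify that the header holds a valid accession number or that each file covers a single taxon.

```shell
# Copies fasta files into DIR_DB/Custom, creating the folder if needed.
# Usage: stage_custom_sequences DIR_DB file1.fa [file2.fa ...]
stage_custom_sequences() {
    db_dir=$1
    shift
    mkdir -p "$db_dir/Custom" || return 1
    for fasta in "$@"; do
        # Read the first line; a fasta record starts with a '>' header.
        header=""
        IFS= read -r header < "$fasta" || true
        case $header in
            '>'*)
                cp "$fasta" "$db_dir/Custom/" || return 1
                ;;
            *)
                echo "skipping $fasta: first line is not a fasta header" >&2
                ;;
        esac
    done
    return 0
}
```

After staging the files, for example with `stage_custom_sequences DIR_DB genomeA.fa genomeB.fa`, the targets are defined with `./set_targets.sh DIR_DB custom`.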
Later on, if you want to work with a different customized database (for example, by removing
or adding sequences of interest in the Custom folder), then the targets definition
must be reset.
So, after the sequences in the Custom folder have been updated, just run the script resetCustomDB.sh.
Then, run set_targets.sh with the desired settings (as explained in the previous paragraph).
Setting the taxonomy rank
CLARK defines the targets against a unique taxonomy rank chosen by the user.
Objects are classified against all taxa (from the database(s) you selected), which are all defined at the same taxonomy level.
The default taxonomy rank is species. To change the taxonomy rank to genus, for example,
the command line is (from the example above selecting only bacteria):
$ ./set_targets.sh DIR_DB bacteria --genus
To rather classify your objects at the phylum level:
$ ./set_targets.sh DIR_DB bacteria --phylum
There are six ranks available: --species (the default value), --genus, --family, --order, --class or --phylum.
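When the rank comes from a variable in a wrapper script, it can help to validate it against the six ranks listed above before calling set_targets.sh. This is a small sketch; `rank_flag` is a hypothetical helper, not a CLARK script.

```shell
# Prints the set_targets.sh flag for a rank name, or fails for an unknown rank.
rank_flag() {
    case $1 in
        species|genus|family|order|class|phylum)
            printf '%s\n' "--$1"
            ;;
        *)
            echo "unknown rank: $1 (expected species, genus, family, order, class or phylum)" >&2
            return 1
            ;;
    esac
}
```

For example, `./set_targets.sh DIR_DB bacteria "$(rank_flag genus)"` expands to the genus-level command shown earlier.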
Running the classification
To classify your metagenomic samples (say, "sampleA.fa") with
default parameters of CLARK:
$ ./classify_metagenome.sh -O sampleA.fa -R resultA
where "resultA" is a filename to store the results.
The option "-O" in front of sampleA.fa indicates to the program that sampleA.fa
is the file containing objects to classify,
and, similarly, the option "-R" in front of resultA
indicates where to store the results.
This command will build the database in DIR_DB
using default parameters, if the database has not been created yet.
Building a database will take about 3 hours to complete, then the classification
per se will run and produce the file resultA.csv (the extension '.csv' is appended
to the filename since results are in CSV format).
Processing multiple samples/datasets
The program can process multiple samples/datasets once the database is loaded. In other words, unlike other classifiers, you do not need
to run the program N times if there are N samples/datasets to process. CLARK can load the database with your settings once
and then classify as many datasets as needed.
For example, if you want to annotate six datasets (sample1.fa, sample2.fa, ..., sample6.fa), you can store the addresses (physical locations on your disk)
of these files in one file called "samples.txt", such that:
$ cat samples.txt
sample1.fa
sample2.fa
sample3.fa
sample4.fa
sample5.fa
sample6.fa
and then simply run:
$ ./classify_metagenome.sh -O samples.txt -R samples.txt
Once the computations are done, the program will have created six results files (CSV format) associated with the samples:
sample1.fa.csv, sample2.fa.csv, sample3.fa.csv, sample4.fa.csv, sample5.fa.csv, and sample6.fa.csv.
If you want the results files to have different names, say "result1.csv", "result2.csv", ..., "result6.csv", then you
can store these names in a file "results.txt", such that:
$ cat results.txt
and then run:
$ ./classify_metagenome.sh -O samples.txt -R results.txt
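A list file such as samples.txt does not have to be written by hand. The sketch below builds one from a directory, writing one absolute path per line; `write_sample_list` is a hypothetical helper and the choice of absolute paths is an assumption made to keep the list valid from any working directory. Entries follow the shell's glob order, so "sample10.fa" would sort before "sample2.fa".

```shell
# Writes one absolute path per line, for every regular file in directory $1
# whose name matches glob $2, into list file $3.
# Usage: write_sample_list DIRECTORY 'GLOB' LIST_FILE
write_sample_list() {
    sample_dir=$(cd "$1" && pwd) || return 1
    pattern=$2
    list_file=$3
    : > "$list_file"   # create or truncate the list
    # $pattern is left unquoted on purpose so the shell expands the glob.
    for path in "$sample_dir"/$pattern; do
        if [ -f "$path" ]; then
            printf '%s\n' "$path" >> "$list_file"
        fi
    done
    return 0
}
```

For example, `write_sample_list ./reads 'sample*.fa' samples.txt` produces a list that can be passed to the "-O" option as described above.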
This scalable way of annotating multiple datasets works for single-end reads or paired-end reads. In the case of paired-end reads, you
must provide two files (each containing addresses of files for the right/left reads). For example, if you have three datasets (sample1, sample2, and sample3)
of paired-end reads, then you can create "samples.R.txt" and "samples.L.txt" such that:
$ cat samples.R.txt
$ cat samples.L.txt
You can run CLARK on these datasets of paired-end reads (with option "-P"):
$ ./classify_metagenome.sh -P samples.R.txt samples.L.txt -R results.txt
where results.txt is:
$ cat results.txt
Results files will be stored in files entitled "sample1.R1.csv", "sample2.R1.csv" and "sample3.R1.csv" (consistently
with the input dataset of same prefix).
Or you can simply run:
$ ./classify_metagenome.sh -P samples.R.txt samples.L.txt -R samples.R.txt
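With paired-end data, the right-read list, the left-read list and the results list are presumably matched line by line, so a missing entry in one of them would shift every following dataset. Under that assumption, a quick sanity check before launching the run is to confirm that all lists have the same number of entries. `same_entry_count` is a hypothetical helper, not a CLARK script.

```shell
# Succeeds when all given list files contain the same number of non-empty lines.
# Usage: same_entry_count LIST1 LIST2 [LIST3 ...]
same_entry_count() {
    expected=""
    for list in "$@"; do
        # Count non-empty lines; "n + 0" prints 0 for an empty file.
        n=$(awk 'NF { n++ } END { print n + 0 }' "$list") || return 1
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" -ne "$expected" ]; then
            echo "$list has $n entries, expected $expected" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `same_entry_count samples.R.txt samples.L.txt results.txt` returns a non-zero status and names the first list whose length differs.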
You can change the parameters (e.g., the k-mer length, the mode of execution, the variant,
the number of parallel threads, etc.) or specify options for your data
(e.g., compressed files).
To see the full list of options/parameters available, run:
Please refer to the README file for details on the options and for the provided examples.
Abundance estimation
We assume that the previous step of setting/classification was successfully done.
The script "estimate_abundance.sh" can analyze the raw CLARK results of a metagenomic sample
and provide, for each target identified, the count and proportion of objects (reads/sequences)
assigned to it.
In addition, this script can also convert CLARK's assignments so that they can be processed
by the Krona browser tool (see option "--krona").
To run the estimation, for resultA.csv:
$ ./estimate_abundance.sh -F resultA.csv -D DIR_DB
The "-F" indicates that resultA.csv is the file containing results and "-D"
indicates where the database is (to find the taxonomy data and load scientific names of taxa).
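To make "count and proportion" concrete, the sketch below tallies the values of one column of a results file and prints each value with its count and its percentage of all rows. It is only an illustration of the idea: the assumption that the file has a one-line header and that the assignment is in the third comma-separated column should be checked against the header of your own results file, and `tally_assignments` is a hypothetical helper. Unlike estimate_abundance.sh, it does not load scientific names or apply any filter.

```shell
# Prints "label<TAB>count<TAB>percent" for each distinct value found in one
# column of a CSV file, most frequent first. The first line is skipped as a header.
# Usage: tally_assignments RESULT.csv [COLUMN]   (COLUMN defaults to 3, an assumption)
tally_assignments() {
    awk -F ',' -v col="${2:-3}" '
        NR > 1 && NF >= col { count[$col]++; total++ }
        END {
            for (label in count)
                printf "%s\t%d\t%.2f\n", label, count[label], 100 * count[label] / total
        }' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

For example, `tally_assignments resultA.csv` lists the most frequently assigned labels first.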
You can pass several results files at a time. This script also offers
options to filter out classified objects. To see all options available:
Please refer to the README file for details about all options and for examples of how to set
them.
Running CLARK-S
The current release of CLARK-S exploits discriminative spaced k-mers (see the peer-reviewed publication).
The classification can be done, like CLARK or CLARK-l, at phylum, genus or species level for example.
Before classifying your metagenomic sample, you must first create the database of discriminative 31-mers (e.g., bacteria/archaea genomes)
and then databases of discriminative spaced 31-mers (using the script "buildSpacedDB.sh").
Step 0: Set the database, for example:
$ ./set_targets.sh DIR_DB bacteria --phylum
or:
$ ./set_targets.sh DIR_DB bacteria viruses --species
where "DIR_DB" is the directory to store/copy reference sequences and the database.
We recommend working with bacterial and viral genomes at the species level.
Step 1: Create the discriminative 31-mers of the database you have defined in step 0 (if they do not exist already).
This can be done by running the default variant CLARK on the sample of your choice, for example:
$ ./classify_metagenome.sh -O sample.fa -R result
Step 2: Create the databases of discriminative spaced k-mers with the script "buildSpacedDB.sh".
This task will take several hours to complete (6 to 7 hours).
To classify your metagenome (e.g., sample.fq) with CLARK-S, you need to indicate "--spaced":
$ ./classify_metagenome.sh -O sample.fq -R result --spaced
CLARK-S produces results with statistics (especially, confidence and gamma scores), see README file.
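The CLARK-S procedure above can be collected into one wrapper. The sketch below only prints the commands by default and executes them when the environment variable CLARK_RUN is set to 1; `run_step`, `clark_s_pipeline` and CLARK_RUN are hypothetical names introduced for this example. Calling buildSpacedDB.sh without arguments is an assumption, so check the README before running it for real.

```shell
# Prints a command, or executes it when CLARK_RUN=1 (hypothetical switch).
run_step() {
    if [ "${CLARK_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Chains the CLARK-S steps for one sample.
# Usage: clark_s_pipeline DIR_DB SAMPLE RESULT
clark_s_pipeline() {
    db_dir=$1
    sample=$2
    result=$3
    # Step 0: define the targets (bacteria and viruses at the species level).
    run_step ./set_targets.sh "$db_dir" bacteria viruses --species || return 1
    # Step 1: a default CLARK run creates the discriminative 31-mers if missing.
    run_step ./classify_metagenome.sh -O "$sample" -R "$result" || return 1
    # Step 2: build the spaced k-mer databases (argument-free call is an assumption).
    run_step ./buildSpacedDB.sh || return 1
    # Final run: classify the sample with CLARK-S.
    run_step ./classify_metagenome.sh -O "$sample" -R "$result" --spaced
}
```

Running `clark_s_pipeline DIR_DB sample.fq result` without CLARK_RUN set gives a dry run that lists the four commands in order, which is a convenient way to review them given that the database steps take hours.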
Database change: If you want to work with a different database or taxonomy rank, restart the procedure described above from step 0.
Lower memory: CLARK-S uses about 108 GB of RAM with default settings.
If you want it to use less memory, you can use the sampling factor to set the proportion of data to load in memory.
For example, you can set the program to load only half of the discriminative spaced k-mers with the option "-s 2":
$ ./classify_metagenome.sh -O sample.fq -R result --spaced -s 2
Higher speed: With default settings, CLARK-S runs slower than CLARK.
To increase the speed, you can run it in parallel (option "-n <numberOfThreads>")
or you can run it in express mode (option "-m 2"):
$ ./classify_metagenome.sh -O sample.fq -R result --spaced -m 2
or when using 4 threads:
$ ./classify_metagenome.sh -O sample.fq -R result --spaced -m 2 -n 4
CLARK can be used for purposes other than the classification of
metagenomic samples or BACs/transcripts.
Here is a non-exhaustive list of tasks CLARK can perform:
- Identification of antimicrobial resistance (AMR) using a database of AMR markers, and similarly identification of virulence factors.
- Identification of chimeras and vector contamination in sequenced BACs.
- Detection of contaminants (e.g., bacterial genomes) in draft/finished reference genomes.
- Definition of centromeric regions using sequences of flow sorted chromosome-arms.
- Definition of genomic signatures within any taxa.
- Assignments correction in BAC libraries.
CLARK is actively supported. Any errors or bugs? Please report them and help us improve the tool (see instructions in the "Download" tab).
Are you a CLARK user? Please join the Google Group!
Feel free to share any comments and/or suggestions for additional features.