.. _cli::p3-build-kmer-db:


################
p3-build-kmer-db
################


***********************************************
Build a Kmer Database from a Table of Sequences
***********************************************


.. code-block:: perl

     p3-build-kmer-db.pl [options] idCol outFile


This script creates a kmer database. The basic model of the database is that we have groups of incoming sequences, each with an ID and a name. So, for
example, a group could be a whole genome with each sequence a contig, or a group could be a specific protein with only one sequence per group-- the
protein itself and the name the protein's role. Names are entirely optional.

The database will map each kmer to a list of the groups to which it belongs. Command-line options allow you to specify that common kmers be eliminated
or that the kmers be discriminating (that is, unique to only one group). The kmer database can then be used as input to various other scripts (such as
:ref:`cli::p3-closest-seqs`).

Parameters
==========


The positional parameters are the column identifier for the column containing the group ID and the name of the output file into which the
kmer database is to be stored. The constant string \ ``fasta``\  can be used for the group ID column if a FASTA file is input. In that case, the sequence ID
is the group ID and the comment is the group name.

The standard input can be overriddn using the options in :ref:`cli-input-options`.

The options in :ref:`cli-column-options` can be used to specify the input column containing the sequence text. The default is the last input column.

Additional command-line options are the following.


- kmerSize
 
 The size of a kmer. The default is \ ``15``\ .
 

- max
 
 The maximum number of times a kmer can appear. A kmer appearing more than the specified number of times is considered common and discarded. A value of \ ``0``\ 
 indicates all kmers should be kept. The default is \ ``10``\ .
 

- nameCol
 
 The index (1-based) or name of the input column containing the group names.
 

- discriminating
 
 If specified, only discriminating kmers (that is, kmers unique to a single group) are kept. In this case, the \ ``--max``\  option is ignored.