p3-signature-families

Compute Family Signatures

p3-signature-families --gs1=FileOfGenomeIds
                      --gs2=FileOfGenomeIds
                      [--min=MinGs1Frac]
                      [--max=MaxGs2Frac]
   > family.signatures

This script compares two genome groups: group 1 contains genomes that are interesting for some reason, and group 2 contains genomes that are not. The output lists the protein families that are common in group 1 but not in group 2. The output file is tab-delimited, with four columns: the number of family occurrences in group 1, the number of family occurrences in group 2, the family ID, and the family's assigned function.
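As an illustration of the output layout described above, the following Python sketch reads a four-column, tab-delimited signature file into records. It is not part of the script itself. The `Signature` type, the `read_signatures` helper, and the sample family IDs are hypothetical, and the check that skips a possible header line is an assumption about the file, not documented behavior.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Signature(NamedTuple):
    """One output row (hypothetical type, for illustration only)."""
    count1: int      # number of family occurrences in group 1
    count2: int      # number of family occurrences in group 2
    family_id: str   # protein family ID
    function: str    # function assigned to the family


def read_signatures(handle) -> Iterator[Signature]:
    """Yield one Signature per data row of a tab-delimited signature file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: any row whose first field is not a number (for example
        # a header line) or that has fewer than four fields is skipped.
        if len(row) < 4 or not row[0].isdigit():
            continue
        yield Signature(int(row[0]), int(row[1]), row[2], row[3])


# Made-up sample data in the documented column order.
sample = (
    "5\t0\tFAM_A\thypothetical function A\n"
    "4\t1\tFAM_B\thypothetical function B\n"
)
signatures = list(read_signatures(io.StringIO(sample)))
```

To read a real file, the same helper could be given an open file handle for family.signatures in place of the in-memory sample.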

Parameters

There are no positional parameters. The parameters in Column Options can be used to specify the key column in both input files. The following additional parameters are also supported.

  • gs1

A tab-delimited file of genomes. These are thought of as the genomes that have a given property (e.g. belong to a certain species, have resistance to a particular antibiotic). If omitted, the standard input is used. The genome IDs must be in the last column.

  • gs2

A tab-delimited file of genomes. These are genomes that do not have the given property. If omitted, the standard input is used. The genome IDs must be in the last column. Any genomes present in the gs1 set will be automatically deleted from this list.

  • min

Minimum fraction of the genomes in gs1 that must contain a family for it to qualify as a signature family.

  • max

Maximum fraction of the genomes in gs2 that may contain a family for it to still qualify as a signature family.

  • verbose

Write progress messages to STDERR.
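The interplay of gs1, gs2, min, and max can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script's actual implementation: the `signature_families` function and the `families_by_genome` mapping are hypothetical, each genome is assumed to count at most once per family, and both thresholds are treated as inclusive. The threshold values in the example are arbitrary and are not the script's defaults.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def signature_families(
    gs1: Set[str],
    gs2: Set[str],
    families_by_genome: Dict[str, Set[str]],  # hypothetical genome -> family IDs
    min_frac: float,
    max_frac: float,
) -> List[Tuple[int, int, str]]:
    """Return (count in gs1, count in gs2, family ID) for qualifying families."""
    # Genomes that also appear in gs1 are removed from gs2.
    gs2 = gs2 - gs1
    # Count, per family, how many genomes of each group contain it.
    count1 = Counter(f for g in gs1 for f in families_by_genome.get(g, set()))
    count2 = Counter(f for g in gs2 for f in families_by_genome.get(g, set()))
    result = []
    for family, n1 in count1.items():
        n2 = count2.get(family, 0)
        frac1 = n1 / len(gs1)
        frac2 = n2 / len(gs2) if gs2 else 0.0
        # Assumption: both comparisons are inclusive.
        if frac1 >= min_frac and frac2 <= max_frac:
            result.append((n1, n2, family))
    return sorted(result, key=lambda row: row[2])


# Made-up example: g4 is listed in both groups, so it is dropped from gs2.
gs1 = {"g1", "g2", "g3", "g4"}
gs2 = {"g4", "g5", "g6"}
families_by_genome = {
    "g1": {"A", "B"},
    "g2": {"A", "B"},
    "g3": {"A"},
    "g4": {"A", "C"},
    "g5": {"B"},
    "g6": {"C"},
}
result = signature_families(gs1, gs2, families_by_genome, min_frac=0.75, max_frac=0.25)
```

In this example only family A qualifies: it occurs in all four gs1 genomes and in none of the remaining gs2 genomes, while B and C fall below the minimum fraction for gs1.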