p3-format-results

Format Raw Data for Conversion to HTML.

p3-format-results -d DataDirectory > condensed Output

This tool takes as input an Output Directory created by p3-related-by-clusters. It produces a condensed text for of the computed clusters.

These text-forms can be run through p3-aggregates-to-html to get versions that can be perused by a biologist.

The input directory contains a list of protein family pairs in a file called related.signature.families. Each record consists of three tab-delimited columns– (0) first family ID, (1) second family ID, and (2) count.

For each pair, we want to create an output group that shows the pairing (and the intervening features) for each genome. Each such group consists of the following.

  • coupling_header

A header consisting of the string ////.

  • count

Immediately after the header, a tab-delimited line containing (0) the first family ID, (1) the second family ID, and (2) the occurrence count.

  • genome_header

Following the count, there are one or more genome groups. Each starts with a header containing ###, a tab, and the genome name.

  • feature

Inside the genome group there are one or more feature records. The feature record is tab-delimited, and consists of (0) a feature ID, (1) a protein family ID, and (2) a functional assignment.

The count records are taken directly from related.signature.families. The feature records are taken from the iteration files in the data directory. These files are found in the subdirectory <CS>, and each genome group is separated by a // line.

Our strategy will be to read in all the genome groups and track the genome and the protein family set for each. When we process a protein family pairing, we will output all of the genome groups that have both protein families present.

Parameters

There are no positional parameters.

Standard input is not used.

The additional command-line options are as follows.

  • d DataDirectory