P3Utils

PATRIC Script Utilities

This module contains shared utilities for PATRIC 3 scripts.

Constants

These constants define the sort-of ER model for PATRIC.

OBJECTS

Mapping from user-friendly names to PATRIC names.

FIELDS

Mapping from user-friendly object names to default fields.

IDCOL

Mapping from user-friendly object names to ID column names.

DERIVED

Mapping from objects to derived fields. For each derived field name we have a list reference consisting of the function name followed by a list of the constituent fields.

Methods

data_options

my @opts = P3Utils::data_options();

This method returns a list of the Getopt::Long::Descriptive specifications for the common data retrieval options. These options include those returned by delim_options plus the following.

  • attr

Names of the fields to return. Multiple field names may be specified by coding the option multiple times or separating the field names with commas. Mutually exclusive with --count.

  • count

If specified, a count of records found will be returned instead of the records themselves. Mutually exclusive with --attr.

  • equal

Equality constraints of the form field-name,value. If the field is numeric, the constraint will be an exact match. If the field is a string, the constraint will be a substring match. An asterisk in string values is interpreted as a wild card. Multiple equality constraints may be specified by coding the option multiple times.

  • lt, le, gt, ge, ne

Inequality constraints of the form field-name,value. Multiple constraints of each type may be specified by coding the option multiple times.

  • in

Multi-valued equality constraints of the form field-name,value1,value2,...,valueN. The constraint is satisfied if the field value matches any one of the specified constraint values. Multiple constraints may be specified by coding the option multiple times.

  • required

Specifies the name of a field that must have a value for the record to be included in the output. Multiple fields may be specified by coding the option multiple times.

  • keyword

Specifies a keyword or phrase (in quotes) that should be included in any field of the output. This performs a text search against entire records.

  • debug

Display debugging information on STDERR.
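
The option value formats described above can be illustrated with a small, self-contained sketch. The helper names flatten_attrs and parse_constraint are hypothetical and are not part of P3Utils; splitting a constraint at the first comma only is an assumption made here so that values containing commas survive.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten values given as repeated options and/or
# comma-separated lists, e.g. --attr a,b --attr c.
sub flatten_attrs {
    my (@raw) = @_;
    return map { split /,/, $_ } @raw;
}

# Hypothetical helper: split a "field-name,value" constraint at the first
# comma only (an assumption), so that values containing commas are preserved.
sub parse_constraint {
    my ($spec) = @_;
    my ($field, $value) = split /,/, $spec, 2;
    die "Constraint '$spec' has no value.\n" if ! defined $value;
    return ($field, $value);
}

my @attrs = flatten_attrs('genome_id,genome_name', 'contigs');
my ($field, $value) = parse_constraint('genome_name,Streptomyces');
print join("\t", @attrs), "\n";
print "$field = $value\n";
```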

col_options

my @opts = P3Utils::col_options($batchSize);

This method returns a list of the Getopt::Long::Descriptive specifications for the common column specification options. These options are as follows.

  • col

Index (1-based) of the column containing the key field. If a non-numeric value is specified, it is presumed to be the header of the desired column. The default is 0, which indicates the last column.

  • batchSize

Maximum number of lines to read in a batch. The default is 100.

  • nohead

Input file has no headers.

The method takes as a parameter a default batch size to override the normal default of 100.

delim_options

my @options = P3Utils::delim_options();

This method returns a list of options related to delimiter specification for multi-valued fields.

  • delim

The delimiter to use between object names. The default is ::. Specify tab for tab-delimited output, space for space-delimited output, semi for a semicolon followed by a space, or comma for comma-delimited output. Other values might have unexpected results.

delim

my $delim = P3Utils::delim($opt);

Return the delimiter to use between the elements of multi-valued fields.

  • opt

A Getopt::Long::Descriptive::Opts object containing the delimiter specification.

undelim

my $undelim = P3Utils::undelim($opt);

Return the pattern to use to split the elements of multi-valued fields.

  • opt

A Getopt::Long::Descriptive::Opts object containing the delimiter specification.
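
As a rough illustration of how the named delimiters and the matching split pattern relate, the sketch below maps the documented names to separators and builds a split pattern from the same value. The helpers resolve_delim and resolve_undelim are hypothetical stand-ins, and passing unknown names through unchanged is an assumption.

```perl
use strict;
use warnings;

# Named delimiters listed in the delim option description.
my %DELIMS = (tab => "\t", space => ' ', semi => '; ', comma => ',');

# Hypothetical helper: translate a --delim value into the actual separator.
# An undefined value falls back to the documented default of "::".
# Assumption: unrecognized names are used literally.
sub resolve_delim {
    my ($name) = @_;
    $name //= '::';
    return $DELIMS{$name} // $name;
}

# Hypothetical helper: build a split pattern for the same separator.
# quotemeta keeps characters such as "|" or "." from acting as regex operators.
sub resolve_undelim {
    my ($name) = @_;
    my $delim = quotemeta(resolve_delim($name));
    return qr/$delim/;
}

my $joined = join(resolve_delim('semi'), 'alpha', 'beta', 'gamma');
my @parts = split resolve_undelim('semi'), $joined;
print "$joined\n";
print scalar(@parts), " values recovered\n";
```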

get_couplets

my $couplets = P3Utils::get_couplets($ih, $colNum, $opt);

Read a chunk of data from a tab-delimited input file and return couplets. A couplet is a 2-tuple consisting of a key column followed by a reference to a list containing all the columns. The maximum number of couplets returned is determined by the batch size. If the input file is empty, an undefined value will be returned.

  • ih

Open input file handle for the tab-delimited input file.

  • colNum

Index of the key column.

  • opt

A Getopt::Long::Descriptive::Opts object containing the batch size specification.

  • RETURN

Returns a reference to a list of couplets.
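
The couplet structure can be sketched without P3Utils. The function read_couplets below is a hypothetical stand-in that takes the batch size directly instead of an options object; it is meant only to show the shape of the returned data, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the batching logic: read up to $batchSize lines
# and pair each key value with a reference to the full list of columns.
sub read_couplets {
    my ($ih, $colNum, $batchSize) = @_;
    my @couplets;
    # The size check comes first so that no line is read and then discarded.
    while (@couplets < $batchSize && defined(my $line = <$ih>)) {
        $line =~ s/[\r\n]+$//;
        my @fields = split /\t/, $line, -1;
        push @couplets, [$fields[$colNum], \@fields];
    }
    # Mirror the documented behavior of returning undef when no data is left.
    return @couplets ? \@couplets : undef;
}

# An in-memory file stands in for tab-delimited input whose headers
# have already been consumed.
my $data = "83332.12\tMycobacterium\n511145.12\tEscherichia\n1313.5\tStreptococcus\n";
open(my $ih, '<', \$data) or die "Could not open in-memory file: $!";
while (my $batch = read_couplets($ih, 0, 2)) {
    print scalar(@$batch), " couplets, first key $batch->[0][0]\n";
}
```

Each couplet is [key, [all columns]], so the original input row is still available when results are appended to it later.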

get_col

my $column = P3Utils::get_col($ih, $colNum);

Read an entire column of data from a tab-delimited input file.

  • ih

Open input file handle for the tab-delimited input file, positioned after the headers.

  • colNum

Index of the key column.

  • RETURN

Returns a reference to a list of column values.

process_headers

my ($outHeaders, $keyCol) = P3Utils::process_headers($ih, $opt, $keyless);

Read the header line from a tab-delimited input, format the output headers and compute the index of the key column.

  • ih

Open input file handle.

  • opt

Should be a Getopt::Long::Descriptive::Opts object containing the specifications for the key column or a string containing the key column name. At a minimum, it must support the nohead option.

  • keyless (optional)

If TRUE, then it is presumed there is no key column.

  • RETURN

Returns a two-element list consisting of a reference to a list of the header values and the 0-based index of the key column. If there is no key column, the second element of the list will be undefined.

find_column

my $keyCol = P3Utils::find_column($col, \@headers, $optional);

Determine the correct (0-based) index of the key column in a file from a column specifier and the headers. The column specifier can be a 1-based index or the name of a header.

  • col

Incoming column specifier.

  • headers

Reference to a list of column header names.

  • optional (optional)

If TRUE, then failure to find the header is not an error.

  • RETURN

Returns the 0-based index of the key column or undef if the header was not found.
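
The resolution rules can be restated as a short sketch. The function resolve_column is hypothetical; treating 0 as the last column follows the col option description, and exact header-name matching is an assumption (the module may be more lenient).

```perl
use strict;
use warnings;

# Hypothetical re-statement of the rules above: a numeric specifier is a
# 1-based position (0 meaning the last column, as in the col option), and
# anything else is looked up by header name (exact match assumed).
sub resolve_column {
    my ($col, $headers, $optional) = @_;
    if ($col =~ /^\d+$/) {
        return $col == 0 ? $#{$headers} : $col - 1;
    }
    for my $i (0 .. $#{$headers}) {
        return $i if $headers->[$i] eq $col;
    }
    die "Column '$col' not found in headers.\n" if ! $optional;
    return;
}

my @headers = ('genome_id', 'genome_name', 'contigs');
print resolve_column(2, \@headers), "\n";             # 1-based 2 is index 1
print resolve_column(0, \@headers), "\n";             # last column, index 2
print resolve_column('genome_id', \@headers), "\n";   # header lookup, index 0
```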

form_filter

my $filterList = P3Utils::form_filter($opt);

Compute the filter list for the specified options.

  • opt

A Getopt::Long::Descriptive::Opts object containing the command-line options that constrain the query (--equal, --in).

  • RETURN

Returns a reference to a list of filter specifications for a call to the query method of P3DataAPI.

select_clause

my ($selectList, $newHeaders) = P3Utils::select_clause($p3, $object, $opt, $idFlag, \@default);

Determine the list of fields to be returned for the current query. If an --attr option is present, its listed fields are used. Otherwise, a default list is used.

  • p3

The P3DataAPI object used to access PATRIC.

  • object

Name of the object being retrieved: genome, feature, protein_family, or genome_drug.

  • opt

Getopt::Long::Descriptive::Opts object for the command-line options, including the --attr option.

  • idFlag

If TRUE, then only the ID column will be returned if no attributes are explicitly specified, and if attributes are explicitly specified, the ID column will be added if it is not present.

  • default

If specified, must be a reference to a list of field names. The named fields will be returned if no --attr option is passed in. This overrides the normal default fields.

  • RETURN

Returns a two-element list consisting of a reference to a list of the names of the fields to retrieve, and a reference to a list of the proposed headers for the new columns. If the user wants a count, the first element will be undefined, and the second will be a singleton list of count.

clean_value

my $cleaned = P3Utils::clean_value($value);

Clean up a value for use in a filter specification.

  • value

Value to clean up. Cleaning involves removing parentheses, illegal characters, and leading and trailing spaces.

  • RETURN

Returns a usable version of the incoming value.

get_data

my $resultList = P3Utils::get_data($p3, $object, \@filter, \@cols, $fieldName, \@couplets);

Return all of the indicated fields for the indicated entity (object) with the specified constraints. This method is simply a less-general interface to the query method of P3DataAPI that handles standard command-line script options for filtering.

  • p3

P3DataAPI object for accessing the database.

  • object

User-friendly name of the PATRIC object whose data is desired (e.g. genome, feature).

  • filter

Reference to a list of filter clauses for the query.

  • cols

Reference to a list of the names of the fields to return from the object, or undef if a count is desired.

  • fieldName (optional)

The name of the field in the specified object that is to be used as the key field. If an all-objects query is desired, then this parameter should be omitted.

  • couplets (optional)

A reference to a list of 2-tuples, each tuple consisting of a key value followed by a reference to a list of the values from the input row containing that key value.

  • RETURN

Returns a reference to a list of tuples containing the data returned by PATRIC, each output row appended to the appropriate input row from the couplets.

get_data_batch

my $resultList = P3Utils::get_data_batch($p3, $object, \@filter, \@cols, \@couplets, $keyField);

Return all of the indicated fields for the indicated entity (object) with the specified constraints. This version differs from get_data in that the couplet keys are matched to a true key field (the matches are exact).

  • p3

P3DataAPI object for accessing the database.

  • object

User-friendly name of the PATRIC object whose data is desired (e.g. genome, feature).

  • filter

Reference to a list of filter clauses for the query.

  • cols

Reference to a list of the names of the fields to return from the object, or undef if a count is desired.

  • couplets

A reference to a list of 2-tuples, each tuple consisting of a key value followed by a reference to a list of the values from the input row containing that key value.

  • keyField (optional)

The key field to use. If omitted, the object’s ID field is used.

  • RETURN

Returns a reference to a list of tuples containing the data returned by PATRIC, each output row appended to the appropriate input row from the couplets.

get_data_keyed

my $resultList = P3Utils::get_data_keyed($p3, $object, \@filter, \@cols, \@keys, $keyField);

Return all of the indicated fields for the indicated entity (object) with the specified constraints. The query is by key, and the keys are split into batches to prevent PATRIC from overloading.

  • p3

P3DataAPI object for accessing the database.

  • object

User-friendly name of the PATRIC object whose data is desired (e.g. genome, feature).

  • filter

Reference to a list of filter clauses for the query.

  • cols

Reference to a list of the names of the fields to return from the object, or undef if a count is desired.

  • keys

A reference to a list of key values.

  • keyField (optional)

The key field to use. If omitted, the object’s ID field is used.

  • RETURN

Returns a reference to a list of tuples containing the data returned by PATRIC.
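
The batching idea can be sketched as follows. The helper key_batches is hypothetical, and the batch size of 3 is chosen only for illustration; the size actually used by P3Utils is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: split a key list into batches of at most $size keys.
sub key_batches {
    my ($keys, $size) = @_;
    my @pending = @$keys;    # copy so the caller's list is left untouched
    my @batches;
    while (@pending) {
        push @batches, [splice(@pending, 0, $size)];
    }
    return \@batches;
}

my @keys = map { "fig|83332.12.peg.$_" } 1 .. 7;
for my $batch (@{ key_batches(\@keys, 3) }) {
    # A real caller would issue one keyed query per batch at this point.
    print scalar(@$batch), " keys starting with $batch->[0]\n";
}
```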

script_opts

my $opt = P3Utils::script_opts($parmComment, @options);

Process the command-line options for a P3 script. This method automatically handles the --help option.

  • parmComment

A string indicating the command’s signature for the positional parameters. Used for the help display.

  • options

A list of options such as are expected by Getopt::Long::Descriptive.

  • RETURN

Returns the options object. Every command-line option’s value may be retrieved using a method on this object.

If invoked in array context, returns the options object, usage object pair so that the calling code may emit detailed usage messages if needed.

ih

my $ih = P3Utils::ih($opt);

Get the input file handle from the options. If no input file is specified in the options, opens the standard input.

  • opt

Getopt::Long::Descriptive::Opts object for the current command-line options.

  • RETURN

Returns an open file handle for the script input.

ih_options

my @opt_specs = P3Utils::ih_options();

These are the command-line options for specifying a standard input file.

  • input

Name of the main input file. If omitted and an input file is required, the standard input is used.

oh

my $oh = P3Utils::oh($opt);

Get the output file handle from the options. If no output file is specified in the options, opens the standard output.

  • opt

Getopt::Long::Descriptive::Opts object for the current command-line options.

  • RETURN

Returns an open file handle for the script output.

oh_options

my @opt_specs = P3Utils::oh_options();

These are the command-line options for specifying a standard output file.

  • output

Name of the main output file. If omitted and an output file is required, the standard output is used.

match

my $flag = P3Utils::match($pattern, $key, %options);

Test a match pattern against a key value and return 1 if there is a match and 0 otherwise. If the key is numeric, a numeric equality match is performed. If the key is non-numeric, then we have a match if any subsequence of the words in the key is equal to the pattern (case-insensitive). The goal here is to more or less replicate the SOLR eq operator.

  • pattern

The pattern to be matched. If undef, then any nonblank key matches.

  • key

The value against which to match the pattern.

  • options

Zero or more of the following keys, which modify the match.

  • exact

If TRUE, then non-numeric matches are exact.

  • RETURN

Returns 1 if there is a match, else 0.
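
The matching rules can be approximated with the sketch below. The function word_match is hypothetical and not the module's code. It assumes that "subsequence of the words" means a contiguous run of words, that exact mode is a plain string comparison, and that a simple decimal pattern is enough to decide whether a key is numeric.

```perl
use strict;
use warnings;

# Hypothetical sketch of the documented matching rules; not the module's code.
sub word_match {
    my ($pattern, $key, %options) = @_;
    # An undefined pattern matches any nonblank key.
    return ($key =~ /\S/ ? 1 : 0) if ! defined $pattern;
    # Numeric keys are compared numerically (simplified numeric test).
    my $number = qr/^-?\d+(?:\.\d+)?$/;
    if ($key =~ $number) {
        return ($pattern =~ $number && $pattern == $key) ? 1 : 0;
    }
    # Exact mode compares the whole strings (assumed case-sensitive).
    return ($pattern eq $key ? 1 : 0) if $options{exact};
    # Otherwise look for the pattern words as a contiguous run of key words.
    my @keyWords = grep { length } split /\W+/, lc $key;
    my @patWords = grep { length } split /\W+/, lc $pattern;
    return 0 if ! @patWords || @patWords > @keyWords;
    for my $start (0 .. @keyWords - @patWords) {
        my $run = join(' ', @keyWords[$start .. $start + $#patWords]);
        return 1 if $run eq join(' ', @patWords);
    }
    return 0;
}

my $key = 'DNA-directed RNA polymerase beta subunit';
print word_match('polymerase beta', $key), "\n";   # adjacent words: 1
print word_match('RNA subunit', $key), "\n";       # words not adjacent: 0
print word_match('100', '100.0'), "\n";            # numeric equality: 1
```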

protein_fasta

P3Utils::protein_fasta($p3, $genome, $fileName);

Create a FASTA file for the proteins in a genome.

  • p3

A P3DataAPI object for downloading from PATRIC.

  • genome

The ID of the genome whose proteins are desired.

  • fileName

The name of a file to contain the FASTA data, or an open output file handle to which the data should be written.

find_headers

my ($headers, $cols) = P3Utils::find_headers($ih, $fileType => @fields);

Search the headers of the specified input file for the named fields and return the list of headers plus a list of the column indices for the named fields.

  • ih

Open input file handle, or a reference to a list of headers.

  • fileType

Name to give the input file in error messages.

  • fields

A list of field names for the desired columns.

  • RETURN

Returns a two-element list consisting of (0) a reference to a list of the headers from the input file and (1) a reference to a list of column indices for the desired columns of the input, in order.

get_cols

my @values = P3Utils::get_cols($ih, $cols);

This method returns all the values in the specified columns of the next line of the input file, in order. It is meant to be used as a companion to find_headers. A list reference can be used in place of an open file handle, in which case the columns will be used to index into the list.

  • ih

Open input file handle, or alternatively a list reference.

  • cols

Reference to a list of column indices.

  • RETURN

Returns a list containing the fields in the specified columns, in order.
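
The pairing of a header search with column extraction can be sketched using list references only. The function locate_fields is a hypothetical stand-in for the header search and assumes exact header-name matching; the error message wording is also illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the header search: return the 0-based index of
# each requested field, failing if a field is missing.
sub locate_fields {
    my ($headers, $fileType, @fields) = @_;
    my %index;
    # Record the first position of each header name.
    $index{ $headers->[$_] } //= $_ for 0 .. $#{$headers};
    my @cols;
    for my $field (@fields) {
        die "Could not find required column \"$field\" in $fileType file.\n"
            if ! exists $index{$field};
        push @cols, $index{$field};
    }
    return \@cols;
}

my @headers = ('genome_id', 'genome_name', 'contigs', 'genome_length');
my $cols = locate_fields(\@headers, 'genome', 'genome_length', 'genome_id');

# With a list reference in place of a file handle, picking the columns is
# just an array slice on the row.
my @row = ('83332.12', 'Mycobacterium tuberculosis H37Rv', 1, 4411532);
my @values = @row[@$cols];
print join("\t", @values), "\n";
```

The column indices come back in the order the fields were requested, not in file order, which is why the slice yields the length before the ID here.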

get_fields

my @fields = P3Utils::get_fields($line);

Split a tab-delimited line into fields.

  • line

Input line to split, or an open file handle from which to get the next line.

  • RETURN

Returns a list of the fields in the line.
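
A minimal sketch of the split step for a line already held in a string is shown below. The function split_fields is hypothetical; keeping trailing empty fields (the -1 limit) is a choice made in this sketch and may differ from what get_fields does.

```perl
use strict;
use warnings;

# Hypothetical equivalent of the split step for a line held in a string.
sub split_fields {
    my ($line) = @_;
    $line =~ s/[\r\n]+$//;             # drop the line terminator
    return split /\t/, $line, -1;      # -1 keeps trailing empty fields
}

my @fields = split_fields("fig|83332.12.peg.1\tdnaA\t\t\n");
print scalar(@fields), " fields\n";
```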

list_object_fields

my $fieldList = P3Utils::list_object_fields($p3, $object);

Return the list of field names for an object. The database schema is queried directly.

  • p3

The P3DataAPI object for accessing PATRIC.

  • object

The name of the object whose field names are desired.

  • RETURN

Returns a reference to a list of the field names.

Internal Methods

_process_entries

P3Utils::_process_entries($p3, $object, \@retList, \@entries, \@row, \@cols, $id, $keyField);

Process the specified results from a PATRIC query and store them in the output list.

  • p3

The P3DataAPI object for querying derived fields.

  • object

Name of the object queried.

  • retList

Reference to a list into which the output rows should be pushed.

  • entries

Reference to a list of query results from PATRIC.

  • row

Reference to a list of values to be prefixed to every output row.

  • cols

Reference to a list of the names of the columns to be put in the output row, or undef if the user wants a count.

  • id (optional)

Name of an ID field that should not be zero or empty. This is used to filter out invalid records.

  • keyField (optional)

Name of an ID field whose value should be put at the beginning of every output row.

_execute_query

P3Utils::_execute_query($p3, $core, $keyField, $dataField, \@keys, \%retHash, $multi);

Execute a query to get the data values associated with a key. The mapping from keys to data values is added to the specified hash.

  • p3

The P3DataAPI object for accessing the database.

  • core

The real name of the table containing the data.

  • keyField

The real name of the table’s key field.

  • dataField

The real name of the associated data field.

  • keys

A reference to a list of the keys whose data values are desired.

  • retHash

A reference to a hash into which results should be placed.

  • multi

If TRUE, then the related field will return multiple values.

_apply

my $result = _apply($function, @values);

Apply a computational function to values to produce a computed field value.

  • function

Name of the function. The supported functions are as follows.

  • altName

Pass the input value back unmodified.

  • concatSemi

Concatenate the sub-values using a semi-colon/space separator.

  • md5

Compute an MD5 for a DNA or protein sequence.

  • values

List of the input values.

  • RETURN

Returns the computed result.
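
One way to picture the function-name dispatch is a table of code references, sketched below. The name apply_function and the table itself are hypothetical. Hashing the sequence exactly as given is an assumption (the module may normalize it first), as is passing the sub-values to concatSemi as a flat list.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical dispatch table keyed by the function names listed above.
my %FUNCTIONS = (
    altName    => sub { return $_[0] },
    concatSemi => sub { return join('; ', @_) },
    # Assumption: the sequence is hashed as given, with no normalization.
    md5        => sub { return md5_hex($_[0]) },
);

sub apply_function {
    my ($function, @values) = @_;
    my $code = $FUNCTIONS{$function}
        or die "Unknown derived-field function '$function'.\n";
    return $code->(@values);
}

print apply_function(concatSemi => 'PF00001', 'PF00002'), "\n";
print apply_function(md5 => 'MKTAYIAKQR'), "\n";
```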

_ec_parse

my @ecNums = P3Utils::_ec_parse($product);

Parse the EC numbers out of the functional assignment string of a feature.

  • product

The functional assignment string containing the EC numbers.

  • RETURN

Returns a list of EC numbers.
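
A possible parsing approach is sketched below. The function parse_ec_numbers is hypothetical, and it assumes EC numbers appear in the functional assignment as "(EC n.n.n.n)", with "-" allowed in the trailing positions; the pattern actually used by _ec_parse may differ.

```perl
use strict;
use warnings;

# Hypothetical parser; assumes EC numbers appear as "(EC n.n.n.n)", where
# trailing positions may be "-" for partially specified numbers.
sub parse_ec_numbers {
    my ($product) = @_;
    my %seen;
    my @found = $product =~ /\(\s*EC\s+(\d+\.(?:\d+|-)\.(?:\d+|-)\.(?:n?\d+|-))\s*\)/g;
    # Drop duplicates while keeping the order of first appearance.
    return grep { ! $seen{$_}++ } @found;
}

my @ecNums = parse_ec_numbers(
    'Alcohol dehydrogenase (EC 1.1.1.1) / Acetaldehyde dehydrogenase (EC 1.2.1.10)');
print join(', ', @ecNums), "\n";
```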

_select_list

my $fieldList = _select_list($object, $cols);

Compute the list of fields required to retrieve the specified columns. This includes the specified normal fields plus any derived fields.

  • object

Name of the object being retrieved.

  • cols

Reference to a list of field names.

  • RETURN

Returns a reference to a list of field names to retrieve.