p3-pick-by-class

Pick Records for Classification Training

p3-pick-by-class.pl [options]

This script reads an entire file into memory and collates its records by the value in the key column. It then outputs randomly selected records so that the number of records with each key value is roughly the same.
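The collation step described above can be sketched in Python as follows. This is an illustrative sketch, not the script's actual code: it assumes tab-delimited input, and the function name collate_by_key, the 0-based key_index parameter, and the sample records are all hypothetical.

```python
from collections import defaultdict

def collate_by_key(lines, key_index):
    """Group data lines by the value in column key_index (0-based).

    Assumption: fields are tab-delimited; the source text does not
    state the input format.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        groups[fields[key_index]].append(line)
    return dict(groups)

# Hypothetical data: column 0 is an ID, column 1 is the class label.
sample = [
    "g1\tresistant\n",
    "g2\tsusceptible\n",
    "g3\tresistant\n",
    "g4\tresistant\n",
]
groups = collate_by_key(sample, 1)
counts = {key: len(rows) for key, rows in groups.items()}
print(counts)  # {'resistant': 3, 'susceptible': 1}
```

The per-key counts show the imbalance that the random selection is meant to even out.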

Parameters

There are no positional parameters.

The standard input can be overridden using the options in Input Options.

Additional command-line options are those given in Column Options (to specify the key column) plus the following.

  • verbose

Display progress messages on the standard error output.

  • fuzz

Margin of error. The maximum number of records output for any key value is this number times the count of the least frequent key value. The default is 1.2. The value must be between 1 and 2 inclusive.

  • max

The maximum number of data lines to output. The default is to output as many as possible.
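To make the interaction of fuzz and max concrete, here is a minimal Python sketch of a capped random selection. It is not the script's implementation. The function pick_balanced and its parameters are hypothetical, rounding the cap down is an assumption, and truncating a shuffled result to honor max is only one plausible reading, since the text above does not say how the limit is applied.

```python
import random

def pick_balanced(groups, fuzz=1.2, max_lines=None, seed=None):
    """Randomly pick records so no key exceeds fuzz * (smallest key count).

    groups maps each key value to its list of records (must be non-empty).
    """
    if not 1 <= fuzz <= 2:
        raise ValueError("fuzz must be between 1 and 2 inclusive")
    rng = random.Random(seed)
    smallest = min(len(rows) for rows in groups.values())
    cap = int(fuzz * smallest)  # assumption: the cap is rounded down
    picked = []
    for rows in groups.values():
        # Take a random subset, never more than the cap or the group size.
        picked.extend(rng.sample(rows, min(cap, len(rows))))
    rng.shuffle(picked)
    if max_lines is not None:
        # Assumption: the output limit is applied to the shuffled result.
        picked = picked[:max_lines]
    return picked

# Hypothetical data: 50 records of one class, 10 of another.
groups = {
    "resistant": [f"r{i}" for i in range(50)],
    "susceptible": [f"s{i}" for i in range(10)],
}
picked = pick_balanced(groups, fuzz=1.5, seed=1)
print(len(picked))  # 25: 15 resistant (1.5 * 10) plus all 10 susceptible
```

With the least frequent key at 10 records and a fuzz of 1.5, the larger class is cut from 50 records to 15, while the smaller class is kept whole.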