Genome Feature Data
Data Type: genome_feature
Primary Key: feature_id
Attributes
-
aa_length
(integer)
- Number of amino-acid residues in the translated product of a CDS. Used in quality checks, alignment trimming, and size filters. 328
-
aa_sequence_md5
(string)
- 32-char MD5 hash for the amino-acid sequence; enables rapid duplicate detection without shipping the full sequence. "d41d8cd98f00b204e9800998ecf8427e"
-
accession
(string)
- GenBank/RefSeq accession of the replicon on which the feature resides. "NC_000913.3"
-
alt_locus_tag
(string)
- Historic or secondary locus tag from earlier annotations; improves cross-version mapping. "b0001_old"
-
annotation
(string)
- Annotation source label, such as PATRIC, RASTtk, or RefSeq; lets users compare pipelines. "PATRIC"
-
brc_id
(string)
- Internal monotonically increasing integer (string-typed) uniquely identifying the feature across clusters. "126547189"
-
classifier_round
(integer)
- Training round number used to generate the current classifier_score; supports auditing. 3
-
classifier_score
(number)
- Confidence (0–1) that the CDS is a true protein-coding gene according to BV-BRC’s machine-learning model. 0.97
-
codon_start
(integer)
- Value of 1, 2, or 3 passed to GenBank / tbl2asn to mark the first coding frame relative to the feature start. 1
-
date_inserted
(date)
- ISO-8601 UTC date when the feature row entered BV-BRC. "2023-04-12T15:41:22Z"
-
date_modified
(date)
- Timestamp updated whenever an attribute (location, product name, etc.) changes; drives incremental exports. "2025-05-03T07:18:51Z"
-
end
(integer)
- 1-based inclusive end coordinate on the replicon. 40365
-
feature_id
*
(string)
- Stable identifier (`fig|taxon.replicon.peg.N` for CDS; `.rna.`, `.repeat.`, etc. for other feature types).
-
feature_type
(string)
- Controlled values: CDS, rRNA, tRNA, misc_feature, repeat_region, etc. "CDS"
-
figfam_id
(string)
- Protein family ID from legacy FIGfam scheme; enables compatibility with older PATRIC tools. "FIG00012345"
-
gene
(case insensitive string)
- Short locus name (recA, rpoB). Case-insensitive for search. "recA"
-
gene_id
(number)
- Numeric GeneID from NCBI Gene when available. 947015
-
genome_id
(string)
- Foreign key linking to the genome metadata record. "511145.183"
-
genome_name
(case insensitive string)
- Genome name, denormalized from the genome record for quick display. "Escherichia coli K-12 MG1655"
-
go
(array of case insensitive strings)
- List of Gene Ontology identifiers assigned to the protein; supports functional enrichment. [ "GO:0003677", "GO:0006310" ]
-
location
(string)
- Concise PATRIC location format: contig_start+len or contig_start-len (strand encoded by the sign). "NC_000913.3_190+999"
-
na_length
(integer)
- Feature span in nucleotides. 999
-
na_sequence_md5
(string)
- Hash of the genomic DNA sequence underlying the feature; speeds up variant detection. "0cc175b9c0f1b6a831c399e269772661"
-
notes
(array of strings)
- Free-text remarks (e.g. “pseudogene fragment”, “frameshift at pos 678”). [ "frameshift at 675-677" ]
-
og_id
(string)
- Pan-domain orthologous group (e.g. eggNOG) used for cross-kingdom analyses. "COG0468"
-
owner
(string)
- BV-BRC user or group that controls write access. "patric_public"
-
p2_feature_id
(number)
- Numeric key from retired schema; retained for backward compatibility. 21987654
-
patric_id
(string)
- Historical alias identical to feature_id for CDS; kept to avoid breaking old APIs. `"fig
-
pdb_accession
(array of strings)
- List of matching PDB entries for the protein. [ "1A2B", "6VSB" ]
-
pgfam_id
(string)
- Global protein family ID (cross-genus, length-normalized clustering). "PGF_00001234"
-
plfam_id
(string)
- Local (within-genus) protein family ID; finer granularity than PGFam. "PLF_1234567"
-
product
(case insensitive string)
- Curated or predicted functional description; shown in browsers and BLAST. "DNA repair protein RecA"
-
property
(array of strings)
- Extra structured qualifiers (e.g. signal_peptide, transmembrane). [ "transmembrane", "lipoprotein" ]
-
protein_id
(string)
- INSDC / RefSeq protein accession for the translated product. "WP_000011355.1"
-
public
(boolean)
- true if the feature is visible to all users; false for private genomes. true
-
refseq_locus_tag
(string)
- Official NCBI locus tag for RefSeq annotation sets. "ECOLI_RS00001"
-
segments
(array of strings)
- For spliced genes: array of start..end regions; used by GFF exporters. [ "190..350", "500..1188" ]
-
sequence_id
(string)
- Links to genome_sequence.sequence_id (chromosome, plasmid, or viral segment). "NC_000913.3"
-
sog_id
(string)
- Finer split of og_id based on bidirectional best hits; helps detect recent duplications. "SOG_987654"
-
start
(integer)
- 1-based inclusive start coordinate. 190
-
strand
(string)
- "+" or "-"; required for translation and plotting. "+"
-
taxon_id
(integer)
- Taxon of the parent genome; used for taxon-restricted searches. 562
-
uniprotkb_accession
(string)
- Primary accession mapping to UniProt for functional & structural metadata. "P0A7V8"
-
user_read
(array of strings)
- BV-BRC user/org IDs allowed to view the record (includes "public" for open data). [ "public" ]
-
user_write
(array of strings)
- User/org IDs permitted to edit the record. [ "maulik@bvbrc.org" ]
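Several of the attributes above are related by simple arithmetic, which a short Python sketch can make concrete. It assumes that na_length equals end - start + 1 (both coordinates being 1-based and inclusive), that location is composed as contig_start+len or contig_start-len from the start value as stored, and that the MD5 fields hash the sequence text exactly as given. The helper names are hypothetical, and the hashing convention BV-BRC actually uses (case, whitespace handling) is not stated in this page.

```python
import hashlib
from collections import defaultdict


def span_length(start: int, end: int) -> int:
    """Length in nucleotides of a feature with 1-based inclusive coordinates."""
    return end - start + 1


def compact_location(contig: str, start: int, length: int, strand: str) -> str:
    """Compose the contig_start+len / contig_start-len form described for `location`.

    Assumption: `start` is used as stored, for both strands.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return f"{contig}_{start}{strand}{length}"


def sequence_md5(sequence: str) -> str:
    """32-character hex MD5 of a sequence string.

    Assumption: the text is hashed as-is; any normalization is not documented here.
    """
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def group_duplicates(features: list[dict]) -> dict[str, list[str]]:
    """Group feature_ids that share an aa_sequence_md5, keeping groups of 2 or more."""
    groups: dict[str, list[str]] = defaultdict(list)
    for feature in features:
        groups[feature["aa_sequence_md5"]].append(feature["feature_id"])
    return {md5: ids for md5, ids in groups.items() if len(ids) > 1}
```

With the example values listed above (start 190, end 1188), span_length gives 999, which matches the na_length example.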
API
GET :feature_id
Retrieve a genome_feature data object by feature_id
EXAMPLE
https://www.bv-brc.org/api/genome_feature/RefSeq.1001732.3.AKUQ01000008.CDS.655540.656001.rev
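The lookup above can be sketched in Python using only the standard library. The function names are hypothetical; percent-encoding the identifier and the assumption that the response body is a single JSON object are both guesses about server behaviour rather than documented facts.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "https://www.bv-brc.org/api/genome_feature"


def feature_url(feature_id: str) -> str:
    """Build the GET URL for one feature.

    Assumption: characters outside letters, digits and '_.-~' are percent-encoded.
    """
    return f"{BASE_URL}/{quote(feature_id, safe='')}"


def fetch_feature(feature_id: str, timeout: float = 30.0) -> dict:
    """Fetch one genome_feature object as JSON (performs a network call)."""
    request = Request(feature_url(feature_id), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling feature_url with the identifier from the example reproduces the example URL; fetch_feature is the only part that touches the network.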
QUERY :query
Query for genome_feature data objects with an RQL Query
Return Formats
Requests may include an HTTP Accept header from this list to transform the data into the requested type.
-
application/json - Returns results as an array of JSON objects
-
application/solr+json - Returns results in Solr JSON response format
-
text/csv - Returns results in comma-separated values (CSV) format. Columns are separated by ','. Multi-value columns are separated by ';'. Rows are separated by newlines
-
text/tsv - Returns results in tab-separated values (TSV) format. Columns are separated by a tab. Multi-value columns are separated by ';'. Rows are separated by newlines
-
application/vnd.openxmlformats - Returns objects as an MS Excel document
-
application/dna+fasta - Returns DNA sequences for queries in FASTA format
-
application/protein+fasta - Returns Protein sequences for queries in FASTA format
-
application/dna+jsonh+fasta - Returns DNA sequences for queries in JSONH-FASTA format
-
application/protein+jsonh+fasta - Returns Protein sequences for queries in JSONH-FASTA format
-
application/gff - Returns genomic features in GFF format
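As a rough illustration of the return formats listed above, the Python sketch below selects a media type for the Accept header and splits ';'-separated multi-value cells in a CSV response. The short format keys, the helper names, and the assumption that the CSV body starts with a header row are all hypothetical.

```python
import csv
import io
from urllib.request import Request

# Media types copied from the list above; the short keys are made up for this sketch.
RETURN_FORMATS = {
    "json": "application/json",
    "solr": "application/solr+json",
    "csv": "text/csv",
    "tsv": "text/tsv",
    "excel": "application/vnd.openxmlformats",
    "dna_fasta": "application/dna+fasta",
    "protein_fasta": "application/protein+fasta",
    "gff": "application/gff",
}


def request_with_format(url: str, fmt: str) -> Request:
    """Build a request whose Accept header asks for the chosen return format."""
    if fmt not in RETURN_FORMATS:
        raise ValueError(f"unknown format {fmt!r}")
    return Request(url, headers={"Accept": RETURN_FORMATS[fmt]})


def parse_csv(text: str, multi_value_columns: tuple[str, ...] = ()) -> list[dict]:
    """Parse CSV text and split the named multi-value columns on ';'.

    Assumption: the first row holds column names.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in multi_value_columns:
            cell = row.get(column) or ""
            row[column] = cell.split(";") if cell else []
        rows.append(row)
    return rows
```

Building the Request does not send anything; pass it to urllib.request.urlopen to perform the call.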
EXAMPLES
- Query for genome_feature data objects with a feature_id equal to RefSeq.1001732.3.AKUQ01000008.CDS.655540.656001.rev. Return results as a JSON Array.
https://www.bv-brc.org/api/genome_feature/?eq(feature_id,RefSeq.1001732.3.AKUQ01000008.CDS.655540.656001.rev)
- Query for genome features for genome 90370.851, limited to 5 results. Return JSON data.
https://www.bv-brc.org/api/genome_feature/?eq(genome_id,90370.851)&limit(5)
- Query for genome features for genome 90370.851 with PATRIC annotation, limited to 5 sequences. Return DNA FASTA.
https://www.bv-brc.org/api/genome_feature/?and(eq(annotation,PATRIC),eq(genome_id,90370.851))&limit(5)&http_accept=application/dna+fasta