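Pidgin is a suite of tools that evaluate and automatically assign gene product names. There are currently three main components:
- Pidgin cleanup standardizes the format of gene product names.
- Pidgin compare estimates the degree of similarity between two or more protein names.
- Pidgin select generates gene product names from alignments to proteins in curated libraries.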
Pidgin is developed and maintained by engineers and biologists at the Broad Institute.
The project's home page is genepidgin.sourceforge.net, and the most recent copy can always be downloaded at the project's landing page.
Suggestions are welcome; we can be reached at pidgin-support at broadinstitute dot org.
Later on in this document we have a dedicated section for project-wide Technical Details. This section includes system requirements, installation instructions, package contents, how to verify our work on your system using an automated test framework, and license information.
Pidgin cleanup standardizes the format of gene product names derived from diverse databases, including FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM. It's the product of many years of production genome annotation, and continues to be the focus of development and refinement.
This software package consists of a large collection of heuristics, formatting rules and regular expressions which are designed to take a name from any of Pidgin's supported databases and present it in a common style. Though our regexp library is large, it is not infinite; thus, Pidgin cleanup cannot detect every possible name error. However, the vast majority of source names end up better and more informative for having gone through Pidgin cleanup.
The following list is a rough description of the steps involved in processing a name. This list is not a literal description of the layout of the code, but rather a high-level overview of how Pidgin cleanup works.
All files used as input and output are in the Simple Name File Format.
There are two closely related command-line interfaces: filter and cleanup.
filter takes a name and applies the full list of filters to it. A name can be filtered to an empty string by this function; the output of the command will tell you why. Names that are filtered to nothing are ones Pidgin considers to be uninformative.
$ python pidgin.py filter <inputfile>
cleanup applies the same filters, but respects a small set of generic names ("predicted protein", "hypothetical protein"). A gene is named "hypothetical protein" when it has high-scoring hits with unreliable names. A gene is named "predicted protein" when it has no HMMER or BLAST evidence that meets our minimum alignment thresholds.
$ python pidgin.py cleanup <inputfile>
Pidgin filter and Pidgin cleanup differ only in cases where there is insufficient evidence for a high-confidence name. In those cases, Pidgin filter will return an empty string, while Pidgin cleanup will generate a placeholder name indicating that no function is known for that protein.
Example data and usage of Pidgin cleanup are provided in the sample_data/ directory. Run
$ sh runMe.sh
to demonstrate the software.
If you don't want to see the output, there's a silent flag.
$ python pidgin.py command [-s|--silent] <inputfile> [outputfile]
command: Either cleanup or filter. See above for details.
--silent: No output to stdout.
inputfile: Source of the names to work on.
outputfile: If supplied, filtered names go here. If not, the filtered names from inputfile.txt go into a file called inputfile_bioname.txt.
From inside your python shell, let's set up your first test case.
>>> import pidgin.cleaner
>>> bname = pidgin.cleaner.BioName()
>>> name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"
Instantiating BioName compiles a hundred or so regular expressions, so creating a new BioName object for every name to be changed can get expensive. A single BioName object can reformat any number of names, so callers need only instantiate the class once.
This name contains a great deal of spurious and unreliable information. A quick cleanup of this name...
>>> cleaned = bname.cleanup(name)
>>> print cleaned
"glycine/betaine/L-proline ABC transporter"
To see what happened during the filter process, we set getOutput to true when we call cleanup. Note the additional returned value.
>>> (cleaned, processString) = bname.cleanup(name, getOutput=1)
>>> print cleaned
"glycine/betaine/L-proline ABC transporter"
>>> print processString
filtered name in 5 steps:
0) original: BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
1) reason: transport protein -> transporter
   pattern: \btransport(er)?\s+protein\b
   filtered: BT002689 glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
2) reason: id
   pattern: \b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b(?!\s*(kD(a)?|-like|family|protein\s+family))
   filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
3) reason: delete spaces at beginning of name
   pattern: ^\s+
   filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
4) reason: delete closing brackets at end of name
   pattern: (?:\[[^]]*)\]\s*$
   filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein
5) reason: delete notes after commas, dashes, semicolon--except when followed by family or superfamily
   pattern: [-,;]\s+(?!family)(?!superfamily).*
   filtered: glycine/betaine/L-proline ABC transporter
(Note that processString is a single multiline string, which looks good when print'ed but bad when simply exported.)
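The trace above can be imitated with a small rule pipeline. The sketch below is not Pidgin's implementation: the class name TraceCleaner and its rule list are hypothetical, and it contains only the five patterns that appear in the trace, in the order they appear there. It compiles the patterns once, applies them in order, and records each rule that changed the name.

```python
import re


class TraceCleaner(object):
    """Hypothetical stand-in for BioName with a handful of ordered rules."""

    # (reason, pattern, replacement); patterns copied from the trace above.
    RULES = [
        ("transport protein -> transporter",
         r"\btransport(er)?\s+protein\b", "transporter"),
        ("id",
         r"\b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b"
         r"(?!\s*(kD(a)?|-like|family|protein\s+family))", ""),
        ("delete spaces at beginning of name", r"^\s+", ""),
        ("delete closing brackets at end of name", r"(?:\[[^]]*)\]\s*$", ""),
        ("delete notes after commas, dashes, semicolon",
         r"[-,;]\s+(?!family)(?!superfamily).*", ""),
    ]

    def __init__(self):
        # Compile once; the same object can then clean any number of names.
        self.rules = [(why, re.compile(pattern), replacement)
                      for why, pattern, replacement in self.RULES]

    def cleanup(self, name):
        """Return (cleaned name, list of (reason, pattern, result) steps)."""
        steps = []
        for why, regex, replacement in self.rules:
            changed = regex.sub(replacement, name)
            if changed != name:
                steps.append((why, regex.pattern, changed))
                name = changed
        return name, steps


cleaner = TraceCleaner()
cleaned, steps = cleaner.cleanup(
    "BT002689 glycine/betaine/L-proline ABC transport protein, "
    "periplasmic-binding protein [Desulfovibrio desulfuricans subsp. "
    "desulfuricans str. G20]")
print(cleaned)
for number, (why, pattern, result) in enumerate(steps, 1):
    print("%d) reason: %s" % (number, why))
```

Recording only the rules that actually changed the name is one way to keep such a trace short; whether BioName does the same internally is not stated here.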
Reference the documentation in the code for more information on parameters. It's fairly well documented.
From inside your python shell or other python code, you can call cleanup or filter directly. They act as described in the command-line interface section above.
For example:
>>> name = "gi|125654608|ref|YP_001033802.1| ParB-like nuclease"
>>> import pidgin.cleaner
>>> bname = pidgin.cleaner.BioName()
>>> print bname.cleanup(name)
"ParB-like nuclease"
Many people have contributed to the name cleaning logic, including: Lucia Alvarado-Balderrama¹, Sinead Chapman¹, Zehua Chen¹, Jonathan Goldberg¹, Sharvari Gujja¹, Clint Howarth¹, Chinnappa Kodira², Teena Mehta¹, Matthew Pearson¹, Narmada Shenoy¹, Tom Walk¹, Chandri Yandava¹, Qiandong Zeng¹, and the Autoannotate development team³.
¹ Broad Institute
² 454 Life Sciences
³ J. Craig Venter Institute
pidgin.cleaner.BioName
This project began life as BioName. It turns out that there already is a project named BioName. Though this BioName addresses a completely different problem, our goal is to help reduce name-related confusion. Thus we decided to change the name of our software toolkit to Pidgin. We retain the term BioName as an internal class name for source compatibility.
We are aware that there is an IM chat client called Pidgin. Rather than rename our software toolkit yet again, we would simply like to take this opportunity to point out that naming is a challenging problem on many levels. We apologize for any confusion.
Pidgin compare uses a combination of edit distance and longest-common-substring calculations to estimate the degree of similarity between two or more protein names.
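To compare two names, we split each name into tokens, remove noninformative tokens, and compute a distance from the best pairing of the remaining tokens.
In more detail: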
First, we split the names up by spaces, remove EC numbers and punctuation and other sorts of extra characters, convert everything to lowercase, etc.
in: "Ribosomal protein, S23-type"
out: "ribosomal" · "protein" · "s23-type"
In this step we strike out words that are only useful in a grammatical sense, including an, and, in, is, of, the, etc. We also remove weasel words, such as generic, hypothetical, related, etc. Finally, we remove glue words, such as associated, class, component, protein, system, and type. When these words are stripped we are left with a "core" name that identifies the protein; different namers may use different glue words to format the core name and we ignore those.
in: "ribosomal" · "protein" · "s23-type"
out: "ribosomal" · "s23"
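The two steps above can be sketched as follows. This is an illustration, not Pidgin's tokenizer: the word lists contain only the examples named in the text, the EC-number pattern is an assumed shape, and splitting hyphenated tokens so that "s23-type" reduces to "s23" is a guess at how the documented output is reached.

```python
import re

# Illustrative word lists only; the entries are the examples given in the text.
GRAMMAR_WORDS = {"an", "and", "in", "is", "of", "the"}
WEASEL_WORDS = {"generic", "hypothetical", "related"}
GLUE_WORDS = {"associated", "class", "component", "protein", "system", "type"}
NOISE = GRAMMAR_WORDS | WEASEL_WORDS | GLUE_WORDS

# Assumed shape of an EC number, e.g. "EC 1.1.1.1" or "(EC 2.7.-.-)".
EC_NUMBER = re.compile(r"\(?\bEC[:\s]*\d+(?:\.(?:\d+|-)){1,3}\)?")


def tokenize(name):
    """Drop EC numbers, split on whitespace, trim punctuation, lowercase."""
    name = EC_NUMBER.sub(" ", name)
    tokens = (word.strip(".,;:()[]\"'").lower() for word in name.split())
    return [token for token in tokens if token]


def core_tokens(tokens):
    """Remove noise words, including noise parts of hyphenated tokens."""
    core = []
    for token in tokens:
        parts = [part for part in token.split("-")
                 if part and part not in NOISE]
        if parts:
            core.append("-".join(parts))
    return core


tokens = tokenize("Ribosomal protein, S23-type")
print(tokens)
print(core_tokens(tokens))
```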
Because we strip out noninformative tokens, we count all of the following strings as equal.
Finding the best edit distance between two names of, say, 4 tokens each is a bit tricky, because it's possible that the lowest cumulative edit distance will involve one or more sub-optimal individual token matches. In fact there are cases where the lowest distance is composed entirely of sub-optimal token pairings. So we need to try a lot of combinations. To do this we precompute two scores for each pair of tokens, and build two n × n matrices to hold them. We then score all possible paths with distinct pairwise token pairings via these matrices. For each path we combine two scores: we try to minimize the normalized edit distance between token pairs, and we try to maximize the length of the longest pairwise common substrings between pairs of tokens.
In one matrix, we store the pairwise token-token edit distance, using the Damerau-Levenshtein distance and leveraging the excellent Python implementation by Michael Homer. We normalize the edit distance by dividing it by the number of characters in the longer token. The other n × n matrix holds the length of the longest common substring between each pair of tokens. Our LCS finder is similar to the one published on Wikipedia.
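For readers who want to experiment with the two per-pair scores, here is a minimal sketch. It is not the implementation Pidgin uses: the edit distance below is the optimal-string-alignment variant of Damerau-Levenshtein (adjacent transpositions count as one edit), and all function names are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer token."""
    longer = max(len(a), len(b))
    return osa_distance(a, b) / float(longer) if longer else 0.0


def lcs_length(a, b):
    """Length of the longest common substring of a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, 1):
            current.append(previous[j - 1] + 1 if char_a == char_b else 0)
            best = max(best, current[j])
        previous = current
    return best


print(normalized_distance("kinase", "kinsae"))
print(lcs_length("transporter", "transport"))
```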
In the case where the protein names have different numbers of tokens, we build square matrices from the largest dimension, padding the shorter dimension with empty tokens. There also are heuristics to handle cases where a token in one name is composed of two or more tokens in the other. The special handling for these special cases is too detailed for this document; see the source or contact the authors for details.
Note that token order has no effect on the distance between two names.
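The search over token pairings, including the padding of the shorter name, can be sketched as a brute-force walk over permutations. This is a simplified illustration under stated assumptions: it minimizes a single per-pair cost rather than combining the two scores described above, the helper names are hypothetical, and the cost function used in the example (0 for identical tokens, 1 otherwise) is a stand-in for the real per-pair scores.

```python
from itertools import permutations


def pad(tokens, size):
    """Pad a token list with empty tokens up to the given size."""
    return list(tokens) + [""] * (size - len(tokens))


def best_pairing(ref_tokens, query_tokens, cost):
    """Try every one-to-one pairing and keep the lowest total cost."""
    n = max(len(ref_tokens), len(query_tokens))
    ref, query = pad(ref_tokens, n), pad(query_tokens, n)
    # Precompute the n x n matrix of per-pair costs.
    matrix = [[cost(r, q) for q in query] for r in ref]
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(matrix[i][j] for i, j in enumerate(perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    pairs = [(ref[i], query[j]) for i, j in enumerate(best_perm)]
    return best_total, pairs


def mismatch(a, b):
    """Stand-in cost: 0.0 for identical tokens, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


print(best_pairing(["ribosomal", "s23"], ["s23", "ribosomal"], mismatch))
print(best_pairing(["abc", "transporter", "permease"],
                   ["transporter", "abc"], mismatch))
```

Because every pairing is scored, reordering the tokens of either name leaves the best total unchanged. The number of permutations grows factorially, so this form is only practical for short names.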
A perfect token-token match is really good. A lot of perfect matches are really, really good. Long common substrings are fairly good. The Damerau-Levenshtein distance can return higher distances than we might like for these three types of token matches. On the other hand, maximizing the length of the longest common substring(s) has its own set of problems. After a great deal of trial and error, we have settled on the following equation, which has worked well on genome-scale scoring studies across a variety of prokaryotes.
"Pidgin" distance = SUM(per-token normalized edit distance)
                    * (1 - (SUM(per-token LCS length) / LENGTH(longer name)))
                    * (1 / COUNT(compared tokens))
The first line of this distance metric weights each pair of tokens equally. Thus a "SecG" · "SecG" match counts just as much as a "phosphoribosylglycinamide" · "phosphoribosylglycinamide" match.
The second line of the metric weights each character equally, thereby lowering the distances between long tokens that differ only slightly, for example
2,3,4,5-tetrahydropyridine-2,6-dicarboxylate
2,3,4,5-tetrahydropyridine-2-carboxylate
The third line of the distance metric above simply normalizes the score from 0 to 1. A distance of 0 indicates the names have identical information content and are essentially equivalent. A distance of 1 indicates the names have nothing in common.
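As a worked example of the equation, the sketch below evaluates it for one token pairing whose per-pair scores are already known. The function name is hypothetical, and it assumes that LENGTH(longer name) is a character count supplied by the caller; the text does not say exactly how that length is measured.

```python
def pidgin_distance(edit_distances, lcs_lengths, longer_name_length):
    """Combine per-pair scores for one token pairing into a single distance.

    edit_distances: normalized edit distance of each compared token pair
    lcs_lengths: longest-common-substring length of each compared token pair
    longer_name_length: length of the longer name (assumed: in characters)
    """
    if not edit_distances:
        return 0.0  # assumption: nothing to compare means no difference
    token_term = sum(edit_distances)                                      # line 1
    character_term = 1.0 - sum(lcs_lengths) / float(longer_name_length)   # line 2
    normalizer = 1.0 / len(edit_distances)                                # line 3
    return token_term * character_term * normalizer


# Two perfectly matching token pairs: distance 0.
print(pidgin_distance([0.0, 0.0], [9, 3], 12))
# Two pairs with nothing in common: distance 1.
print(pidgin_distance([1.0, 1.0], [0, 0], 12))
# One perfect pair and one half-matching pair.
print(pidgin_distance([0.0, 0.5], [4, 2], 8))
```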
Given at least two input files, one reference and one or more queries, score the distance (using pidgin.distance.DistanceTool()) between the names found in the files.
pidgin compare (options) <reference_file> <query_file> [<query_file2> ...]
options:
    --help: this information
All input files must be in the Simple Name File Format.
This tool will create one output file per query file. The per-query output file(s) will have name(s) of the form <query_file>.compared.
If there are multiple query files, a summary file containing the closest query match for each reference name will also be created. The summary file will be named <reference_file>.summary.
Each line in the two-way comparison result will consist of the following tab-separated fields:
0. ID. This is the string from the first field of the entry from the reference file.
1. Score. The distance between the two names.
2. Reference name. The reference name used for the comparison.
3. Query name. The query name used for the comparison.
If a summary file is generated, each line in that file will consist of the following tab-separated fields:
0. ID. This is the string from the first field of the entry from the reference file.
1. Score. The distance between the two names.
2. Reference name. The reference name used for the comparison.
3. Best query name. The best matching query name. In cases where multiple query names scored identically, the first name with that score will appear here. (This will typically only happen for completely dissimilar names.)
4. Best query source. The basename of the file which held the best query name (ex: query_file1). In cases where multiple query names scored identically, multiple basenames will be present in this column, separated by semicolons (ex: query_file1;query_file2).
Results are presented in the same order as in the input reference file. Names in query files that correspond to an ID not present in the reference file will be ignored. Names in the reference file with no corresponding query are scored as a complete miss (1.0). Input query and reference files may reside in any directory, but no two files may have the same basename.
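The summary rules above (best score wins, ties keep the first name and list every tied file, missing queries score 1.0) can be sketched as below. The function and its input layout are hypothetical and do not reflect how pidgin compare stores its results; the reference-name column is left out for brevity.

```python
def summarize(reference_ids, query_results):
    """Build summary rows of (id, score, best query name, best sources).

    query_results is a list of (basename, {id: (score, query_name)}) pairs,
    in the order the query files were given.
    """
    summary = []
    for ref_id in reference_ids:
        # A reference name with no query at all is a complete miss.
        best_score, best_name, best_files = 1.0, "", []
        for basename, scores in query_results:
            hit = scores.get(ref_id)
            if hit is None:
                continue
            score, query_name = hit
            if not best_files or score < best_score:
                best_score, best_name, best_files = score, query_name, [basename]
            elif score == best_score:
                best_files.append(basename)  # tie: keep first name, add file
        summary.append((ref_id, best_score, best_name, ";".join(best_files)))
    return summary


queries = [
    ("query_file1", {"g1": (0.2, "abc transporter"), "g2": (1.0, "kinase")}),
    ("query_file2", {"g1": (0.0, "ABC transporter"), "g2": (1.0, "ligase")}),
]
for row in summarize(["g1", "g2", "g3"], queries):
    print("\t".join(str(field) for field in row))
```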
The distribution in accuracy is not linear between 0.0 and 1.0; that is, after a certain level of dissimilarity it doesn't matter how much more dissimilar two names are.
The following table presents a quick guide to the interpretation of distance scores.
score | likelihood of functional match |
---|---|
=0.0 | functionally identical |
0.0 - 0.1 | excellent match |
0.1 - 0.3 | good match |
0.3 - 0.5 | possibly similar, with potentially significant distances |
0.5 - 1.0 | not generally useful |
=1.0 | completely different |
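The table can be turned into a small lookup helper. The function name is hypothetical, and because the table's ranges share their endpoints, this sketch assumes that a boundary value such as 0.1 belongs to the better (lower) band.

```python
def interpret(score):
    """Map a Pidgin distance to the interpretation given in the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("distance must lie between 0.0 and 1.0")
    if score == 0.0:
        return "functionally identical"
    if score == 1.0:
        return "completely different"
    if score <= 0.1:
        return "excellent match"
    if score <= 0.3:
        return "good match"
    if score <= 0.5:
        return "possibly similar, with potentially significant distances"
    return "not generally useful"


for value in (0.0, 0.05, 0.25, 0.4, 0.8, 1.0):
    print("%.2f  %s" % (value, interpret(value)))
```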
There is support for using the output of Pidgin compare directly within Python; consult pidgin/scorer.py for details.
Pidgin select generates gene product names from alignments to proteins in curated libraries (currently FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM). Blast and hmmer alignments from those libraries are read into Pidgin via simple data formats (.pidginb and .pidginh, respectively), where they are sifted through to find the best name.
Sort qualifying sources, preferring: hmmer alignments to blast alignments, a lower e-value in hmmer hits, and a higher percent identity in blast hits. Walk through the sorted list until we find a name that remains informative after running through Pidgin cleanup.
Group all evidence by dest_id and consider each dest_id independently.
Over the course of this search, if a name filters to something uninformative (via Pidgin cleanup), then examine the next relevant source, until either a valid source and name are found, or no sources remain and the name "hypothetical protein" or "predicted protein" is assigned.
Start by examining the hmmer hits. Remove hits that are neither TIGRFAM equivalogs nor Pfam hits labeled as equivalog-equivalents by JCVI. Next, remove hits whose score is less than its family_trusted_cutoff (see .pidginh). Take the name of the hit with the lowest e-value. If multiple hits have equivalent e-values, select the hit with the highest bit score.
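The hmmer selection rule can be sketched as a filter followed by a two-part sort key. The function name and the dict layout are hypothetical. The equivalog filter is left out because the .pidginh fields described in this document do not carry that label, and the later check that the chosen name survives Pidgin cleanup is also omitted.

```python
def pick_hmmer_hit(hits):
    """Pick the hit used for naming, or None if no hit is trusted.

    hits: iterable of dicts with the keys score, family_trusted_cutoff,
    e_value and raw_name (field names borrowed from the .pidginh format).
    """
    trusted = [hit for hit in hits
               if hit["score"] >= hit["family_trusted_cutoff"]]
    if not trusted:
        return None
    # Lowest e-value first; among equal e-values, highest bit score first.
    return min(trusted, key=lambda hit: (hit["e_value"], -hit["score"]))


example_hits = [
    {"raw_name": "a", "score": 50.0, "family_trusted_cutoff": 80.0, "e_value": 1e-40},
    {"raw_name": "b", "score": 90.0, "family_trusted_cutoff": 80.0, "e_value": 1e-20},
    {"raw_name": "c", "score": 95.0, "family_trusted_cutoff": 80.0, "e_value": 1e-20},
    {"raw_name": "d", "score": 99.0, "family_trusted_cutoff": 80.0, "e_value": 1e-10},
]
print(pick_hmmer_hit(example_hits)["raw_name"])
```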
If a dest_id has no hmmer hits deemed suitable for naming, examine the blast evidence, calculating the following terms:
source_coverage = (source_stop - source_start + 1) / source_len
dest_coverage = (dest_stop - dest_start + 1) / dest_len
min_coverage = min(source_coverage, dest_coverage)
source_pct_identity = num_identities / source_len
dest_pct_identity = num_identities / dest_len
min_pct_identity = min(source_pct_identity, dest_pct_identity)
upper_pct_identity = max(min_pct_identity for all hits whose min_coverage ≥ 0.6)
lower_pct_identity = max(0.5, upper_pct_identity - 0.05)
Cluster all hits associated with dest_id that have min_coverage ≥ 0.6 and whose min_pct_identity is between upper_pct_identity and lower_pct_identity (inclusive). If upper_pct_identity < lower_pct_identity, ignore all hits.
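The coverage and identity terms, and the clustering rule built on them, can be written out directly. The sketch below uses the field names from the .pidginb format; the BlastHit tuple, the function names, and the example numbers are hypothetical.

```python
from collections import namedtuple

BlastHit = namedtuple(
    "BlastHit",
    "dest_start dest_stop dest_len source_start source_stop source_len "
    "num_identities")


def min_coverage(hit):
    source_coverage = (hit.source_stop - hit.source_start + 1) / float(hit.source_len)
    dest_coverage = (hit.dest_stop - hit.dest_start + 1) / float(hit.dest_len)
    return min(source_coverage, dest_coverage)


def min_pct_identity(hit):
    return min(hit.num_identities / float(hit.source_len),
               hit.num_identities / float(hit.dest_len))


def cluster_hits(hits, coverage_cutoff=0.6, floor=0.5, window=0.05):
    """Return the hits inside the percent identity window for one dest_id."""
    covered = [hit for hit in hits if min_coverage(hit) >= coverage_cutoff]
    if not covered:
        return []
    upper = max(min_pct_identity(hit) for hit in covered)
    lower = max(floor, upper - window)
    if upper < lower:
        return []  # even the best hit is below the 0.5 floor
    return [hit for hit in covered if lower <= min_pct_identity(hit) <= upper]


hit_a = BlastHit(1, 100, 100, 1, 100, 100, 95)  # identity 0.95
hit_b = BlastHit(1, 100, 100, 1, 100, 100, 92)  # identity 0.92, inside window
hit_c = BlastHit(1, 100, 100, 1, 100, 100, 80)  # identity 0.80, outside window
hit_d = BlastHit(1, 50, 100, 1, 50, 100, 50)    # coverage 0.5, dropped
print(cluster_hits([hit_a, hit_b, hit_c, hit_d]))
```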
If the cluster is not empty, and any of the hits in the cluster has a source_auth (see .pidginb) of KEGG, then select the name from the one with the highest min_pct_identity. If there are no hits from KEGG, proceed to SwissProt hits, then FIGfam and finally RefSeq, searching in each bin for the hit with the highest min_pct_identity within that bin.
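The source preference just described amounts to walking the authorities in a fixed order and taking the best hit from the first non-empty bin. A minimal sketch, with a hypothetical function name and a simplified tuple layout; it leaves out the step where a name that filters to nothing sends the search on to the next source.

```python
SOURCE_PRIORITY = ["KEGG", "SwissProt", "FIGfam", "RefSeq"]


def pick_blast_hit(cluster):
    """cluster: list of (source_auth, min_pct_identity, raw_name) tuples."""
    for authority in SOURCE_PRIORITY:
        in_bin = [hit for hit in cluster if hit[0] == authority]
        if in_bin:
            # Highest min_pct_identity within the preferred bin wins.
            return max(in_bin, key=lambda hit: hit[1])
    return None


example_cluster = [
    ("RefSeq", 0.99, "putative kinase"),
    ("SwissProt", 0.95, "serine/threonine kinase"),
    ("SwissProt", 0.97, "protein kinase"),
]
print(pick_blast_hit(example_cluster))
```

In this example the RefSeq hit has the highest identity overall, but the SwissProt bin is consulted first, so the 0.97 SwissProt hit is chosen.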
Given a series of data files, use the selection recipe described above to determine product names for the given genes.
pidgin select (options) [inputfiles]
options:
    -o --output : where to save files, defaults to ./pidgin_names.txt
    -e --etymology : where to save etymology (debug), defaults to ./pidgin_etymology.txt
    -h --help : this information
The formats of the input and output files are described below.
Any number of input files in either of the following two formats is permitted. The ordering of the files, and the ordering of the lines within the files, does not matter. No tabs, newlines, or control characters are permitted within any of these fields.
.pidginb file format
All files with the extension .pidginb are assumed to contain BLAST alignments.
Each line in a .pidginb file will consist of the following tab-separated fields:
0. dest_id STRING an identifier for a destination protein (i.e., a protein that should receive a name)
1. dest_start INTEGER 1-based index of first aligned amino acid in destination protein
2. dest_stop INTEGER 1-based index of last aligned amino acid in destination protein
3. dest_len INTEGER number of amino acids in destination protein
4. source_id STRING an identifier for a source protein (i.e., a protein whose name should be considered for assignment to the destination protein)
5. source_start INTEGER 1-based index of first aligned amino acid in source protein
6. source_stop INTEGER 1-based index of last aligned amino acid in source protein
7. source_len INTEGER number of amino acids in source protein
8. source_auth STRING the source of the data, used for heuristic processing, must be one of:
   - "FIGfam"
   - "KEGG"
   - "RefSeq"
   - "SwissProt"
9. num_identities INTEGER number of exact amino acid matches in alignment
10. num_similarities INTEGER number of similar amino acid matches in alignment
11. raw_name STRING the name of the source protein
12. comment STRING can be used for any purpose
A sample line:
7000002454063496 134 581 448 7000000120703332 127 596 470 FIGfam 151 227 FIG029094-5 IncW plasmid conjugative protein TrwB (TraD homolog)
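A line in this format can be read with a plain tab split. The sketch below is a hypothetical reader, not part of Pidgin. The sample string reuses the values of the line above re-joined with tabs; where the tabs fall, and the empty comment field, are assumptions, since the tabs are not visible in this document.

```python
from collections import namedtuple

PIDGINB_FIELDS = [
    "dest_id", "dest_start", "dest_stop", "dest_len",
    "source_id", "source_start", "source_stop", "source_len",
    "source_auth", "num_identities", "num_similarities",
    "raw_name", "comment",
]
INTEGER_FIELDS = {
    "dest_start", "dest_stop", "dest_len",
    "source_start", "source_stop", "source_len",
    "num_identities", "num_similarities",
}
SOURCE_AUTHORITIES = {"FIGfam", "KEGG", "RefSeq", "SwissProt"}

BlastRecord = namedtuple("BlastRecord", PIDGINB_FIELDS)


def parse_pidginb_line(line):
    """Split one .pidginb line into a BlastRecord with integer fields."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(PIDGINB_FIELDS):
        raise ValueError("expected %d fields, found %d"
                         % (len(PIDGINB_FIELDS), len(values)))
    converted = [int(value) if field in INTEGER_FIELDS else value
                 for field, value in zip(PIDGINB_FIELDS, values)]
    record = BlastRecord(*converted)
    if record.source_auth not in SOURCE_AUTHORITIES:
        raise ValueError("unknown source_auth: %r" % record.source_auth)
    return record


# Assumed tab placement for the sample line; the comment field is left empty.
SAMPLE = "\t".join([
    "7000002454063496", "134", "581", "448",
    "7000000120703332", "127", "596", "470",
    "FIGfam", "151", "227",
    "FIG029094-5 IncW plasmid conjugative protein TrwB (TraD homolog)", "",
])
record = parse_pidginb_line(SAMPLE)
print(record.dest_id, record.source_auth, record.num_identities)
```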
.pidginh file format
All files with the extension .pidginh are assumed to contain HMMER alignments.
Note: per-domain scores are ignored; we consider the whole hit only.
Each line in a .pidginh file will consist of the following tab-separated fields:
0. dest_id STRING an identifier for a destination protein (i.e., a protein that should receive a name)
1. dest_start INTEGER 1-based index of first aligned amino acid in destination protein
2. dest_stop INTEGER 1-based index of last aligned amino acid in destination protein
3. dest_len INTEGER number of amino acids in destination protein
4. source_id STRING an identifier for a source family (i.e., a profile whose name should be considered for assignment to the destination protein); currently should be a TIGRFAM or Pfam id
5. source_start INTEGER 1-based index of first aligned position in source family
6. source_stop INTEGER 1-based index of last aligned position in source family
7. source_len INTEGER number of positions in source family
8. score FLOAT score reported by hmmer
9. family_trusted_cutoff FLOAT
10. e_value FLOAT+INTEGER in the format X.XXeY where X.XX is a positive float and Y is an integer
11. raw_name STRING the name of the source family
12. comment STRING can be used for any purpose
A sample line:
7000002454071269 3 140 138 TIGRfam 13 155 143 83.519997 80.000000 -21.585027 ribosomal-protein-alanine acetyltransferase
The names of these files are governed by the option usage, as described above.
Each line of the name file has four columns:
0. dest_id STRING an identifier for a destination protein (i.e., a protein that should receive a name)
1. name STRING the best available name for the destination protein
2. source_id STRING the id of the blast or hmmer hit used to name this protein
3. comment STRING the comment field from the line used to name this protein
A snippet from a names.txt from a development run:
7000002454076078 fructose-1-6-bisphosphatase FIGfam run on library updated 2009/10/22
7000002454076081 hypothetical protein (blank) (blank)
Note that hypothetical proteins don't have the final two fields, as they did not pick up a name from the given sources.
The etymology file consists of a sequence of entries. Each entry describes the process by which the resulting name was given, showing tracking information as data is discarded and then summary information of how the name was cleaned up (plugs directly into Pidgin cleanup) before it is presented.
Entries are separated by five equals signs and a newline: =====
Each entry begins with the dest_id alone on the first line of the block. Convenient for searching!
A snippet of a local run:
7000002454076078
1 hmmer source found.
0 hmmer sources were removed due to not meeting the trusted family score.
One hmmer source had a good name.
Found an acceptable name in the hmmer sources.
The one we liked best came from:
./test/Rho_sphaeroides_241_HMMERTRANSCRIPTS_17.pidginh:2013
This source's name was cleaned up by pidgin:
filtered name in 1 step:
0) original: Fructose-1-6-bisphosphatase
1) reason: protein names should not start with a capital letter
   pattern: (?:(?<=similar to )|^)([A-Z])(?=[a-z][a-z]+([ /,-]|$))
   filtered: fructose-1-6-bisphosphatase
Final name: fructose-1-6-bisphosphatase
=====
7000002454076081
0 hmmer sources found.
No name was derived from hmmer sources.
2 blast sources were found.
0 blast sources were removed by filtering for low coverage (<0.6).
The highest percent identity of any remaining blast source is 0.992. The lowest is 0.945.
0 blast sources were removed due to not being within the percent identity window (0.992, 0.942).
All 2 blast sources had names that filtered to nothing.
No name was ultimately selected from any of the supplied sources.
Final name: hypothetical protein
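Because entries are separated by a line of five equals signs and start with the dest_id, an etymology file is easy to index. The helpers below are hypothetical and assume each entry ends with a "Final name:" line, as in the snippet above.

```python
def split_etymology(text):
    """Return {dest_id: [remaining lines of that entry]}."""
    entries = {}
    block = []
    # A trailing separator is appended so the last entry is flushed too.
    for line in text.splitlines() + ["====="]:
        if line.strip() == "=====":
            if block:
                entries[block[0].strip()] = block[1:]
            block = []
        elif line.strip() or block:
            block.append(line)  # skip blank lines before an entry starts
    return entries


def final_name(entry_lines):
    """Return the text after 'Final name:' in an entry, or None."""
    for line in reversed(entry_lines):
        if line.startswith("Final name:"):
            return line[len("Final name:"):].strip()
    return None


example = (
    "7000002454076078\n"
    "1 hmmer source found.\n"
    "Final name: fructose-1-6-bisphosphatase\n"
    "=====\n"
    "7000002454076081\n"
    "0 hmmer sources found.\n"
    "Final name: hypothetical protein\n"
)
entries = split_etymology(example)
for dest_id in sorted(entries):
    print(dest_id, final_name(entries[dest_id]))
```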
Python, any version from 2.5 on. You can also use alternate python implementations like jython.
Recommended packages (neither is required to use Pidgin):
http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/FigFam
ftp://ftp.genome.jp/pub/kegg/genes/fasta/genes.pep
http://pfam.sanger.ac.uk/
http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.expasy.ch/sprot/
ftp://ftp.jcvi.org/pub/data/TIGRFAMs/
Unpack the source to a common directory of your choice. Add that directory to your PYTHONPATH as described in python's source doc.
pidgin.py - command-line interface to this package,
doc/pidgin.html - this file,
pidgin/ - source code, with a decent level of comments,
pidgin/test/ - unit tests and the data to run them.
The source code of pidgin is an assortment of interconnected libraries which can be difficult to keep straight. We use unit tests to help us verify that we don't make unintended changes while improving the product. If you decide to extend pidgin, the tests could help you in the same way.
We recommend using nose to run the tests, as it's quick and easy. To run these for yourself, just browse to your pidgin directory and run:
nosetests -v
You should see a string of tests executed, all of which pass.
Pidgin is offered under the BSD license.
#
# Copyright (c) 2009 The Broad Institute, Inc. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
#
# Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# Neither the name of the Broad Institute nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE BROAD INSTITUTE ''AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE BROAD INSTITUTE BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
# OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
We try to use the same input/output format for names as much as possible throughout pidgin.
The simple name file format is a flat text file. It's human-readable and was designed with simple database interactions in mind.
Each line has three columns: an id, a tab character (\t), and a name.
Lines beginning with # are ignored. Any information following the second tab in a line is ignored.
An example of a simple name file:
id1 the name can be any length
id2 and have any character but a newline
# this line is ignored
id3 this name is not ignored
id4 name followed by tab this information is ignored
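Reading this format takes only a few lines. The reader below is a hypothetical sketch, not Pidgin's own parser; it assumes that blank lines and lines without any tab carry no name and can be skipped.

```python
def read_simple_names(lines):
    """Return a list of (id, name) pairs from Simple Name File Format lines."""
    names = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank lines (assumed) and comment lines are ignored
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # assumption: a line without a tab has no name
        # Anything after the second tab is ignored.
        names.append((fields[0], fields[1]))
    return names


example_lines = [
    "id1\tthe name can be any length\n",
    "id2\tand have any character but a newline\n",
    "# this line is ignored\n",
    "id3\tthis name is not ignored\n",
    "id4\tname followed by tab\tthis information is ignored\n",
]
for identifier, name in read_simple_names(example_lines):
    print("%s -> %s" % (identifier, name))
```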
Pidgin was written by Clint Howarth and Matthew Pearson. Many people have contributed to the project:
Finally, thanks to the Autoannotate development team at JCVI, who were kind enough to share the source code of their naming utility with us. Seeing how hard their institute worked to reformat names motivated us to release and document our own naming logic.
We welcome your ideas! Automated gene naming is a complicated process and we'd like to see it done the best way we can.
Please drop a note to pidgin-support at broadinstitute dot org and we'll do our best to accommodate you.
Last updated 2009-10-23