MINT and IntAct contribute to the Second BioCreative challenge: serving the text-mining community with high quality molecular interaction data

Genome Biology 2008, 9(Suppl 2): S5 (Volume 9, Suppl 2, Article S5). Open Access. Research.

Andrew Chatr-aryamontri ¤1, Samuel Kerrien ¤2, Jyoti Khadake ¤2, Sandra Orchard 2, Arnaud Ceol 1, Luana Licata 1, Luisa Castagnoli 1, Stefano Costa 1, Cathy Derow 2, Rachael Huntley 2, Bruno Aranda 2, Catherine Leroy 2, Dave Thorneycroft 2, Rolf Apweiler 2, Gianni Cesareni 1 and Henning Hermjakob 2

Addresses: 1 Department of Biology, University of Rome, Tor Vergata, Via della Ricerca Scientifica, 00133 Rome, Italy. 2 EMBL - European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SD, UK.

¤ These authors contributed equally to this work.

Correspondence: Gianni Cesareni. Email: cesareni@uniroma2.it. Henning Hermjakob. Email: hhe@ebi.ac.uk

© 2008 Chatr-aryamontri et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: In the absence of consolidated pipelines to archive biological data electronically, information dispersed in the literature must be captured by manual annotation. Unfortunately, manual annotation is time consuming and the coverage of published interaction data is therefore far from complete. The use of text-mining tools to identify relevant publications and to assist in the initial information extraction could help to improve the efficiency of the curation process and, as a consequence, the database coverage of data available in the literature. The 2006 BioCreative competition was aimed at evaluating text-mining procedures in comparison with manual annotation of protein-protein interactions.
Results: To aid the BioCreative protein-protein interaction task, IntAct and MINT (Molecular INTeraction) provided both the training and the test datasets. Data from both databases are comparable because they were curated according to the same standards. During the manual curation process, the major cause of data loss in mining the articles for information was ambiguity in the mapping of the gene names to stable UniProtKB database identifiers. It was also observed that most of the information about interactions was contained only within the full-text of the publication; hence, text mining of protein-protein interaction data will require the analysis of the full-text of the articles and cannot be restricted to the abstract.

Conclusion: The development of text-mining tools to extract protein-protein interaction information may increase the literature coverage achieved by manual curation. To support the text-mining community, databases will highlight those sentences within the articles that describe the interactions. These will supply data-miners with a high quality dataset for algorithm development. Furthermore, the dictionary of terms created by the BioCreative competitors could enrich the synonym list of the PSI-MI (Proteomics Standards Initiative-Molecular Interactions) controlled vocabulary, which is used by both databases to annotate their data content.

Published: 01 September 2008
Genome Biology 2008, 9(Suppl 2): S5
doi: 10.1186/gb-2008-9-S2-S5
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2008/9/S2/S5

Background

Molecular interactions are the heart of cellular physiology, and protein-protein interactions specifically play a key role in a multitude of cellular functions, from signal transduction to gene expression regulation.
Thus, knowledge of the interaction networks of cells is fundamental to understanding the roles played by each protein in the cellular machinery. The recent development of high-throughput methodologies for the study of protein-protein interactions offers great promise for the compilation of the cellular interactomes. The volume of data thus generated requires the development of informatics tools for storing, querying and analyzing the data.

The molecular interaction databases MINT (Molecular INTeraction) [1,2] and IntAct [3,4] were conceived for the purpose of storing experimentally verified protein-protein interactions reported in peer-reviewed journals. Not all experimental methods and experimental setups are equally trustworthy. For instance, some techniques, although useful for mapping the interaction domains, are performed in vitro, and therefore in the absence of cellular factors that may modulate the interaction; whereas for in vivo techniques the system is often perturbed in order to facilitate the detection of an interaction. Both MINT and IntAct therefore endeavor to capture a full representation of the interaction data available in the literature to allow users to determine the reliability of an interaction. With the aim of achieving complete literature coverage, the two databases (along with other major public interaction data providers) founded the International Molecular Exchange Consortium (IMEx) [5] for sharing curation efforts and for exchanging completed records on molecular interaction data.

One of the most important recent advances in interaction data annotation is the development of the PSI-MI controlled vocabulary (CV) [6]. This was developed by the Molecular Interactions (PSI-MI) work group of the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) [7] and consists of a standardized and hierarchical ontology of terms used for accurately describing interaction data.
The PSI-MI CV terms provide an in-depth description of the term to which the various synonyms used in literature can be mapped. Thus, the PSI-MI CV greatly aids consistent, unambiguous annotation and is a boon to data exchange in several respects. First, it permits annotators to describe interaction data fully without resorting to free text; this makes annotation faster and less error prone. Second, when applied in accordance with the agreed standards, the CV permits seamless exchange of data between databases, because no mapping from one lexicon (or one set of semantic rules) to another is required. For instance, to describe an experiment in which a GST-tagged molecule is over-expressed in a eukaryotic cell, pulled down with affinity beads, and interacting partners identified by mass spectrometry, curators can describe the experiment with the most appropriate CV terms available. In the absence of a shared CV, databases may employ free text descriptions that can vary between individual curators and databases, or have separate in-house CVs that do not map to each other. Thus, IntAct and MINT curate data using the PSI-MI CV terms in order to describe interaction data consistently. Advances in experimental techniques for determining and characterizing interactions are reflected in the continual evolution of the CV. A snapshot of the hierarchical PSI-MI CV is shown in Figure 1. The data itself is stored and disseminated in the PSI-MI 2.5 standard, an XML exchange format [8].

Deposition into public databases is a mandatory prerequisite for publication of nucleic acid sequences, protein sequences, and protein structures. However, this is not yet the case for molecular interaction data; journals are only now starting to make such database submission mandatory [9]. Nevertheless, the upload of data from high-throughput experiments also requires a curation effort.
Thus, the efficient extraction of molecular interaction data from already-published literature is necessary to populate the publicly available databases. Furthermore, in the case of high-throughput experiments the only way to upload the information is through a manual curation of the data usually supplied as supplementary materials. To date, the only reliable way to achieve this is through manual curation, which is a time-consuming and laborious process. The development of effective text-mining tools could complement manual curation by speeding up the information extraction process, thus permitting increased literature coverage. For instance, text mining tools could facilitate the mapping of protein interactors to their UniProtKB [10] identifiers, as well as selecting the text that best describes the interaction and matching this text to appropriate PSI-MI CV terms. However, for a full and accurate description of interactions, a manual element is still required (see Challenges for automatic extraction, below).

The BioCreative [11] protein-protein interaction (PPI) task addresses exactly these goals. Competitors were compared and evaluated to determine whose methodologies would most likely be useful in real world scenarios, for instance as an aid to the database curators. To assist with the BioCreative PPI task, IntAct and MINT contributed both a training set for development of algorithms and a test set for objective evaluation of the text-mining tools. Interactions annotated from the test set publications were not publicly released by contributing databases until the BioCreative subtasks were completed. In addition, both databases provided a full description of their curation process, including the paper selection criteria and the quality control processes used to check resulting database records.
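The step of matching interaction-describing text to PSI-MI CV terms, mentioned above as one place where text mining could assist curators, can be illustrated with a minimal sketch. The identifiers MI:0018, MI:0019 and MI:0096 are PSI-MI terms for two hybrid, coimmunoprecipitation and pull down; the synonym spellings, the `CV_SYNONYMS` table and the `match_cv_terms` function are assumptions made up for this illustration, and naive substring matching stands in for whatever method a real tool would use.

```python
# Hypothetical excerpt of a synonym dictionary; this is NOT the real
# PSI-MI synonym list, only an illustration of the mapping idea.
CV_SYNONYMS = {
    "MI:0018": ("two hybrid", ["two-hybrid", "two hybrid", "y2h"]),
    "MI:0019": ("coimmunoprecipitation",
                ["coimmunoprecipitat", "co-immunoprecipitat", "co-ip"]),
    "MI:0096": ("pull down", ["pull-down", "pull down", "pulldown"]),
}


def match_cv_terms(sentence):
    """Return sorted (MI id, term name) pairs whose synonyms occur in the sentence."""
    text = sentence.lower()
    hits = []
    for mi_id, (name, synonyms) in CV_SYNONYMS.items():
        # Naive case-insensitive substring match against each synonym.
        if any(synonym in text for synonym in synonyms):
            hits.append((mi_id, name))
    return sorted(hits)


if __name__ == "__main__":
    print(match_cv_terms("GST pull-down assays showed that A binds B."))
```

A sentence with no recognized synonym returns an empty list, which is the case where a curator would still have to choose the term by hand.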
Results and discussion

Curation standards

Syntax and semantics for data representation in MINT and IntAct are provided by the Proteomics Standards Initiative-Molecular Interaction (PSI-MI 2.5) standards, as established by the PSI-MI workgroup, of which MINT and IntAct are core members. This workgroup develops and maintains a common data model for the representation and exchange of interaction data. The schema and the CVs, which allow representation of binary and n-ary interactions, are continuously updated to permit increasingly accurate and detailed descriptions of molecular interactions. Interaction records in MINT and IntAct represent either physical interactions or co-localizations (Figure 2) in accordance with the PSI-MI standards, where 'physical interactions' are defined as 'interactions among molecules that can be direct or indirect'. Because genetic interactions describe functional relationships among genes, they are considered distinct from physical interactions between proteins and are not currently curated by MINT and IntAct.

The database curation teams strive to maintain high curation standards. However, a comparison of publications curated by both MINT and IntAct between the years 2003 and 2005 revealed that the two databases annotated exactly the same interaction pairs in only 6 out of 52 publications. The discrepancies were due to partial curation by one or the other database (not all the interactions were annotated), although each publication was meant to be fully curated; most of these partial curations occurred early in the history of MINT and IntAct, when the two databases were still developing their curation standards, including the standards for what constitutes adequate evidence for an interaction.
Other discrepancies were due to mapping of an interactor to different isoforms; to incomplete information in the manuscript (the two databases varied in how much additional information they sought from the authors); and occasionally to curation errors. Furthermore, identical PSI-MI CV terms to describe the 'experimental methods' were used by both databases for only nine publications, although terms from the same hierarchical branch had been selected. This suggested that, in many cases, the choice of the method term in the ontology tree is susceptible to curator interpretation so that the same experimental evidence can lead to different interaction records. Thus, the adoption of the PSI-MI standards per se is not sufficient to guarantee identical database records in the different databases. Shared curation rules are also necessary to ensure that the PSI-MI standards are applied consistently between databases.

Figure 1. An overview of the PSI-MI CV in OLS. CV, controlled vocabulary; MI, Molecular Interactions; OLS, Ontology Lookup Service; PSI, Proteomics Standards Initiative.

Figure 2. Interaction type in PSI-MI. MI, Molecular Interactions; PSI, Proteomics Standards Initiative.

To this aim, as of January 2006 MINT and IntAct have adopted the same curation rules as defined in the IMEx curation manual [12]. This curation manual was optimized with an inter-annotator agreement exercise performed in December 2005 in which the curation of five selected publications performed by the different IMEx members was compared. As a result of the development of common curation rules, no differences were reported in either the identification of the molecules or in the interaction detection methods.
This initial small number of publications is currently being expanded as the rules are further developed, ensuring greater curation consistency between databases.

The discrepancies in data supplied by the two databases are not expected to affect the BioCreative PPI subtasks to the same degree. There should have been no impact on the interaction article subtask (IAS), because both databases agreed on which articles contained interaction data. Similarly, the interaction method subtask (IMS) should not have been affected because the databases differed only in the granularity of CV terms used, and the IMS subtask mapped the methods to the root (least granular) terms. The interaction pair subtask (IPS) would have been the most affected because the databases occasionally differed in their identification of interactors. Other discrepancies were in fields not assessed by any subtask. Thus, the discrepancies should have had only minimal impact on the competition; regardless, we recommend using data curated according to IMEx standards in future competitions.

IntAct/MINT databases contribution to the BioCreative training set

Protein-protein interaction information extracted from articles during the years 2005 and early 2006 formed the contribution of the IntAct and MINT databases to the training dataset. There was no preselection of particular journals within this set. All interactions meeting curation standards were annotated, and the data were made available to the BioCreative organizers in PSI-MI XML2.5 format. Each interaction in the 'training set' has been fully represented, including experimental features such as interaction detection method, participant identification method, post-translational modifications, mutations affecting the interaction, and binding ranges. For the purposes of this competition, the BioCreative participants used only the XML fields reporting the protein identifiers and the experimental detection method.
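As a rough illustration of reading only the protein identifiers and the detection method from such a record, the sketch below parses a small hand-written fragment. The fragment is an assumption: it is simplified and namespace-free, only modeled on the PSI-MI XML 2.5 layout, and the record in it is invented for the example (real files declare an XML namespace and carry many more fields). The function name `extract_pairs` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment modeled on the PSI-MI XML 2.5 layout.
# The record itself is invented for illustration only.
SAMPLE = """
<entrySet>
  <entry>
    <experimentList>
      <experimentDescription id="1">
        <interactionDetectionMethod>
          <names><shortLabel>two hybrid</shortLabel></names>
          <xref><primaryRef db="psi-mi" id="MI:0018"/></xref>
        </interactionDetectionMethod>
      </experimentDescription>
    </experimentList>
    <interactorList>
      <interactor id="10"><xref><primaryRef db="uniprotkb" id="P04637"/></xref></interactor>
      <interactor id="11"><xref><primaryRef db="uniprotkb" id="Q00987"/></xref></interactor>
    </interactorList>
    <interactionList>
      <interaction id="100">
        <experimentList><experimentRef>1</experimentRef></experimentList>
        <participantList>
          <participant id="1000"><interactorRef>10</interactorRef></participant>
          <participant id="1001"><interactorRef>11</interactorRef></participant>
        </participantList>
      </interaction>
    </interactionList>
  </entry>
</entrySet>
"""


def extract_pairs(xml_text):
    """Return a list of (sorted protein accessions, sorted method ids) per interaction."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter("entry"):
        # Map internal interactor ids to their primary database accession.
        accession = {
            node.get("id"): node.find("xref/primaryRef").get("id")
            for node in entry.iter("interactor")
        }
        # Map internal experiment ids to the detection method CV identifier.
        method = {
            node.get("id"): node.find(
                "interactionDetectionMethod/xref/primaryRef").get("id")
            for node in entry.iter("experimentDescription")
        }
        for interaction in entry.iter("interaction"):
            proteins = sorted(accession[ref.text]
                              for ref in interaction.iter("interactorRef"))
            methods = sorted(method[ref.text]
                             for ref in interaction.iter("experimentRef"))
            results.append((proteins, methods))
    return results


if __name__ == "__main__":
    print(extract_pairs(SAMPLE))
```

Everything else in a full record (participant identification method, features, binding ranges) is ignored here, mirroring the restriction to two field types described above.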
IntAct/MINT database contribution to the BioCreative test set

The MINT test set was composed of protein-protein interactions extracted from articles published in FEBS Letters, EMBO Journal, and EMBO Reports between January 2006 and July 2006. The IntAct test set was composed of protein-protein interactions extracted from articles of Journal of Biological Chemistry (JBC) and journals belonging to the Nature group of publications published in 2006 (Table 1).

As an additional task for the compilation of the BioCreative test set, the curators were asked to identify and report the sentence best describing each interaction from the perusal of either the abstract or the full-text article. In 20% of the cases it was not possible to identify a sentence describing the curated interaction (Table 2).

Challenges for automatic extraction of protein interaction data by text mining

Here we describe a list of potential problems in the curation process that might affect BioCreative predictions. These are pitfalls that could affect any manual or text mining effort to extract interaction data, potentially leading to inaccurate or incomplete description of information contained in articles.

Missing UniProtKB mappings

The major cause of loss of data results from the difficulty in mapping the gene names or identifiers described in the article to UniProtKB entries. This was due to an ambiguous description of the gene name, species, subtype of the protein in question, or more rarely the absence of a UniProtKB entry for the molecule involved in the interaction. A typical case is when authors mention that they used the mammalian protein without specifying whether they are referring to, for example, the mouse or the human form. In other cases the authors mention the use of a multisubunit protein without specifying the subunit. This issue has been recently addressed in the MIMIx (Molecular Interaction Experiment) recommendations [13].
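The species ambiguity described above can be made concrete with a small sketch. The lookup table, the function name `map_to_uniprot` and the status labels are assumptions for illustration; the two accessions are the UniProtKB entries for human and mouse p53, used here only as a familiar example of one gene name resolving to different entries depending on species.

```python
# Hypothetical lookup table: (gene name, species) -> UniProtKB accession.
# A real resolver would query UniProtKB rather than a hard-coded dict.
NAME_TO_ACCESSION = {
    ("tp53", "human"): "P04637",
    ("tp53", "mouse"): "P02340",
}


def map_to_uniprot(gene_name, species=None):
    """Return (accession, status); accession is None unless exactly one entry matches."""
    candidates = sorted(
        accession
        for (name, entry_species), accession in NAME_TO_ACCESSION.items()
        if name == gene_name.lower()
        and (species is None or entry_species == species)
    )
    if len(candidates) == 1:
        return candidates[0], "mapped"
    if not candidates:
        return None, "no entry"
    # More than one candidate: the article did not say which form was used.
    return None, "ambiguous"


if __name__ == "__main__":
    print(map_to_uniprot("TP53", "human"))
    print(map_to_uniprot("TP53"))
```

When the species is omitted, the function refuses to pick an accession, which corresponds to the data loss the curators report for articles that mention only "the mammalian protein".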
Interactions cannot be mined from abstracts

It is not always possible to identify a single sentence that clearly describes an interaction reported in a paper. In many cases the evidence that a paper is eligible for curation is dispersed throughout multiple sentences in the full-text article or may only be in figure or table legends. Nevertheless, curators can clearly identify and extract an interaction from a figure or a table, even if there is no sentence explicitly reporting that interaction in the text. For instance, positive controls are not usually cited in the text and interactions from high-throughput experiments are reported in tables.

False positives derived from ambiguous terms

For text-miners the presence of the word 'interaction' in the text directly points to an interaction. Unfortunately, the word 'interaction' can refer to experiments describing genetic interactions that are not curated by MINT/IntAct, to drug-drug interactions, or to other data irrelevant to MINT/IntAct. In other cases there is no experimental evidence supporting the interaction, which is based only on the authors' assumptions. Interactions may also be described based on predictions or model building; these do not constitute physical interactions or co-localization and are not curated by either of the databases.

Table 1. MINT and IntAct contribution to the test set

          Count of publications   Count of interactions
MINT      221                     1,520
IntAct    154                     951
Total     375                     2,471

Interactions mediated by complexes

Interactions between protein complexes (for example, Pol II) and proteins are not considered by MINT/IntAct curators. In these cases, the interactions detected by the text-mining tool will not find any match in MINT/IntAct records.
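The false-positive problem described under ambiguous terms can be sketched with a naive keyword filter. The cue phrases, the regular expressions and the `classify` function are assumptions chosen for illustration; they are not the rules used by the curators or by any BioCreative participant, and a handful of patterns like these would miss most real cases.

```python
import re

# Matches "interact", "interacts", "interacted", "interaction(s)", "interacting".
INTERACTION = re.compile(r"\binteract(?:s|ed|ion|ions|ing)?\b", re.IGNORECASE)

# Illustrative cue phrases only: contexts in which the word is used for
# something other than an experimentally supported physical interaction.
EXCLUDE = re.compile(
    r"genetic(?:ally)? interact|drug-drug interaction|predicted|putative|modell?ing",
    re.IGNORECASE,
)


def classify(sentence):
    """Label a sentence as 'no mention', 'likely not curated' or 'candidate'."""
    if not INTERACTION.search(sentence):
        return "no mention"
    if EXCLUDE.search(sentence):
        return "likely not curated"
    return "candidate"


if __name__ == "__main__":
    for text in (
        "Protein A interacts with protein B in a pull-down assay.",
        "We observed a genetic interaction between gene A and gene B.",
        "Docking predicted an interaction between A and B.",
    ):
        print(classify(text), "-", text)
```

A tool that stopped at the first check would report all three example sentences as interactions; only the first describes evidence of the kind curated by the databases.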
Contribution to text-mining community

If text mining tools can accurately identify sentences or passages within articles that are indicative of molecular interactions, then they can potentially facilitate manual curation by prescreening the literature. We therefore provided the BioCreative competitors with examples of such sentences and passages.

An annotation topic 'source-text' was introduced in MINT and IntAct. MINT datasets are downloadable from the MINT FTP site [14]. Furthermore, IntAct has continued to extract the interaction sentences; currently, 3,463 sentences are available for 529 publications. IntAct has also introduced an annotation topic 'dataset' with the description 'BioCreative - Critical Assessment of Information Extraction Systems in Biology' to identify the entries that contributed to the BioCreative test set. Both the extracted interaction sentences and the dataset curated for BioCreative competition are available for download from IntAct FTP site [15,16]. The normalized protein interaction sentences generated from the BioCreative initiative were then made available by the organizers for subsequent assessment by database curators.

Text-mining and the development of the PSI-MI controlled vocabulary

The PSI-MI CV provides a consistent set of terms used to annotate the interaction data. The vocabulary is continuously updated to assimilate newer and more sophisticated techniques. Synonyms, definition, and literature reference for each term are stored within the CV to assist the user in selecting the appropriate term.
The dictionary of synonyms developed by the text-mining community, both during the competition and in the future, could be incorporated into the PSI-MI CVs and thereby greatly enhance the CVs.

Manual curation is laborious and it is extremely difficult to quantify the required amount of time to complete the curation of each article; the process of curating a single paper can take up to 1 day of a trained curator's time, much of which is consumed in adding significant value to the interactions. Initial identification of the interactors and interaction technique is followed by an in-depth analysis of the interactors and the interactions. The PSI-MI CV is used extensively to define and describe the interactors and interactions. InterPro signatures [17] and Gene Ontology terms [18] are also used to provide richer interactor annotation to users. The additional steps ensure full and accurate data representation.

IntAct and MINT are currently investigating the possibility of integrating text-mining applications into their curation environment. This is being done at two levels. The first is identifying the publications describing interactions involving a given set of proteins (IAS) [19,20]. The second allows for pre-analysis of full-text publications for unambiguous mapping to UniProtKB entries and identification of the interaction detection method involved (IPS). This analysis can be stored in PSI-MI XML as preliminary data and then be used by a curator to perform the exhaustive annotation of the publication. The results of the full curation are then used to enhance further the tool by indicating which of the predicted interactions were right and wrong. Here again, the PSI-MI XML is used to propagate the feedback to the text-miners.

Conclusion

MINT and IntAct provide high quality and well documented interaction data from the literature using controlled vocabularies, which reduce the ambiguity in the naming of the techniques and interpretation of interaction features.
This is achieved through careful manual curation by highly qualified curators. However, as both the volume of literature and the number of proteins requiring characterization increase, the manual processing capability is soon saturated. Semi-automated assistance would thus greatly expedite the curation process. Text-mining in the biomedical domain is receiving increasing attention. To aid and encourage the development of such tools, the IntAct team at the European Bioinformatics Institute and the MINT team at the University of Rome Tor

Table 2. Interactions with annotated sentence

          Count of interactions   Count of interactions with sentence   %
MINT      1,520                   1,176                                 77
IntAct    951                     801                                   84
Total     2,471                   1,977                                 80
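The percentage column of Table 2 can be re-derived from the two count columns; the short check below does so and also recomputes the Total row from the MINT and IntAct rows. The variable and function names are chosen here for illustration; only the counts come from the table.

```python
# Counts taken from Table 2: (interactions, interactions with a sentence).
TABLE_2 = {
    "MINT": (1520, 1176),
    "IntAct": (951, 801),
}


def with_sentence_percent(total, with_sentence):
    """Share of interactions that have an annotated sentence, as a whole percent."""
    return round(100 * with_sentence / total)


# Add a Total row by summing each column across the two databases.
rows = dict(TABLE_2)
rows["Total"] = tuple(sum(column) for column in zip(*TABLE_2.values()))

percentages = {
    name: with_sentence_percent(*counts) for name, counts in rows.items()
}

if __name__ == "__main__":
    for name, counts in rows.items():
        print(name, counts, percentages[name])
```

The overall figure of 80% with a sentence is the complement of the 20% of cases, reported in the test set section, for which no describing sentence could be identified.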