In the article, Peter Murray-Rust, from the
University of Cambridge, UK, and John Mitchell and Henry Rzepa from
Imperial College London, UK, argue, using three case studies, that
conventional methods such as cutting and pasting chemical information
are time-consuming and introduce errors. The authors argue in favour
of an open XML architecture linking to connection tables or open
databases such as PubChem, to identify chemical compounds mentioned in
the biomedical literature. This comes as additional support for open
chemical databases like the NIH's PubChem, which is currently at the
centre of a legal battle between the NIH and the American Chemical
Society (ACS). The ACS runs the very lucrative Chemical Abstracts
Service and is directly threatened by public databases.
Murray-Rust et al. explain that an open XML-based
architecture would provide a cost-effective and user-friendly way to
publish chemical information.
Such a structure would avoid the loss of data:
currently, 80-99% of chemical information is never published owing to the
lack of a simple technical protocol to access it. It would make
chemical information easier to read, save time, and allow
published data to be aggregated and re-used. Murray-Rust et al. recognise
that implementing such a system might take time and money and might
not be supported by all publishers. However, "if publishers adopt these
tools and protocols, then the quality and quantity of chemical
information available to bioscientists will increase and the authors,
publishers and readers will find the process cost-effective", write
the authors. They add that most chemical information already exists in
electronic format in the chemists' computers and could be converted
into XML format very easily, without any loss.
Murray-Rust et al. used three recent articles
containing chemical information, all from journals in the
BMC series published by BioMed Central, as the basis for case studies
on the usefulness of an XML-based tool for identifying
chemical compounds in the biomedical literature.
Chemical compounds can be identified by connection
tables and associated chemical structure diagrams, or by
structural identifiers such as the IUPAC-NIST Chemical
Identifier (INChI). They can also be found through open, semantically
free identifiers such as those provided by PubChem, through their
common names using open lexicons, or through their systematic chemical names.
XML-based information embedded in the text of digitally published
chemistry documents could refer to one or more of these, to help
readers identify the compounds.
In their first case study, Murray-Rust et al. coded
each molecule mentioned in the article using a simple conversion protocol,
the XML-based Chemical Markup Language (CML), and assigned each molecule its
PubChem ID. They estimate that the entire coding process took them
the same amount of time as it would take a reader to look up the
molecules in chemical databases. In addition to the PubChem ID, CML
could contain the INChI identifier and meta-data for each molecule.
For the second article, they show that, even using an automated system,
looking for information about chemical compounds mentioned in the
article takes around 45 minutes. This could have been avoided if the
compounds had been marked up and linked to connection tables and open
databases. In the third article, the name of one compound had been
misspelt and others were unclear. This made it difficult for
text-mining robots to find information about the compounds, and not
all the data needed was retrieved.
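
The kind of markup described in the first case study can be illustrated with a minimal sketch. It is not the authors' own tooling. It assumes a CML-style molecule element carrying a name, a PubChem ID and an INChI string. The namespace URI, the element names and the convention labels "pubchem:cid" and "iupac:inchi" are illustrative assumptions and have not been validated against the CML schema. The example compound, ethanol, does not come from the article.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace and element names loosely follow CML conventions;
# they are illustrative and not validated against the CML schema.
CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)


def molecule_element(mol_id, name, pubchem_cid, inchi):
    """Build a CML-style <molecule> element carrying several identifiers."""
    mol = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    ET.SubElement(mol, f"{{{CML_NS}}}name").text = name
    # "pubchem:cid" and "iupac:inchi" are hypothetical convention labels.
    ET.SubElement(mol, f"{{{CML_NS}}}identifier",
                  {"convention": "pubchem:cid", "value": str(pubchem_cid)})
    ET.SubElement(mol, f"{{{CML_NS}}}identifier",
                  {"convention": "iupac:inchi", "value": inchi})
    return mol


def identifiers(mol):
    """Return {convention: value} for every identifier on a molecule."""
    return {e.get("convention"): e.get("value")
            for e in mol.findall(f"{{{CML_NS}}}identifier")}


# Illustrative compound (not from the article): ethanol, PubChem CID 702.
ethanol = molecule_element("m1", "ethanol", 702,
                           "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
xml_text = ET.tostring(ethanol, encoding="unicode")
```

A reader's software, or a text-mining robot, could then parse the fragment and call identifiers() to obtain the database key directly, rather than guessing the compound from a possibly misspelt name.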