Re: CCL:XML for Bioinformatics Data



Dear Gerald
 That is a nice summary of the capabilities and possibilities offered by
 XML. Some work in this area has already been done. For more
 information on the Biosequence Markup Language (BSML)
 see the WWW page of Visual Genomics Inc. at
     http://www.visualgenomics.com/bsml/index.html
 A BSML browser and examples are available for download.
 What is not currently clear to me is whether a given markup language
 must be approved by the WWW Consortium. The Mathematical Markup
 Language 1.0 (http://www.w3.org/Math/) was released as
 a W3C Recommendation in April 98, but is this required?
 Gerald Loeffler wrote:
 > Hi!
 >
 > Recently, I've been working a lot with XML (see http://www.w3c.org/xml/
 > and e.g. http://www.ibm.com/xml/), which is a standard, human-readable,
 > extensible markup-language that is rapidly becoming _the_ method of
 > choice for exchange and storage of any kind of data and documents. It
 > seems to me that XML would simply be _perfect_ for data exchange and
 > maybe even data storage in bioinformatics (see end of message for a note
 > on chemistry and CML).
 >
 > E.g. (off the top of my head), a DNA/protein sequence similarity search
 > engine (e.g. NCBI's BLAST server) might return its search results in the
 > form of an XML document that could look like this:
 >
 > <seq-sim-search-results>
 >   <query>
 >     <type>                         protein     </type>
 >     <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
 >     <algorithm>                    FASTA3      </algorithm>
 >     <db>                           SwissProt   </db>
 >     <gap-open>                    -12          </gap-open>
 >     <gap-extension>               -2           </gap-extension>
 >   </query>
 >   <hits>
 >     <hit>
 >       <accession>      HPS_HUMAN    </accession>
 >       <organism>       homo sapiens </organism>
 >       <overlap>        11           </overlap>
 >       <overlapping-seq> GAEVLFYWTDQ  </overlapping-seq>
 >       <z-score>        129.3        </z-score>
 >     </hit>
 >     <hit>
 >       <accession>      PA24_MOUSE   </accession>
 >       <organism>       mus musculus </organism>
 >       <overlap>        8            </overlap>
 >       <overlapping-seq> VFIFYWTT     </overlapping-seq>
 >       <z-score>        133.3        </z-score>
 >     </hit>
 >   </hits>
 > </seq-sim-search-results>
 >
 > There are several important points here:
 >
 > 1) Without knowing what this XML document is about, a program can assert
 > that it is well-formed! These programs exist, are free and are
 > applicable to all XML documents!
 >
 > 2) The rules for the nesting and naming of the tags in XML documents of
 > this type can be formally defined in XML. The above document would be of
 > type "seq-sim-search-results" and you could easily write a formal
 > definition (in a DTD file) that says that such a document must contain a
 > "query" and a "hits" tag; the "query" tag in turn must contain exactly
 > one of each "type", "seq", ... The "hits" tag in turn may contain 0 or
 > more "hit" tags which in turn ...
 >
 > 3) Having a formal definition of documents of this type, a program can
 > verify that our above XML document complies with the formal definition
 > (is valid). These programs exist, are free and are applicable to all XML
 > documents!
 >
 > 4) Free utilities exist (e.g. IBM's xml4j) that can programmatically
 > write and read (parse) any XML document and thus give a program access
 > to the structure and content of the document!! (No more Perl parsers for
 > BLAST output!!)
 >
 > 5) This file is human-readable! (in contrast to a CORBA struct or a
 > serialized Java object!)
 >
 > 6) Modern WWW-browsers can (if a style-sheet is supplied) directly
 > display this XML document. For old browsers, the XML document can easily
 > be converted to HTML for display.
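
Points 1 and 4 can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module rather than the xml4j parser mentioned above, and the helper names is_well_formed and read_hits are made up for this illustration. DOC is an abbreviated copy of the quoted document.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the search-result document quoted above.
DOC = """\
<seq-sim-search-results>
  <query>
    <type> protein </type>
    <seq name="My stupid peptide"> GAVLIFYWSTQ </seq>
    <algorithm> FASTA3 </algorithm>
    <db> SwissProt </db>
  </query>
  <hits>
    <hit>
      <accession> HPS_HUMAN </accession>
      <organism> homo sapiens </organism>
      <z-score> 129.3 </z-score>
    </hit>
    <hit>
      <accession> PA24_MOUSE </accession>
      <organism> mus musculus </organism>
      <z-score> 133.3 </z-score>
    </hit>
  </hits>
</seq-sim-search-results>
"""


def is_well_formed(text):
    """Return True if text parses as XML, with no knowledge of its tags."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def read_hits(text):
    """Return (accession, z-score) pairs, stripping the padding around values."""
    root = ET.fromstring(text)
    return [
        (hit.findtext("accession").strip(), float(hit.findtext("z-score")))
        for hit in root.iter("hit")
    ]


if __name__ == "__main__":
    print(is_well_formed(DOC))                   # complete document
    print(is_well_formed("<hits><hit></hits>"))  # <hit> is never closed
    print(read_hits(DOC))
```

The well-formedness check never looks at tag names, which is the point of item 1: the same function works for any XML document.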
 >
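
For points 2 and 3, a rough sketch of what a validity check does. Python's standard library parser does not validate against a DTD, so this hand-codes the rules stated in point 2 as a small table instead of using a real DTD. The RULES table and the structure_errors function are hypothetical, and only the children actually named in point 2 are covered (the elided ones are left out).

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: parent tag -> {child tag: (min, max)}.
# A max of None means "any number". Only rules named in point 2 are listed.
RULES = {
    "seq-sim-search-results": {"query": (1, 1), "hits": (1, 1)},
    "query": {"type": (1, 1), "seq": (1, 1)},
    "hits": {"hit": (0, None)},
}


def structure_errors(text):
    """Return a list of rule violations; an empty list means the checks passed."""
    root = ET.fromstring(text)
    errors = []
    if root.tag != "seq-sim-search-results":
        errors.append(f"unexpected root element <{root.tag}>")
    # root.iter() visits the root itself and every descendant element.
    for parent in root.iter():
        for child_tag, (low, high) in RULES.get(parent.tag, {}).items():
            count = len(parent.findall(child_tag))
            if count < low or (high is not None and count > high):
                errors.append(f"<{parent.tag}> has {count} <{child_tag}> children")
    return errors
```

A validating parser working from a DTD would also enforce child order and reject undeclared tags; this sketch only counts the children it knows about.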
 > I think you get the idea.
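
The HTML conversion mentioned in point 6 might look roughly like this. It is a hand-written conversion in Python, not a style sheet, and the function name hits_to_html is invented for the example.

```python
import html
import xml.etree.ElementTree as ET


def hits_to_html(text):
    """Render each <hit> as one row of an HTML table, one cell per child tag."""
    root = ET.fromstring(text)
    rows = []
    for hit in root.iter("hit"):
        cells = "".join(
            # Escape the values so characters like "<" or "&" stay valid in HTML.
            f"<td>{html.escape((field.text or '').strip())}</td>"
            for field in hit
        )
        rows.append(f"<tr>{cells}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Because the function iterates over whatever children a hit has, it needs no change if further fields are added to the hit element.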
 >
 > Does such an XML-based approach sound reasonable?
 > What does this approach leave to be desired?
 > Are efforts underway in this direction?
 > Wouldn't it be a better world if we all used XML (-:
 >
 > I know that XML is currently being used for chemistry-related data (CML,
 > see http://www.xml-cml.org/), but I haven't heard of any efforts in the
 > area of Bioinformatics. So please view this message as targeted towards
 > the Bioinformatics community that is not served by CML. (CML has a
 > DNA/protein sequence tag.)
 >
 >         cheers,
 >         gerald
 > --
 >  Gerald Loeffler
 >  Email: Gerald.Loeffler - at - vienna.at
 >  Smail: Apollo Imaging, Marchettigasse 7, A-1060 Vienna, Austria
 >  Phone: +43 676 3289588 (+43 1 5952333 27)
 >  Fax:   +43 1 5952333 20
 >  Keywords: Java, CORBA, OOA&D, Databases, Bioinformatics,
 >            Computational Biology, Computational Biophysics
 >
 >  "Wir haben nichts zu berichten, als dass wir erbaermlich sind."
 >                                                (Thomas Bernhard)
 --
   Dr Mark J Forster Ph.D.
   Principal Scientist
   Informatics Laboratory
   National Institute for Biological Standards and Control
   Blanche Lane, South Mimms,
   Hertfordshire EN6 3QG, United Kingdom.
   Tel  +44 (0)1707 654753
   FAX  +44 (0)1707 646730
   E-mail  mforster - at - nibsc.ac.uk