We use the term "data source" instead of "database" a lot when talking about K2, and there's a good reason for this. Everyone has their own idea of the best way to store and represent data. Some rely on commercial database systems like Sybase and Oracle, some use spreadsheet programs like Excel, and some simply dump their data into structured (or sometimes unstructured) flat files. There are also many ways to make these data available, from customized visualization tools and standalone applications, to loosely structured Web pages and ASCII text files, to a simple direct connection to a database. Finally, data sources may reside on remote servers, requiring a distributed approach to query processing. Part of the challenge of mediation is finding ways to access all of these sources and to transform their contents into a useful representation.
The primary application area for K2 to this point has been bioinformatics. In this area there are a great many sources of data, but only a very small number of them reside in traditional relational, or even object-oriented, databases. Most of the information is kept in "home-grown" systems that have very limited query capabilities. In addition, a lot of information can be gained by running standalone data analysis programs and by browsing the Web. Here are some examples of the different types of data sources to which we have connected K2:
SRS - The Sequence Retrieval System from Lion Bioscience is basically a parsing and indexing system on top of a collection of flat-file "databanks" such as SwissProt and GenBank. One can set up an SRS installation on one's own system, and mirror and index the databank files locally. Lion's SRS Objects product provides APIs in languages such as C++, Perl, and Java™, which allow a program to issue queries against a local SRS installation. There are also several servers worldwide that provide CGI access to their local SRS.
KEGG - The Kyoto Encyclopedia of Genes and Genomes is an attempt to combine knowledge from various genome sequencing projects to build a coherent picture of molecular interactions, in the form of metabolic and regulatory pathways. This is another "set up your own installation and mirror the files locally" scheme. KEGG maintains its databases internally as indexed flat files, over which programs called bfind and bget operate. KEGG also has a Web-based interface which essentially "HTML-izes" the results of invoking the underlying bfind and bget executables.
BLAST - The Basic Local Alignment Search Tool is a means of comparing an amino acid or nucleotide sequence against a database and retrieving sequences that are similar to the query sequence. BLAST is a standalone application that can be run through its website at NCBI, or can be installed and run locally.
Genomes - The National Center for Biotechnology Information (NCBI) at the National Library of Medicine maintains a number of useful websites, including the BLAST site mentioned above. NCBI also maintains a site with information about fully sequenced genomes, which provides another interesting demonstration of K2's ability to treat anything as a data source. Among other things, this site has a single page for each genome, listing all of the genes in the genome and their positions relative to each other.
USPTO - The United States Patent and Trademark Office maintains a website through which users can get information about U.S. patents. The full text can be searched for patents dating back to 1976.
Delphion - Formerly the IBM Intellectual Property Network, Delphion also maintains a website for retrieving patent information. In addition to U.S. patents, Delphion has European, Japanese, and WIPO patents.
PubMed - PubMed, one of the many useful sites at NCBI, provides access to citations from MEDLINE and additional life science journals.
MMDB - Another NCBI site, the Molecular Modeling Database contains macromolecular structures obtained from the Protein Data Bank (PDB).
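To make the idea of treating very different things as "data sources" concrete, here is a minimal sketch in Python. It is not K2's actual interface: the DataSource class, the select and query_all functions, the toy flat-file layout (entries terminated by "//"), and the sample records are all assumptions invented for illustration. It only shows how a flat file and a spreadsheet-style CSV export might be wrapped so that both return records in one common representation.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List

Record = Dict[str, str]  # common representation: field name -> value


class DataSource(ABC):
    """Hypothetical uniform interface (an assumption, not K2's API)."""

    @abstractmethod
    def records(self) -> Iterator[Record]:
        """Yield every entry of the source as a Record."""

    def select(self, field: str, value: str) -> List[Record]:
        # Naive scan; a real wrapper would push the query down to the source.
        return [r for r in self.records() if r.get(field) == value]


class FlatFileSource(DataSource):
    """Toy flat file: 'KEY value' lines, entries terminated by '//'."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> Iterator[Record]:
        entry: Record = {}
        for line in self._text.splitlines():
            if line.strip() == "//":
                if entry:
                    yield entry
                entry = {}
            elif line.strip():
                key, _, value = line.partition(" ")
                entry[key.lower()] = value.strip()
        if entry:  # tolerate a missing final terminator
            yield entry


class CsvSource(DataSource):
    """Toy spreadsheet export: CSV text with a header row."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> Iterator[Record]:
        for row in csv.DictReader(io.StringIO(self._text)):
            yield {k.lower(): v for k, v in row.items()}


def query_all(sources: List[DataSource], field: str, value: str) -> List[Record]:
    """Ask every source the same question and merge the answers."""
    hits: List[Record] = []
    for source in sources:
        hits.extend(source.select(field, value))
    return hits


# Made-up sample data, for illustration only.
FLAT_TEXT = "ID   E1\nORGANISM   yeast\n//\nID   E2\nORGANISM   mouse\n//\n"
CSV_TEXT = "ID,Organism\nE3,yeast\nE4,human\n"

if __name__ == "__main__":
    sources = [FlatFileSource(FLAT_TEXT), CsvSource(CSV_TEXT)]
    for hit in query_all(sources, "organism", "yeast"):
        print(hit)
```

The caller never sees whether a record came from a flat file or a spreadsheet. Wrapping a remote Web page or a standalone program would follow the same pattern, with records() doing the fetching and parsing.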