K2 and Information Integration


K2 is a system that supports distributed heterogeneous database and information resource integration. In particular the project is concerned with databases and data sources relevant to genomics and bioinformatics.



Background: Information Integration

A number of systems attempt to address the problem of distributed database integration and the closely related problem of information integration. K2 is one such system (for others, see the "Links" page on this site).

Research in this area is driven by the observation that there are many sources of useful information available on-line. Furthermore, the value of these information sources would in many cases be greatly increased if the information they contain could be combined, "queried" in a uniform manner (i.e. using a single query language and interface), and subsequently returned in a machine-readable form.

However, the databases and information sources in question are frequently autonomous entities that are "incompatible" in a number of respects. These incompatibilities include:

  • The differing data models (e.g. relational, object-relational, object-oriented, structured document) and data formats (e.g. ASN.1, HTML, XML) used to represent data; roughly speaking, these define the semantics and syntax of the data, respectively.
  • The query language, if any, provided to access the data (e.g. standard query languages such as SQL and OQL, ad-hoc boolean query expressions, API-level programmatic interfaces, or "fill in the blank" query interfaces).
  • The low-level mechanisms and protocols that must be used to retrieve data (e.g. FTP, HTTP/CGI, IIOP, RMI, and various proprietary protocols such as Sybase's TDS).
  • And last, but certainly not least, the structure and meaning of the data itself.

The challenge in database and information integration is to overcome these incompatibilities and present a uniform interface to multiple information sources. The particular interface presented by K2 is based on the ODMG data model and uses OQL, the Object Query Language. These are standards defined in [2].
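
As a rough illustration of what presenting a uniform interface involves, the sketch below wraps two toy sources that hold similar facts in different formats and under different field names, and exposes them through a single record shape. Everything in it is an assumption made for the example: the function names, the field names, and the sample data are hypothetical, and K2 itself presents OQL over the ODMG data model rather than a Python API.

```python
import csv
import io
import json
from typing import Dict, Iterator, List

Record = Dict[str, str]

# Two toy "sources" with made-up content: one tab-delimited, one JSON lines.
TAB_SOURCE = "gene\torganism\ngeneA\thuman\ngeneB\tmouse\n"
JSON_SOURCE = (
    '{"symbol": "geneC", "species": "human"}\n'
    '{"symbol": "geneD", "species": "mouse"}\n'
)


def read_tab_delimited(text: str) -> Iterator[Record]:
    """Wrapper for the tab-delimited source (hypothetical)."""
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        yield {"gene": row["gene"], "organism": row["organism"]}


def read_json_lines(text: str) -> Iterator[Record]:
    """Wrapper for the JSON source; renames its fields to the shared names."""
    for line in text.splitlines():
        if line.strip():
            obj = json.loads(line)
            yield {"gene": obj["symbol"], "organism": obj["species"]}


def uniform_scan() -> List[Record]:
    """Return records from both sources in one common shape."""
    records: List[Record] = []
    records.extend(read_tab_delimited(TAB_SOURCE))
    records.extend(read_json_lines(JSON_SOURCE))
    return records


def genes_for(organism: str) -> List[str]:
    """A single query that no longer cares where each record came from."""
    return sorted(r["gene"] for r in uniform_scan() if r["organism"] == organism)
```

The wrappers absorb the differences in format and naming, so `genes_for` can be written once against the shared record shape.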



Semistructured and Structured Data

A number of recent information integration systems have focused on the integration of data published on the World-Wide Web, and for good reason: the web is by far the fastest-growing repository of public information, at least some fraction of which is likely to be useful. ([3] gives an overview of work in this area from a database perspective.) Some web integration systems have adopted an approach based on treating the available data as semistructured data. Semistructured data is more heterogeneous than structured data and does not adhere to an obvious schema. Intuitively, it is data whose "size" is large relative to the size of any schema that attempts to describe its structure. Of course, the line between data and metadata (i.e. schema) is fluid, but this is the whole point of semistructured data: one has unpredictable data and hence very little reliable metadata. See [1] for more on this topic.

K2, on the other hand, takes a structured approach to information integration, including web-based integration. It has a fairly rich type system, and is therefore able to do a substantial amount of type-checking before queries are actually executed. Queries can still generate runtime type errors if K2's data sources do not adhere to their declared types. Contrast this with semistructured query languages, in which type errors are rare or nonexistent, and the semantics of queries are often defined in terms of finding those subexpressions or subgraphs in the input data that match a particular path expression in a query. Subexpressions that fail to match the query expression simply contribute nothing to the result set. Of course, structured integration is not without its disadvantages: while type-checking helps ensure that queries are valid, determining the types of new data sources is time-consuming. In situations where the structure of the data is unpredictable or difficult to determine, semistructured query languages are often preferable. In fact, a semistructured query language can be enormously helpful in trying to discover the structure of a new data source.
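
To make the semistructured style of evaluation concrete, here is a minimal sketch of path-expression matching over nested data. It is not taken from any particular semistructured query language; the function `match_path` and the sample entries are hypothetical. The point it shows is that data that does not match the path is skipped silently, so even a misspelled path produces an empty result rather than an error.

```python
from typing import Any, Iterator, List


def match_path(data: Any, path: List[str]) -> Iterator[Any]:
    """Yield every value reachable from `data` by following `path`.

    Lists are traversed element by element. Anything that does not match
    the next step of the path contributes nothing to the result.
    """
    if not path:
        yield data
        return
    if isinstance(data, list):
        for item in data:
            yield from match_path(item, path)
    elif isinstance(data, dict) and path[0] in data:
        yield from match_path(data[path[0]], path[1:])
    # Missing keys and scalar values fall through here and are ignored.


# Made-up, irregular entries: the second has no locus, the third no gene.
ENTRIES = [
    {"gene": {"name": "geneA", "locus": {"chromosome": "1"}}},
    {"gene": {"name": "geneB"}},
    {"note": "free text only"},
]

names = list(match_path(ENTRIES, ["gene", "name"]))
chromosomes = list(match_path(ENTRIES, ["gene", "locus", "chromosome"]))
misspelled = list(match_path(ENTRIES, ["gene", "nmae"]))  # typo goes unnoticed
```

Here `names` holds both gene names, `chromosomes` holds only the one value that exists, and `misspelled` is simply empty. A structured system would instead reject the last query before running it.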

K2 adopts the structured approach because most of the data sources we are interested in have well-defined schemas. Even the web-based data sources we access are typically built on top of database systems of one kind or another, meaning that the HTML pages they generate are quite well-structured. Also, we feel that it is important for the system to be able to alert us to type-checking errors, because these flag situations in which:

  • A query was written incorrectly due to a programmer error. For example, if a query attempts to access a non-existent field of a relational table, K2 will report this and give a list of the valid fields.
  • The actual type of a data source does not conform to its expected type, either because our understanding of the data source was flawed, or because the underlying data source has actually changed.
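
The two situations above can be sketched as a pair of checks: one applied to a query before it runs, and one applied to data as it arrives from a source. This is only an illustration under assumed names; `SCHEMA`, `QueryTypeError`, `check_fields`, and `check_row` are hypothetical and do not reflect K2's actual type checker or its error messages.

```python
from typing import Dict, List

# Hypothetical declared schema: table name -> field name -> type name.
SCHEMA: Dict[str, Dict[str, str]] = {
    "gene": {"id": "int", "symbol": "string", "chromosome": "string"},
}

# Mapping from the declared type names to Python types, for the runtime check.
PYTHON_TYPES = {"int": int, "string": str}


class QueryTypeError(Exception):
    """Raised when a query or a source disagrees with the declared schema."""


def check_fields(table: str, fields: List[str]) -> None:
    """Before execution: reject a query that names an undeclared field."""
    if table not in SCHEMA:
        raise QueryTypeError(f"unknown table '{table}'")
    declared = SCHEMA[table]
    for field in fields:
        if field not in declared:
            valid = ", ".join(sorted(declared))
            raise QueryTypeError(
                f"table '{table}' has no field '{field}'; valid fields: {valid}"
            )


def check_row(table: str, row: dict) -> None:
    """At run time: reject a row that does not conform to the declared type."""
    for name, type_name in SCHEMA[table].items():
        if name not in row:
            raise QueryTypeError(f"row from '{table}' is missing field '{name}'")
        if not isinstance(row[name], PYTHON_TYPES[type_name]):
            raise QueryTypeError(
                f"field '{name}' of '{table}' should be {type_name}, "
                f"got {type(row[name]).__name__}"
            )
```

Calling `check_fields("gene", ["name"])` raises an error that lists the valid fields, which covers the programmer-error case. Calling `check_row` on a row whose `id` arrives as a string raises as well, which covers the case where a source has drifted from its declared type.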


Implementation Details

Central to any information integration system are the languages used to represent the contents and capabilities of information sources. For its part, K2 uses a collection-based data model with support for sets, bags, and lists (the so-called "collection" types), other complex types such as records (structs) and variants (discriminated unions), and object-oriented classes. Its data model is therefore sufficiently expressive to represent many common kinds of databases (in particular relational, object-relational, and object-oriented databases) and data sources.
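
For readers who find type constructors easier to follow in code, the sketch below mirrors the kinds of types listed above using Python's standard library. It is an analogy only: the names (`Locus`, `Position`, `Gene`, and so on) and the values are invented for the example, and K2's own type system is not defined in Python.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Union


@dataclass(frozen=True)
class Locus:
    """A record (struct): a fixed set of named, typed fields."""
    chromosome: str
    start: int
    end: int


@dataclass(frozen=True)
class Mapped:
    """First alternative of the variant below."""
    locus: Locus


@dataclass(frozen=True)
class Unmapped:
    """Second alternative of the variant below."""
    reason: str


# A variant (discriminated union): a value is exactly one of the alternatives.
Position = Union[Mapped, Unmapped]


@dataclass
class Gene:
    """An object-oriented class whose attributes use the collection types."""
    symbol: str
    position: Position
    synonyms: FrozenSet[str]   # set: unordered, no duplicates
    exon_lengths: List[int]    # list: ordered, duplicates allowed


def describe(position: Position) -> str:
    """Dispatch on which alternative of the variant is present."""
    if isinstance(position, Mapped):
        locus = position.locus
        return f"chr{locus.chromosome}:{locus.start}-{locus.end}"
    return f"unmapped ({position.reason})"


# A bag: unordered, but duplicates are counted.
keyword_bag = Counter(["kinase", "kinase", "repair"])

example = Gene(
    symbol="geneA",
    position=Mapped(Locus("1", 100, 200)),
    synonyms=frozenset(["gA", "gA", "gene-a"]),
    exon_lengths=[30, 30, 45],
)
```

The three collection types differ only in whether order and duplicates are significant, which is why `synonyms` collapses its repeated entry while `exon_lengths` and `keyword_bag` keep theirs.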

K2's data model is also similar to the ODMG object model, which is specified using ODL (the Object Definition Language) as part of the ODMG's Object Database Standard [2]. This standard also defines a query language, OQL, the Object Query Language, and K2 supports OQL as its primary query language. K2 supports information integration (albeit at a relatively low level) by allowing users to pose queries against multiple information sources in OQL. K2 will optimize and evaluate such queries, using its specialized "data drivers" to communicate with the underlying information sources and retrieve the necessary data. Some degree of abstraction is possible, as OQL views can be written that integrate one or more of the underlying data sources. This corresponds to the "global as view" approach mentioned in [3].
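
The "global as view" idea can be sketched in a few lines: the integrated view is defined as a query over the underlying sources, and users then query the view without naming the sources. The code below is a simplified stand-in written in Python rather than OQL. The `InMemoryDriver` class, the two toy sources, and the view `mapped_sequences` are all hypothetical; real K2 data drivers talk to external systems, and K2 also optimizes queries, which this sketch does not attempt.

```python
from typing import Dict, Iterator, List

Row = Dict[str, object]


class InMemoryDriver:
    """Stand-in for a data driver: hides how a source is actually reached."""

    def __init__(self, collections: Dict[str, List[Row]]) -> None:
        self._collections = collections

    def scan(self, name: str) -> Iterator[Row]:
        """Return every row of the named collection."""
        return iter(self._collections[name])


# Two toy sources with made-up content and differing field names.
sequence_db = InMemoryDriver({
    "entries": [
        {"acc": "X1", "gene": "geneA"},
        {"acc": "X2", "gene": "geneB"},
        {"acc": "X3", "gene": "geneZ"},  # has no mapping information
    ],
})
mapping_db = InMemoryDriver({
    "loci": [
        {"symbol": "geneA", "chromosome": "1"},
        {"symbol": "geneB", "chromosome": "2"},
    ],
})


def mapped_sequences() -> Iterator[Row]:
    """The global view, defined as a join over the two sources."""
    chromosome_of = {
        locus["symbol"]: locus["chromosome"] for locus in mapping_db.scan("loci")
    }
    for entry in sequence_db.scan("entries"):
        if entry["gene"] in chromosome_of:
            yield {
                "accession": entry["acc"],
                "gene": entry["gene"],
                "chromosome": chromosome_of[entry["gene"]],
            }


def accessions_on(chromosome: str) -> List[str]:
    """A user query posed against the view, not against the sources."""
    return [
        row["accession"]
        for row in mapped_sequences()
        if row["chromosome"] == chromosome
    ]
```

Because the view is written in terms of the sources, adding or changing a source means revising the view definition, while queries such as `accessions_on` stay as they are.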



References

[1] Peter Buneman. Semistructured Data. PODS, 117-121, 1997.

[2] R.G.G. Cattell and Douglas Barry, eds. The Object Database Standard: ODMG 2.0. Morgan Kaufmann: San Francisco, 1997.

[3] Daniela Florescu, Alon Levy, and Alberto Mendelzon. Database Techniques for the World-Wide Web: A Survey. SIGMOD Record, 27(3):59-74, 1998.