Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/130038
Title: | Summarizing relational databases | Authors: | Yang, X. Procopiuc, C.M. Srivastava, D. |
Issue Date: | 2009 | Citation: | Yang, X., Procopiuc, C.M., Srivastava, D. (2009). Summarizing relational databases. Proceedings of the VLDB Endowment 2 (1) : 634-645. ScholarBank@NUS Repository. | Abstract: | Complex databases are challenging to explore and query by users unfamiliar with their schemas. Enterprise databases often have hundreds of inter-linked tables, so even when extensive documentation is available, new users must spend a considerable amount of time understanding the schema before they can retrieve any information from the database. The problem is aggravated if the documentation is missing or outdated, which may happen with legacy databases. In this paper we identify limitations of previous approaches to address this vexing problem, and propose a principled approach to summarizing the contents of a relational database, so that a user can determine at a glance the type of information it contains, and the main tables in which that information resides. Our approach has three components: First, we define the importance of each table in the database as its stable state value in a random walk over the schema graph, where the transition probabilities depend on the entropies of table attributes. This ensures that the importance of a table depends both on its information content, and on how that content relates to the content of other tables in the database. Second, we define a metric space over the tables in a database, such that the distance function is consistent with an intuitive notion of table similarity. Finally, we use a Weighted ?? -Center algorithm under this distance function to cluster all tables in the database around the most relevant tables, and return the result as our summary. We conduct an extensive experimental study on a benchmark database, comparing our approach with previous methods, as well as with several hybrid models. We show that our approach not only achieves significantly higher accuracy than the previous state of the art, but is also faster and scales linearly with the size of the schema graph. © 2009 VLDB Endowment. | Source Title: | Proceedings of the VLDB Endowment | URI: | http://scholarbank.nus.edu.sg/handle/10635/130038 | ISSN: | 21508097 |
Appears in Collections: | Staff Publications |
Show full item record
Files in This Item:
There are no files associated with this item.
Google ScholarTM
Check
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.