Succinct Data Structures for Data Mining

Overview

Many data mining algorithms take data provided by the user and hold it in memory to mine it. Normal representations of such data are remarkably inefficient, even when efforts are made to minimize space usage.  For example, given a collection of sets of items, the Frequent Itemset Mining problem seeks to mine sets that frequently appear as subsets of these sets. However, even a space-efficient representation of these sets, such as the FP-tree, is quite wasteful in its space usage: each node (which contains only a few bytes of data) has up to three pointers, or 24 bytes of pointer data. Consequently, the memory usage of an FP-tree is typically an order of magnitude more than the file containing itemsets provided by the user.  A similar situation arises in many other data mining tasks. This wasteful use of memory significantly inhibits the scalability of existing data mining algorithms, a problem that can be addressed by succinct data structures (SDS).

SDS represent user-provided data using an amount of computer memory close to the information-theoretic minimum and support very rapid queries and updates on the data. SDS have been shown to have very good theoretical and practical performance for a growing range of applications including text search and XML processing. There is increasing interest in applying SDS to data mining as well, and we have a number of projects under way.

People

Laith Altaee (PhD student)
Rafael Ktistakis (PhD student)
Andreas Poyias (PhD student)
Prof. Rajeev Raman

 

Share this page: