Monday, May 26, 2008

Mining Association Rules between Sets of Items in Large Databases and the FP-Tree

* To graders: after reading both papers I found that only one review is required this week, so for simplicity I condensed my comments on both papers into one post.


This week we studied two papers that both address the association mining problem - namely, automatically deriving association relationships among items in large datasets. The first paper appears to be one of the earliest works in this field; it proposed a well-tuned algorithm based on a tree-search philosophy. As their experiments show, the algorithm is quite efficient considering the computers of that time. The second paper extended the idea by introducing a compressed data structure, the FP-tree, which achieves a substantial speed-up over the first algorithm and takes space proportional to the transaction dataset. Both papers gave complete analyses of their proposed methods and conducted thorough experiments.


Given transactions represented as binary vectors whose length equals the number of items, association mining is the task of finding itemsets that appear a significant number of times across the whole dataset. Trivially, one could perform a tree search over all possible frequent itemsets. At each node of the tree, one would scan the whole dataset to check whether the current path is still frequent or a backtrack is needed. This algorithm, however, is computationally intractable, so better methods are required. The first paper therefore proposed a series of accelerations much like those used in AI search, such as probing, pruning, and heuristic frontier-set generation, all of which work well on real data. We should note that generating all frequent itemsets is inherently exponential in the worst case, so even the best algorithm will eventually blow up once the dataset becomes sufficiently large.
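The level-wise candidate generation and pruning described above can be sketched roughly as follows. This is my own minimal Python sketch of the Apriori-style idea, not the paper's actual implementation; all names (`apriori`, `min_support`, etc.) are mine, and real systems would avoid the repeated full scans used here for simplicity.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: generate size-k candidates from frequent
    (k-1)-itemsets, prune candidates with an infrequent subset,
    then verify the survivors against the dataset."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Count transactions containing the itemset (a full scan per call;
        # the real algorithm batches these counts per pass).
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    current = [s for s in (frozenset([i]) for i in items)
               if support(s) >= min_support]
    frequent = {s: support(s) for s in current}
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))]
        # Verification pass over the data.
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent
```

The prune step is where the search tree is cut down: a candidate is discarded without touching the data whenever any of its subsets already failed the support threshold.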


The second paper exploits the idea that a collection of sets may be represented compactly yet completely in a much smaller data structure. For example, by saying we want all subsets of {1, 2, 3, 4}, we effectively "represent" 16 sets with only 4 numbers. Thus, if we could store the whole collection of frequent itemsets in a concise structure and "decompress" only a small part when needed, we could largely avoid the implicit exponential blow-up. The FP-tree is simply a trie over the frequent items of each transaction. By compressing exponentially many subsets into the stems of the tree, the FP-tree can be built efficiently and takes only a small number of nodes compared with the original dataset. In fact, I think there is an even more suitable data structure for this problem: the BDD. BDDs (and their various extensions) can achieve exponential compression of tree branches by allowing paths to reconverge. Extending BDDs to multisets might solve the association mining problem better than current approaches.
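To make the "trie of frequent items" concrete, here is a minimal sketch of FP-tree construction, assuming the standard recipe from the paper (drop infrequent items, sort each transaction by global frequency, insert as a path). The class and function names are my own; the real structure also keeps header-table links between nodes of the same item, which I omit here.

```python
from collections import Counter

class Node:
    """One trie node: an item, an occurrence count, and children keyed by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # First pass: global item frequencies; discard infrequent items.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None)
    # Second pass: insert each transaction as a path; sorting by descending
    # global frequency maximizes prefix sharing, which is what compresses
    # many transactions into few nodes.
    for t in transactions:
        path = sorted((i for i in t if i in freq),
                      key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, freq
```

Transactions that share a frequent prefix reuse the same nodes and only bump their counts, so the tree stays small relative to the raw dataset whenever items co-occur often.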
