Monday, May 26, 2008

Mining Association Rules between Sets of Items in Large Databases and the FP-Tree

* To graders: I found, after reading both papers, that only one review is required this week, so I have condensed my comments on both papers into a single post for simplicity.


This week we studied two papers that both address the association mining problem - namely, automatically deriving association relationships among items in large datasets. The first paper appears to be one of the earliest works in this field; it proposed a well-tuned algorithm based on the tree-search philosophy. The algorithm is quite efficient, as their experiments show, especially considering the computers of that time. The second paper further extended the idea by introducing a compressed data structure, the FP-tree, which achieves a dramatic speed-up over the candidate-generation approach of the first paper while taking space proportional to the transaction dataset. Both papers give a complete analysis of their proposed methods and report thorough experiments.


Given transactions represented as binary vectors whose length equals the number of items, association mining asks for itemsets that appear a significant number of times across the whole dataset. Trivially, one could perform a tree search over all possible itemsets: at each node of the tree, one scans the whole dataset to verify whether the current path is still frequent or a backtrack is needed. This naive algorithm, however, is computationally intractable, so better methods are required. The first paper therefore proposed a series of accelerations much like those used in AI search, such as probing, pruning, and heuristic frontier-set generation, all of which work well on real data. We should note that the number of frequent itemsets can itself grow exponentially in the number of items, so even the best method will eventually blow up when the dataset becomes sufficiently large. A minimal sketch of this level-wise, candidate-generation style of search is given below.
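
Below is a minimal sketch of a level-wise frequent-itemset search in this candidate-generate-and-count spirit. It is not the paper's exact algorithm; the transaction format (Python sets of item ids), the absolute min_support threshold, and the subset-pruning rule are my own simplifications.

    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        """Level-wise candidate generation and counting (sketch).

        transactions: list of sets of hashable item ids
        min_support:  minimum absolute number of supporting transactions
        """
        # Level 1: frequent single items.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s for s, c in counts.items() if c >= min_support}
        result = {s: counts[s] for s in current}

        k = 2
        while current:
            # Join frequent (k-1)-itemsets into size-k candidates, keeping only
            # those whose every (k-1)-subset is frequent (the prune step).
            candidates = set()
            for a in current:
                for b in current:
                    union = a | b
                    if len(union) == k and all(
                        frozenset(sub) in current for sub in combinations(union, k - 1)
                    ):
                        candidates.add(union)
            # One pass over the data to count the surviving candidates.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {c for c, n in counts.items() if n >= min_support}
            result.update((c, counts[c]) for c in current)
            k += 1
        return result

    # Toy usage:
    # txns = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'd'}, {'b', 'c'}]
    # print(frequent_itemsets(txns, min_support=2))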


The second paper exploits the idea that a large family of sets can be represented compactly in a much smaller data structure. For example, by saying we want all subsets of {1, 2, 3, 4}, we effectively "represent" 16 sets with only 4 numbers. Thus, if we can store what is needed to enumerate the frequent itemsets in a concise structure and only "decompress" a small part when needed, we can largely avoid the implicit exponential blow-up. The FP-tree is essentially a trie over the frequent items of each transaction, with items sorted by global frequency. By sharing common prefixes along the stems of the tree, the FP-tree can be built efficiently and typically takes far fewer nodes than the original dataset. In fact, I think there is an even more suitable data structure for this problem: the BDD. BDDs (and their various extensions) can achieve exponential compression of tree branches by allowing paths to reconverge. Extending BDDs to multisets might solve the association mining problem better than current approaches. A small sketch of FP-tree construction follows.
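
Here is a minimal construction-only sketch based on my reading of the paper: count global item frequencies, discard infrequent items, sort each transaction's surviving items by descending frequency, and insert them into a counted trie. The header table and node links that FP-growth uses for mining are omitted, and the class and function names are mine.

    class FPNode:
        def __init__(self, item, parent=None):
            self.item = item          # item stored at this trie node (None for the root)
            self.count = 0            # number of transactions sharing this prefix
            self.parent = parent
            self.children = {}        # item -> FPNode

    def build_fp_tree(transactions, min_support):
        """Build an FP-tree: a prefix trie of frequency-sorted frequent items."""
        # First pass: global item frequencies.
        freq = {}
        for t in transactions:
            for item in t:
                freq[item] = freq.get(item, 0) + 1
        freq = {i: c for i, c in freq.items() if c >= min_support}

        root = FPNode(None)
        # Second pass: insert each transaction's frequent items, most frequent first.
        for t in transactions:
            items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                child = node.children.get(item)
                if child is None:
                    child = FPNode(item, parent=node)
                    node.children[item] = child
                child.count += 1
                node = child
        return root, freq

    # Toy usage:
    # root, freq = build_fp_tree([{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}], min_support=2)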

Monday, May 19, 2008

On Spectral Clustering: Analysis and an algorithm

This paper proposed an extension of the normalized cut to simultaneous K-way segmentation. The authors also studied the conditions under which their algorithm works well and demonstrated, with a simple example, what the eigenvectors represent physically. The algorithm is quite easy to implement as well.

 

The most important contribution of the paper is identifying the geometric structure of the top k eigenvectors. Specifically, given an ideal similarity function (i.e., f(i, j) = 0 whenever points i and j come from different classes), the normalized rows of the matrix formed by the top k eigenvectors cluster tightly around k mutually orthogonal points on the unit sphere for k-class data, achieving maximum separation. The practical situation, however, is not so ideal: the similarity function is susceptible to noise, and the authors study this case with matrix perturbation theory, giving theoretical guarantees for their algorithm. A compact sketch of the resulting algorithm is given below.
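
A compact sketch of the procedure as I understand it from the paper: form a Gaussian affinity matrix, normalize it symmetrically, stack the top k eigenvectors, normalize each row to unit length, and run k-means on the rows. The kernel width sigma, the numerical safeguards, and the use of scikit-learn's KMeans are my own choices here.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k, sigma=1.0):
        """Spectral clustering sketch in the style of the paper.

        X: (n, d) data matrix, k: number of clusters, sigma: Gaussian kernel width.
        """
        # Affinity matrix with zero diagonal.
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        A = np.exp(-sq / (2.0 * sigma ** 2))
        np.fill_diagonal(A, 0.0)

        # Symmetric normalization D^{-1/2} A D^{-1/2}.
        d = A.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

        # Top-k eigenvectors (eigh returns eigenvalues in ascending order).
        _, vecs = np.linalg.eigh(L)
        V = vecs[:, -k:]

        # Normalize each row to unit length, then cluster the rows with k-means.
        V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
        return KMeans(n_clusters=k, n_init=10).fit_predict(V)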

 

The simple experiments on a few 2D datasets give promising results, but whether the algorithm works well on higher-dimensional data remains an open question. Indeed, I think the biggest problem in multi-class clustering is having to specify the number of classes before running. It feels somewhat contradictory: if we already knew the number of clusters, we would probably already have a good idea of the partitioning. Maybe that is why segmentation algorithms so often operate in an over-segmentation manner.

Normalized Cuts and Image Segmentation

This paper proposed a new method for image segmentation based on spectral graph theory. The presented algorithm, called the normalized cut, has stimulated a great deal of similar work in this field (e.g., the paper reviewed above). The paper also discusses several related techniques and provides a concrete mathematical derivation of the algorithm, all of which is insightful. I have been learning about spectral methods recently and found this paper quite helpful :)

 

The normalized cut is based on an intuitive concept: to cluster data properly, we should try to maximize the within-class similarity while minimizing the between-class similarity. The authors proposed the N-cut measure to capture this idea. Although finding the exact solution is NP-hard, a relaxation of the resulting optimization turns out to be a generalized eigenvector problem in the real domain. The normalized cut is superior to the conventional cut measure in that it does not favor cutting off small, isolated sets of nodes. In fact, the idea of the normalized cut is quite similar to LDA, and both methods work by solving a generalized eigenvector problem on suitable matrices. A small sketch of the relaxed computation is given below.
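
A tiny sketch of the relaxed two-way cut only: solve the generalized eigenproblem (D - W) y = lambda D y and split on the sign of the second-smallest eigenvector. The zero threshold (the paper searches over splitting points), the helper name, and the toy affinity matrix are my simplifications, and the recursive multi-way procedure is omitted.

    import numpy as np
    from scipy.linalg import eigh

    def two_way_ncut(W):
        """Relaxed two-way normalized cut on a symmetric affinity matrix W.

        Solves (D - W) y = lambda * D y and thresholds the second-smallest
        eigenvector at zero (a simple choice; the paper searches over splits).
        """
        d = W.sum(axis=1)
        D = np.diag(d)
        L = D - W                      # graph Laplacian
        vals, vecs = eigh(L, D)        # generalized symmetric eigenproblem, ascending order
        y = vecs[:, 1]                 # second-smallest eigenvector
        return y >= 0                  # boolean partition of the nodes

    # Toy usage: two obvious groups {0, 1} and {2, 3}.
    # W = np.array([[0.00, 1.00, 0.01, 0.01],
    #               [1.00, 0.00, 0.01, 0.01],
    #               [0.01, 0.01, 0.00, 1.00],
    #               [0.01, 0.01, 1.00, 0.00]])
    # print(two_way_ncut(W))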

 

The experiments in the paper do help us appreciate the idea better. In particular, the experiment on 1D data shows how other spectral methods fail in specific cases. I also like the spring-mass analogy; it is really interesting.

Monday, May 12, 2008

A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models

This is a rather long and tough paper. The authors chose a practical computer vision problem to demonstrate several inference and learning algorithms on various graphical models. These algorithms are variants of EM, including ICM, Gibbs sampling, variational methods, structured variational methods, and max-product BP. Finally, an experiment on small-scale data is presented, and the results give us some basic idea of the algorithms' behaviour.

 

According to the paper, graphical models can be divided into three kinds: Bayesian networks, MRFs, and factor graphs, with the last one subsuming the first two. By representing model parameters as random variables, we can run many approximate learning procedures on the model with training data. Each learning procedure has its own characteristics. For example, exact EM retains a reasonable amount of the model's randomness, so given a not-too-bad initialization, it often converges to a good solution. ICM ignores the randomness entirely and takes a greedy step at every update; as a result, its solutions are usually the worst. Belief propagation uses a simple updating mechanism, yet it can produce satisfying results on many problems reasonably fast. To make ICM's greedy character concrete, a tiny sketch is given below.
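
This is not the paper's occlusion model, just a generic ICM loop on a pairwise MRF viewed as energy minimization, which makes the "greedy, ignores randomness" behaviour explicit. The cost-table layout and the function names are illustrative assumptions of mine.

    def icm(unary, edges, pairwise, labels, n_iters=10):
        """Iterated Conditional Modes on a pairwise MRF (energy-minimization view).

        unary:    unary[i][s] = cost of assigning state s to node i
        edges:    list of (i, j) index pairs
        pairwise: pairwise(s_i, s_j) -> cost of that joint assignment
        labels:   initial state index for every node (updated greedily in place)
        """
        # Neighbour lists for quick lookup.
        nbrs = {i: [] for i in range(len(unary))}
        for i, j in edges:
            nbrs[i].append(j)
            nbrs[j].append(i)

        for _ in range(n_iters):
            changed = False
            for i in range(len(unary)):
                # Greedily pick the state minimizing the local conditional energy,
                # keeping all neighbours fixed (no randomness, unlike Gibbs sampling).
                def local_cost(s):
                    return unary[i][s] + sum(pairwise(s, labels[j]) for j in nbrs[i])
                best = min(range(len(unary[i])), key=local_cost)
                if best != labels[i]:
                    labels[i] = best
                    changed = True
            if not changed:
                break
        return labels

    # Toy usage: 3-node chain, binary labels, Potts smoothness.
    # unary = [[0.0, 2.0], [1.5, 0.5], [2.0, 0.0]]
    # print(icm(unary, [(0, 1), (1, 2)], lambda a, b: 0.0 if a == b else 1.0, [0, 0, 0]))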

 

The experimental result of the exact EM method on the occlusion problem is truly astonishing. It is hard to believe such quality can be achieved without introducing a smoothness prior or initial background/foreground priors. I wonder how it would turn out if such elements were put into play. In any case, this paper serves as an excellent reference on probabilistic inference and learning methods for graphical models.

Monday, May 5, 2008

Learning Low-Level Vision

This paper is one of the prominent early works applying MRFs to computer vision problems. The authors demonstrate that three problems (super-resolution, intrinsic-image recovery, and motion estimation) can be solved efficiently through proper modelling with MRFs. Although the authors frame this approach as a "learning" process, the computational problem we actually solve is combinatorial optimization over a huge number of states. Search spaces of this scale were previously thought tractable only with generic local search algorithms such as simulated annealing, genetic algorithms, etc., which often return weak local optima.

 

A Markov random field consists of two components: nodes and edges. Each node ranges over a pre-determined set of possible states, each of which corresponds to a local decision with an actual meaning. An energy function is then designed to encourage the nodes' decisions to suit our purpose, based on the individual decision of each node and the joint decisions of the two endpoints of each edge. That is the concept of MRF optimization; the generic form of such an energy is written out below.
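
In symbols (standard pairwise-MRF notation, not taken verbatim from the paper), with phi_i the per-node cost, psi_ij the per-edge cost, V the node set, and E the edge set:

    E(x) = \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E} \psi_{ij}(x_i, x_j)

MRF optimization then means searching for the joint assignment x that minimizes E(x).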

 

Formulating a problem under the MRF framework involves finding the data-term and smoothness-term energy functions. The data term, sometimes called the evidence term, measures how well a node's decision is locally supported by the statistics of the problem instance. The smoothness term, on the other hand, measures how compatible one node's decision is with its neighbour's across the common edge. The definition of both functions is otherwise left to the designer. For example, in this paper the authors choose example patches from a large dataset as the possible states of the MRF nodes. This formulation effectively quantizes the huge decision space down to a few of the most "natural" candidates. The resulting MRF problem is then solved with belief propagation; a sketch of such patch-based terms follows.
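
A rough sketch of what such terms might look like for the super-resolution case: a node's candidate states are example high-resolution patches, the data term compares a candidate's low-resolution rendering to the observed patch, and the smoothness term penalizes disagreement between neighbouring candidates in their overlap region. The squared-difference costs, the overlap width, and the function names are my own illustrative choices, not the paper's exact definitions.

    import numpy as np

    def data_term(candidate_lowres, observed_lowres):
        """How well a candidate patch's low-res rendering matches the observed low-res patch."""
        return float(np.sum((candidate_lowres - observed_lowres) ** 2))

    def smoothness_term(left_patch, right_patch, overlap=2):
        """Disagreement of two horizontally adjacent candidates in their shared overlap columns."""
        return float(np.sum((left_patch[:, -overlap:] - right_patch[:, :overlap]) ** 2))

    # For each node, keep only the few candidates with the smallest data term, then let
    # belief propagation (or any other MRF solver) trade the two terms off globally.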

 

Strictly speaking, the answer found through belief propagation is still a local optimum, just one with stronger theoretical guarantees on its quality. The visual quality of the answer, however, turns out to meet most people's needs in many cases. BP is certainly not the only way to solve MRF problems: in recent years many new solvers have been proposed, including graph cuts and tree-reweighted message passing, which provide solutions of even higher quality in more efficient ways. The remaining problem is to find a really suitable energy for your problem, which is, after all, the fundamental difficulty in algorithm design.

An introduction to graphical models

This paper introduces the basic ideas of, and recent progress in, graphical models. Graphical models have gained more and more attention across the computer science community in recent years, partly because they provide an intuitive way to encode our assumptions about an algorithm's behaviour. Another important advantage of graphical models is that a model that succeeds on one problem can often be transplanted to many other problems with only a few problem-specific modifications, which greatly saves researchers' effort.

 

Typically, a graphical model consists of nodes, which represent the random variables in a system, and edges, which stand for probabilistic relationships or causality between nodes. Depending on the nature of the problem, the graph structure may contain loops, which makes a fundamental difference when operating on the model. Inference and learning are the two most common operations on graphical models: inference tries to solve problems by reasoning through the model, while learning optimizes the model given observations. Exact inference on loopy graphs is NP-hard in general, so approximate inference methods are essential in many applications.

 

Famous algorithms and model instances related to graphical models include MRFs, HMMs, EM, MCMC, etc. In fact, a brief survey of many fields in computer science shows that most of them are flooded with graphical models. For example, recent progress in computer vision is heavily based on the success of Bayesian inference and MRFs. I am optimistic about a further popularization of graphical models in the future.