Monday, June 9, 2008

Support Vector Learning for Ordinal Regression

In this short paper, the author proposes an ordinal regression method based on SVM and gives some analysis of its behaviour. The problem of ordinal regression is of great importance because in many cases we are unable (or it is unnatural) to provide metric observations, while pure classification is not enough for our needs. This situation arises most often in opinion polls, where the answers are seldom numbers but rather preference classes. As far as I remember, this is the first time we have studied this problem in class.

 

Given an input space X and an ordered outcome space Y, the author first shows that the problem of learning the mapping h: X -> Y can be formulated as finding an optimal h that minimizes pairwise ordering reversals among the training samples over some probability distribution. The author then further shows that by maximizing the margins between different ranks one can bound the learning error even more tightly. Finally, an algorithm with ideas similar to SVM is proposed to compute the projection axis for rank classification. The algorithm can be adapted to non-linear cases using the standard kernel trick.
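To make the pairwise idea concrete, here is a minimal sketch that learns a single ranking direction by training a plain linear SVM (scikit-learn) on difference vectors. This is only my own illustration of the idea, not the authors' exact formulation, which also derives the thresholds between adjacent ranks:

# A minimal sketch of the pairwise idea: learn one projection w such that
# w·x_i > w·x_j whenever rank(y_i) > rank(y_j), by feeding difference
# vectors to an ordinary linear SVM. Data and parameters are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

def fit_ordinal_direction(X, y, C=1.0):
    """X: (n, d) features, y: (n,) integer ranks. Returns a ranking direction w."""
    diffs, signs = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] != y[j]:
                diffs.append(X[i] - X[j])            # pairwise difference sample
                signs.append(1 if y[i] > y[j] else -1)
    svm = LinearSVC(C=C, fit_intercept=False)        # no bias: the problem is symmetric
    svm.fit(np.array(diffs), np.array(signs))
    return svm.coef_.ravel()

# Toy usage: three ranks driven by the first feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.digitize(X[:, 0], [-0.5, 0.5])                # ranks 0, 1, 2
w = fit_ordinal_direction(X, y)
print("learned direction:", np.round(w, 2))          # weight on feature 0 should dominate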

 

The experiments did support the idea. Since this method builds the ordering relation directly into the separating hyperplanes, it can be expected to outperform a standard multi-class SVM. The concept of minimizing a discrimination function over the pairwise sample space can be applied to other learning algorithms as well. For instance, I wonder how it would work if AdaBoost replaced SVM. That should be quite interesting.

Monday, June 2, 2008

The PageRank Citation Ranking: Bringing Order to the Web

This paper introduces the PageRank algorithm, one of the central ranking estimators employed in the Google search engine. The idea of PageRank is quite simple: if a page is cited (linked) by an important page, it is likely to be important as well. By putting more important pages ahead of less important ones in the result list, we can effectively improve search quality. Although I don't actually know how they combine different estimators in Google, PageRank seems to be an essential one.

 

Although the definition of PageRank is recursive, given a network it can be computed via naive iteration if we view the pages and links as nodes and edges respectively. First, we randomly initialize the value of each page; then we repeatedly update the values by propagating them through the links on each page. The iteration is run until convergence. If we interpret the normalized page values as probabilities, what PageRank computes is equivalent to the stationary distribution of a random-walk Markov chain. As argued in the paper, simple iteration converges quickly on common networks because their geometry permits it.
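As a concrete illustration, a bare-bones power-iteration version might look like the following; the damping factor and the jump vector are my own additions following the usual random-surfer reading of the paper:

# A minimal power-iteration sketch of PageRank on an adjacency matrix.
import numpy as np

def pagerank(adj, damping=0.85, personalization=None, tol=1e-10):
    """adj[i, j] = 1 if page i links to page j. Returns the rank vector."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1                       # dangling pages: avoid division by zero
    P = adj / out_deg                               # row-normalized transition matrix
    e = personalization if personalization is not None else np.full(n, 1.0 / n)
    r = np.full(n, 1.0 / n)                         # uniform initialization
    while True:
        r_new = damping * r @ P + (1 - damping) * e # follow a link, or jump
        r_new /= r_new.sum()
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# Toy web of four pages: page 3 is cited by everyone and should rank highest.
A = np.array([[0, 1, 0, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(np.round(pagerank(A), 3))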

 

The author further introduces a method to personalize the search engine. Viewing PageRank through the transition probability matrix of the random walk, we can blend each row with a constant vector that characterizes personal preferences. The experiments on real data did support this idea. The other applications shown in the paper are also interesting.
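In terms of the sketch above, personalization simply replaces the uniform jump vector with one concentrated on the user's favourite pages (again, purely illustrative, reusing the pagerank function and toy graph A from before):

# Jumps land only on page 0 (the user's favourite), biasing the ranks toward it.
prefs = np.array([1.0, 0.0, 0.0, 0.0])
print(np.round(pagerank(A, personalization=prefs), 3))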

Monday, May 26, 2008

Mining Association Rules between Sets of Items in Large Databases and the FP-Tree

* To graders: after reading both papers I found that only one review is required this week, so I have condensed my comments on both papers into one post for simplicity.


This week we studied two papers that both address the association mining problem, namely, automatically deriving association relationships between items from large datasets. The first paper seems to be one of the earliest works in this field; it proposes a well-tuned algorithm based on the tree-search philosophy. The algorithm is quite efficient, as shown in their experiments, considering the computers of that time. The second paper further extends the idea by introducing a compressed data structure, the FP-tree, which achieves exponential speed-ups over the first algorithm and takes space proportional to the transaction dataset. Both papers give complete analyses of their proposed methods and thorough experiments.


Given transactions represented as binary vectors whose length is the number of items, association mining aims to find itemsets that appear a significant number of times across the whole dataset. Trivially, one may simply perform a tree search over all possible frequent itemsets. At each node of the tree, one would go through the whole dataset to verify whether the current path is still frequent or a backtrack is needed. This algorithm, however, is computationally intractable, and thus better methods are required. The first paper therefore proposes a series of accelerations much like those in AI search, such as probing, pruning, and heuristic frontier-set generation, all of which work well on real data. We should note that the output of all frequent itemsets can itself be exponentially large, so even the best algorithm will eventually blow up when the dataset becomes sufficiently large.
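As a rough illustration of the level-wise search, here is an Apriori-style sketch; it is not the first paper's exact probing/pruning and frontier-set procedure, and the toy data and threshold are made up:

# Level-wise frequent-itemset mining: candidates of size k are joined from
# frequent itemsets of size k-1 and verified with one pass over the data.
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}
    result = {}
    while level:
        # One pass over the dataset to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join frequent sets into next-level candidates; keep a candidate only
        # if all of its subsets were frequent (the pruning step).
        keys = list(frequent)
        level = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        level = {c for c in level
                 if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))}
    return result

# Toy transactions: {milk, bread} is the clear association.
data = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}, {"milk", "bread"}]
for itemset, count in sorted(frequent_itemsets(data, 3).items(), key=lambda kv: -kv[1]):
    print(set(itemset), count)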


The second paper exploits the idea that a much smaller data structure may represent a set of itemsets completely. For example, by saying we want all subsets of {1, 2, 3, 4}, we effectively "represent" 16 sets with only 4 numbers. Thus, if we can store the whole collection of frequent itemsets in a concise structure and only "decompress" a small part when needed, we can largely avoid the implicit exponential blow-up. The FP-tree is simply a trie over the frequent items of each transaction. By compressing exponentially many subsets into the stems of the tree, the FP-tree can be built efficiently and takes only a small number of nodes compared to the original dataset. In fact, I think there is an even more suitable data structure for this problem, the BDD. BDDs (and their various extensions) can accomplish exponential compression of tree branches by allowing paths to reconverge. Extending BDDs to multisets might solve the association mining problem better than current approaches.
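Here is a minimal sketch of how such a trie could be built (the FP-growth mining step and the header-table links are omitted; the structure and names are my own simplification):

# FP-tree construction sketch: a trie over each transaction's frequent items,
# ordered by global frequency so that common prefixes share nodes.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, min_support):
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_support}
    root = FPNode(None, None)
    for t in transactions:
        # Keep only frequent items, ordered by descending global frequency.
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1                      # shared prefixes just bump their counts
    return root

def dump(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

# The three {milk, bread} transactions collapse onto one branch of the trie.
data = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}, {"milk", "bread"}]
dump(build_fp_tree(data, 2))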

Monday, May 19, 2008

On Spectral Clustering: Analysis and an algorithm

This paper proposes an extension of the normalized cut to simultaneous K-way segmentation. The author also studies the conditions under which the algorithm works well and demonstrates, with a simple example, what the eigenvectors actually represent physically. The algorithm is quite easy to implement too.

 

The most important contribution of the paper should be identifying the geometric structure of the top k eigenvectors. Specifically, given a perfect similarity kernel (i.e. f(i, j) = 0 whenever i and j belong to different classes), the rows of the matrix formed by the top k eigenvectors cluster around k mutually orthogonal points on the unit k-sphere for k-class data, achieving maximum separation. The practical situation, however, is not so ideal: the similarity function is susceptible to noise, and the author studies this case with matrix perturbation theory, giving theoretical guarantees for the algorithm.
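The whole algorithm fits in a few lines. The sketch below follows my reading of it: Gaussian affinities, the normalized matrix D^{-1/2} A D^{-1/2}, the top k eigenvectors, row normalization, then k-means; the kernel width and the toy data are purely illustrative:

# Spectral clustering sketch: embed points via the top-k eigenvectors of the
# normalized affinity matrix, then cluster the row-normalized embedding.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(A, 0)                           # zero self-affinity
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                  # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)
    Y = vecs[:, -k:]                                 # top-k eigenvectors as columns
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # project rows onto the unit sphere
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)

# Two well-separated 2D blobs should be recovered cleanly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print(spectral_clustering(X, 2))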

 

The simple experiments on a few 2D datasets give promising results, but whether the algorithm works well on higher-dimensional data remains a question. Indeed, I think the biggest problem in multi-class clustering is having to specify the number of classes before running. It is truly a conflict, because if we already knew the cluster number, we would probably already have a good idea of the partitioning. Maybe that is why we so often see segmentation algorithms used in an over-segmentation manner.

Normalized Cuts and Image Segmentation

This paper proposes a new method for image segmentation based on spectral graph theory. The presented algorithm, called the normalized cut, has stimulated a great deal of similar work in this field (e.g. this one). The paper also discusses several related techniques and provides a concrete mathematical derivation of the algorithm, all of which give us insightful inspiration. I have been learning spectral methods myself these days and found this paper quite helpful :)

 

Normalized cut is based on an intuitive concept: to cluster data properly, we should try to maximize the within-class similarity while minimizing the between-class similarity. The author proposes the N-cut measure to capture this idea. Although finding the exact solution is NP-hard, an approximation to the resulting optimization turns out to be a generalized eigenvector problem in the real domain. The normalized cut is superior to the conventional cut measure in that it does not favour cutting off small, isolated groups of nodes. In fact, the idea of the normalized cut is very similar to LDA, and both methods work by solving a generalized eigenvector problem on affinity matrices.
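A bare-bones two-way version of the idea, as I understand it, reduces to one generalized eigenproblem; the recursive splitting and the paper's smarter choice of splitting point are omitted here:

# Two-way normalized cut sketch: solve (D - W) y = lambda * D y and threshold
# the second-smallest generalized eigenvector.
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    """W: symmetric affinity matrix. Returns a boolean partition label per node."""
    d = W.sum(axis=1)
    D = np.diag(d)
    _, vecs = eigh(D - W, D)            # generalized eigenproblem
    y = vecs[:, 1]                      # second-smallest eigenvector
    return y > np.median(y)             # simple threshold (the paper searches for the best split)

# Two loosely connected cliques: the cut should separate nodes 0-2 from 3-5.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0)
print(ncut_bipartition(W).astype(int))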

 

The experiments in the paper do help us appreciate the idea better. In particular, the experiment on 1D data shows how other spectral methods can fail in specific cases. I also like the spring-mass analogy; it is really interesting.

Monday, May 12, 2008

A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models

This is a quite long and tough paper. The author chose a practical computer vision problem to demonstrate several inference and learning algorithms on various graphical models. These algorithms are variants of EM, including ICM, Gibbs sampling, variational and structured variational methods, and max-product BP. Finally, experiments on small-scale data are presented; the results give us some basic idea of the algorithms' behaviour.

 

According to the paper, graphical models can be divided into three kinds: Bayesian networks, MRFs, and factor graphs, with the last one subsuming the first two. By representing model parameters as random variables, we can run many approximate learning procedures on the model with training data. Each procedure then has its own characteristics. For example, exact EM models an acceptable amount of randomness, so given a not-too-bad initialization it can often converge to a good solution. ICM ignores any randomness and takes a greedy step at each possible update; as a result, its solutions are usually the worst. Belief propagation uses a simple updating mechanism, yet it can produce satisfying results on many problems sufficiently fast.
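To make the contrast concrete, here is roughly what ICM's greedy coordinate updates look like on a toy chain-structured model; this is not the paper's occlusion model, and the costs are made up purely for illustration:

# ICM on a chain MRF: each node repeatedly switches to the label minimizing its
# local energy given its neighbours' current labels; all randomness is ignored.
import numpy as np

def icm(unary, pairwise, n_sweeps=10):
    """unary: (n, L) costs; pairwise: (L, L) costs between chain neighbours."""
    n, L = unary.shape
    labels = unary.argmin(axis=1)                      # greedy initialization
    for _ in range(n_sweeps):
        for i in range(n):
            costs = unary[i].copy()
            if i > 0:
                costs += pairwise[labels[i - 1]]       # cost with the left neighbour
            if i < n - 1:
                costs += pairwise[:, labels[i + 1]]    # cost with the right neighbour
            labels[i] = costs.argmin()                 # best label given the neighbours
    return labels

# Noisy binary chain with a Potts-like smoothness cost.
unary = np.array([[0, 2], [0, 2], [2, 0], [1.5, 0.5], [0, 2]], dtype=float)
pairwise = np.array([[0, 1], [1, 0]], dtype=float)
print(icm(unary, pairwise))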

 

The experimental result of the exact EM method on the occlusion problem is truly astonishing. It is hard to believe such quality can be achieved without introducing a smoothness prior or initial background/foreground priors. I wonder how it would turn out if such elements were put into play. In any case, this paper does serve as an excellent reference on probabilistic inference and learning methods for graphical models.

Monday, May 5, 2008

Learning Low-Level Vision

This paper is one of the prominent early works on applying MRFs to computer vision problems. The author demonstrates that three problems (super-resolution, intrinsic images, and motion estimation) can be solved efficiently through proper modelling with MRFs. Although the author frames this approach as a "learning" process, the computational problem we actually solve is combinatorial optimization over a huge number of states. Exploring a search space of this scale was previously thought possible only with generic local search algorithms such as simulated annealing, GA, etc., which often return weak local optima.

 

A Markov random field consists of two components: nodes and edges. Each node ranges over a pre-determined set of possible states, each of which corresponds to a local decision with actual meaning. An energy function is then designed to encourage the nodes' decisions to suit our purpose, based on the individual decision of each node and on the pair of decisions at the two ends of each edge. That is the concept of MRF optimization.
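In code, the whole model boils down to a single energy over a joint labelling; a minimal version with made-up costs might look like this:

# A minimal MRF energy: per-node (data) costs plus per-edge (smoothness) costs,
# evaluated for one joint assignment of states.
def mrf_energy(labels, unary, pairwise, edges):
    """labels[i]: state of node i; unary[i][s]: cost of state s at node i;
    pairwise[s][t]: cost of neighbouring states (s, t); edges: list of (i, j)."""
    data_term = sum(unary[i][labels[i]] for i in range(len(labels)))
    smooth_term = sum(pairwise[labels[i]][labels[j]] for i, j in edges)
    return data_term + smooth_term

# Three nodes on a path, two states each; the smoother labelling has lower energy.
unary = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0]]
pairwise = [[0.0, 1.0], [1.0, 0.0]]
edges = [(0, 1), (1, 2)]
print(mrf_energy([0, 0, 1], unary, pairwise, edges))   # 0 + 0.2 + 0 + (0 + 1) = 1.2
print(mrf_energy([0, 1, 1], unary, pairwise, edges))   # 0 + 0.8 + 0 + (1 + 0) = 1.8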

 

Formulating a problem under the MRF framework involves finding the data-term and smoothness-term energy functions. The data term, sometimes called the evidence term, measures how well a node's decision is locally supported by statistics of the problem instance. The smoothness term, on the other hand, measures how compatible one node's decision is with its neighbour's across the common edge. The definition of both functions is nevertheless left to the designer. For example, in this paper the author chooses example patches from a large dataset as the possible states of the MRF nodes. This formulation effectively quantizes the huge decision space down to a few of the most "natural" choices. The resulting MRF problem is then solved with belief propagation.
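For the super-resolution case, a rough sketch of how those two terms could be instantiated with example patches is shown below; the patch sizes, overlaps, and distance measures are my own simplifications, not the paper's exact choices:

# Patch-based costs for super-resolution: a candidate high-res patch is scored
# against the local low-res observation (data term), and two neighbouring
# candidates are scored by how well they agree on their overlapping pixels
# (smoothness term). Sizes, overlaps, and distances are simplified guesses.
import numpy as np

def data_cost(observed_lowres_patch, candidate_highres_patch, downsample):
    """How well a candidate explains the local low-resolution observation."""
    return np.sum((downsample(candidate_highres_patch) - observed_lowres_patch) ** 2)

def smoothness_cost(left_candidate, right_candidate, overlap=2):
    """Disagreement of two horizontally adjacent candidates on their overlap columns."""
    return np.sum((left_candidate[:, -overlap:] - right_candidate[:, :overlap]) ** 2)

# Toy check: a candidate whose left edge copies its neighbour's right edge
# overlaps perfectly, while an unrelated candidate does not.
rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = np.hstack([a[:, -2:], rng.random((8, 6))])
print(smoothness_cost(a, b), smoothness_cost(a, rng.random((8, 8))))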

 

Strictly speaking, the answer we find through belief propagation is still a local optimum, just with theoretical guarantees of higher quality. The visual quality of the answers, however, turns out to meet most people's needs in many cases. BP is certainly not the only way to solve MRF problems. In recent years many new solvers have been proposed, including graph cuts and tree-reweighted message passing, which provide solutions of even higher quality in more efficient ways. The remaining problem is to find a really suitable energy for your problem, which is, after all, the fundamental difficulty in algorithm design.