Monday, June 9, 2008

Support Vector Learning for Ordinal Regression

In this short paper, the authors propose an ordinal learning method based on SVM and give some analysis of its behaviour. The problem of ordinal regression is important because in many cases we are unable (or it is unnatural) to provide metric observations, while pure classification is not enough for our needs. This situation arises most often in opinion polls, where the answers are seldom numbers but preference classes. As far as I remember, this is the first time we have studied this problem in the class.

 

Given an input space X and an ordered outcome space Y, the author first shows that the problem of learning a mapping h: X -> Y can be formulated as finding the h that minimizes pairwise ordering reversals among the training samples over some probability distribution. The author then shows that by maximizing the margins between different ranks one can bound the learning error even more tightly. Finally, an algorithm with ideas similar to SVM is proposed to compute the projection axis for rank classification. The algorithm can be adapted to non-linear cases using the standard kernel method.
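
To make the pairwise-reversal idea concrete for myself, here is a minimal sketch of the reduction to a binary large-margin problem. It is only my own approximation of the spirit of the paper, not the authors' exact algorithm: I simply feed pairwise difference vectors to scikit-learn's LinearSVC, and all names and parameters are my own choices.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_ordinal_direction(X, y, C=1.0):
    """Learn a projection direction w by classifying pairwise differences.

    For every pair (i, j) with y[i] > y[j], the difference x_i - x_j should
    project positively onto w; this is the pairwise-reversal idea, solved
    here with an ordinary linear SVM on the difference vectors.
    """
    diffs, signs = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                diffs.append(X[i] - X[j]); signs.append(+1)
                diffs.append(X[j] - X[i]); signs.append(-1)   # mirrored pair for balance
    clf = LinearSVC(C=C, fit_intercept=False)
    clf.fit(np.asarray(diffs), np.asarray(signs))
    return clf.coef_.ravel()   # the ranking direction w

# Usage: project new samples onto w and threshold the scores into ranks.
```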

 

The experiments do support the idea. Since this method builds the ordering relation implicitly into the separating hyperplanes, it can be expected to outperform a standard multi-class SVM. The concept of minimizing discrimination functions over the pairwise sample space can also be applied to other learning algorithms. For instance, I wonder how it would work if AdaBoost replaced SVM. That should be quite interesting.

Monday, June 2, 2008

The PageRank Citation Ranking: Bringing Order to the Web

This paper introduces the PageRank algorithm, which is one of the central ranking estimators employed in the Google search engine. The idea of PageRank is quite simple: if a page is cited (linked) by an important page, it is likely to be important as well. By putting more important pages in front of less important ones in the result list, we can effectively improve search quality. Although I don't actually know how they combine different estimators in Google, PageRank seems to be the essential one.

 

Although the definition of PageRank is recursive, given a network it can be computed via naive iteration once we view the pages and links as nodes and edges respectively. First, we randomly initialize the value of each page; then we repeatedly update the values by propagating them through the links on each page. The iteration is run until convergence. If we interpret the normalized page values as probabilities, what PageRank computes is equivalent to the long-term (stationary) probabilities of a random-walk Markov chain. As argued in the paper, simple iteration converges quickly on common networks because their geometry permits it.
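
As a note to myself, here is a minimal power-iteration sketch of this computation. It assumes a dense adjacency matrix and the usual damping/teleport formulation; dangling pages are not handled the way the paper handles them, and the parameter names are mine.

```python
import numpy as np

def pagerank(adj, d=0.85, v=None, tol=1e-9, max_iter=100):
    """Power iteration for PageRank. adj[i, j] = 1 if page j links to page i.

    v is the teleport (personalization) vector; a uniform v gives plain
    PageRank. Dangling pages (no out-links) are left as zero columns here,
    so some rank mass leaks; the paper treats them more carefully.
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=0)
    M = adj / np.where(out_deg == 0, 1, out_deg)   # column-normalize the link matrix
    v = np.full(n, 1.0 / n) if v is None else v / v.sum()
    r = np.full(n, 1.0 / n)                        # any initialization works
    for _ in range(max_iter):
        r_new = d * M @ r + (1 - d) * v            # propagate, then teleport
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r
```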

 

The authors further introduce a method to personalize the search engine. Viewing PageRank through the random-walk transition probability matrix, we can blend the transitions with a constant vector that characterizes personal preferences (this is the role of the teleport vector in the sketch above). The experiments on real data do support this idea. The other applications shown in the paper are also interesting.

Monday, May 26, 2008

Mining Association Rules between Sets of Items in Large Databases and the FP-Tree

* To graders: after reading both papers I found that only one review is required this week, so I have condensed my comments on both papers into one post for simplicity.


This week we studied two papers that both address the association mining problem - namely, automatically deriving association relationships among items from large datasets. The first paper seems to be one of the earliest works in this field; it proposes a well-tuned algorithm based on the tree-search philosophy, which is quite efficient in their experiments considering the computers of that time. The second paper extends the idea by introducing a compressed data structure, the FP-tree, which can achieve an exponential speed-up over the first algorithm while taking space proportional to the transaction dataset. Both papers give a complete analysis of the proposed methods and thorough experiments.


Given transactions as binary vectors whose length equals the number of items, association mining is to find item sets that appear a significant number of times over the whole dataset. Trivially, one may perform a tree search over all possible frequent item sets: at each node of the tree, one goes through the whole dataset to check whether the current path is still frequent or a backtrack is needed. This algorithm, however, is computationally intractable, and better methods are required. The first paper therefore proposes a series of accelerations much like those in AI search, such as probing, pruning and heuristic frontier-set generation, which all work well on real data (a rough sketch of the level-wise idea follows below). We should note that enumerating all frequent itemsets can take exponential time in the worst case, so even the best algorithm will eventually blow up when the dataset becomes sufficiently large.
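
The sketch below is my own Apriori-style rendering of the level-wise search with support-based pruning. It captures the same spirit as the paper's frontier-set generation but is not the exact algorithm described there; the function names and the data format are mine.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise frequent-itemset search with support-based pruning.

    transactions: list of sets of items; min_support: absolute count.
    A candidate is pruned if any of its subsets is infrequent.
    """
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Next-level candidates: unions of two frequent sets differing in one item.
        keys = list(level)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        # Prune candidates that have an infrequent subset.
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
    return frequent
```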


The second paper exploits the idea that we may represent a large collection of sets completely in a much smaller data structure. For example, by saying we want all subsets of {1, 2, 3, 4}, we effectively "represent" 16 sets with only 4 numbers. Thus, if we can store the whole collection of frequent itemsets in a concise structure and only "decompress" a small part when needed, we can largely avoid the implicit exponential blow-up. The FP-tree is simply a trie of the frequent items of each transaction. By compressing exponentially many subsets into stems of the tree, the FP-tree can be built efficiently and takes only a small number of nodes compared to the original dataset. In fact, I think there is an even more suitable data structure for this problem: the BDD. BDDs (and their various extensions) can achieve exponential compression of tree branches by allowing paths to reconverge. Extending BDDs to multisets might solve the association mining problem better than current approaches.
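
To fix the idea, here is a tiny sketch of FP-tree construction as a counted trie. It is only the insertion phase under my own simplifications: the header-link table and the actual FP-growth mining step are omitted, and the class and function names are mine.

```python
from collections import Counter

class FPNode:
    """One node of the counted trie: an item, a count, and children keyed by item."""
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # First pass: count items and keep only the frequent ones.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_support}
    order = lambda t: sorted((i for i in t if i in freq), key=lambda i: -freq[i])

    # Second pass: insert each transaction's frequent items as a path, incrementing
    # counts along shared prefixes (this sharing is where the compression happens).
    root = FPNode()
    for t in transactions:
        node = root
        for item in order(t):
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root
```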

Monday, May 19, 2008

On Spectral Clustering: Analysis and an algorithm

This paper proposes an extension of the normalized cut to K-way simultaneous segmentation. The authors also study the conditions under which their algorithm works well and demonstrate, with a simple example, what the eigenvectors physically represent. The algorithm is quite easy to implement too.

 

The most important contribution of the paper is identifying the geometric structure of the top k eigenvectors. Specifically, given a perfect similarity function (i.e., f(i, j) = 0 whenever points i and j belong to different classes), the normalized rows of the matrix formed by the top k eigenvectors cluster around k orthogonal points on the unit k-sphere for k-class data, achieving maximum separation. The practical situation, however, is not so ideal: the similarity function is susceptible to noise, and the authors study this case with matrix perturbation theory, giving theoretical guarantees for their algorithm.
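
Here is a compact sketch of the algorithm as I understand it: normalize the affinity, take the top k eigenvectors, renormalize the rows onto the unit sphere, and run k-means on them. The construction of the similarity matrix (e.g. a Gaussian kernel) is left to the caller, and the parameter choices are mine.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k):
    """Cluster n points from a symmetric similarity matrix S (n x n)."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ S @ D_inv_sqrt                   # normalized affinity
    w, V = np.linalg.eigh(L)
    X = V[:, -k:]                                     # top-k eigenvectors as columns
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # rows onto the unit sphere
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```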

 

The simple experiments on a few 2D datasets give promising results, but whether the algorithm works well on higher-dimensional data remains a question. Indeed, I think the largest problem of multi-class clustering is that the number of classes must be specified in advance. This is somewhat contradictory: if we already knew the cluster number, we would probably already have a good idea of the partitioning. Maybe that is why we so often see segmentation algorithms used in an over-segmentation manner.

Normalized Cuts and Image Segmentation

This paper proposes a new method for image segmentation based on spectral theory. The presented algorithm, called the normalized cut, has stimulated a great amount of similar work in this field (e.g. the paper reviewed above). The paper also discusses several related techniques and provides a concrete mathematical derivation of the algorithm, all of which is insightful. I have been learning spectral methods myself these days and found this paper quite helpful :)

 

The normalized cut is based on an intuitive concept - if we want to cluster data properly, we should maximize the within-class similarity while minimizing the between-class similarity. The authors propose the N-cut measure to capture this idea. Although finding the exact solution is NP-hard, a relaxation of the resulting optimization turns out to be a generalized eigenvector problem in the real domain. The normalized cut is superior to the conventional cut measure in that it does not prefer short boundaries. In fact, the idea of the normalized cut is very similar to LDA, and both methods work by solving a generalized eigenvector problem on affinity-like matrices.
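
For my own reference, a minimal sketch of the relaxed two-way cut: solve the generalized eigenproblem (D - W) y = lambda * D y and split on the second-smallest eigenvector. The real algorithm searches for the best splitting threshold and recurses; this toy version just thresholds at the median.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    """Relaxed two-way normalized cut on a symmetric affinity matrix W."""
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)    # generalized symmetric eigenproblem
    y = vecs[:, 1]                 # eigenvector of the second-smallest eigenvalue
    return y >= np.median(y)       # crude split; the paper picks the best N-cut threshold
```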

 

The experiments in the paper do help us appreciate the idea better. In particular, the experiment on 1D data shows how other spectral methods fail in specific cases. I also like the spring-mass analogy; it is really interesting.

Monday, May 12, 2008

A Comparison of Algorithms for Inference and Learning in Probabilistic Graphical Models

This is quite a long and tough paper. The authors choose a practical computer vision problem to demonstrate several inference and learning algorithms on various graphical models. The algorithms are variants of EM, including ICM, Gibbs sampling, variational, structured variational and max-product BP. Finally, experiments on small-scale data are presented, giving us some basic ideas of the algorithms' behaviour.

 

According to the paper, graphical models can be divided into three kinds: Bayesian networks, MRFs and factor graphs, with the last one subsuming the first two. By representing model parameters as random variables, we can run many approximate learning procedures on the model given training data. Each learning procedure has its own properties. For example, exact EM keeps an acceptable amount of the model's randomness, so given a not-too-bad initialization it often converges to a good solution. ICM ignores randomness and takes a greedy step at every possible update; as a result, its solutions are usually the worst. Belief propagation uses a simple updating mechanism, yet it produces satisfying results on many problems sufficiently fast.

 

The experimental result of the exact EM method on the occlusion problem is truly astonishing. It is hard to believe such quality can be achieved without introducing a smoothness prior or initial background/foreground priors. I wonder how it would turn out if such elements were put into play. In any case, this paper serves as an excellent reference on probabilistic inference and learning methods for graphical models.

Monday, May 5, 2008

Learning Low-Level Vision

This paper is one of the prominent early works applying MRFs to computer vision problems. The author demonstrates that three problems (super-resolution, intrinsic images and motion estimation) can be solved efficiently through proper modelling with an MRF. Although the author frames this approach as a "learning" process, the computational problem we actually solve is combinatorial optimization over a huge number of states. Search spaces of such scale were previously explored only with generic local search algorithms such as simulated annealing, GAs, etc., which often return weak local optima.

 

A Markov random field consists of two components: nodes and edges. Each node ranges over a pre-determined set of possible states, each of which corresponds to a local decision with an actual meaning. An energy function is then designed to encourage the nodes' decisions to suit our purpose, based on the individual decision of each node and on the pair of decisions at the two ends of each edge. That is the concept of MRF optimization.

 

Formulating a problem under the MRF framework means finding the data-term and smoothness-term energy functions. The data term, sometimes called the evidence term, measures how well a node's decision is locally supported by the statistics of the problem instance. The smoothness term, on the other hand, measures how compatible one node's decision is with its neighbour's across the common edge. The definition of both functions is otherwise left to the designer. For example, in this paper the author chooses example patches from a large dataset as the possible states of the MRF nodes, which effectively quantizes the huge decision space to a few most "natural" choices (a tiny generic sketch of such an energy follows below). The resulting MRF problem is then solved with belief propagation.
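
Just to pin down the two terms, here is a generic toy energy over a discrete MRF. It is not the paper's patch-based formulation; the array layout and names are my own.

```python
import numpy as np

def mrf_energy(labels, unary, pairwise, edges):
    """Total energy of a labelling on a discrete MRF.

    labels:   chosen state index per node
    unary:    unary[n, s]    = data-term cost of node n taking state s
    pairwise: pairwise[s, t] = smoothness cost of neighbouring states (s, t)
    edges:    list of (n, m) node index pairs
    """
    data = sum(unary[n, labels[n]] for n in range(len(labels)))
    smooth = sum(pairwise[labels[n], labels[m]] for n, m in edges)
    return data + smooth
```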

 

Strictly speaking, the answer we find through belief propagation is still a local optimum, just with theoretical guarantees of higher quality. The visual quality of the answer, however, turns out to meet most people's needs in many cases. BP is certainly not the only way to solve MRF problems. In recent years a lot of new solvers have been proposed, including graph cuts and tree-reweighted message passing, which provide solutions of even higher quality in more efficient ways. The remaining problem is to find a really suitable energy for your problem, which is, after all, the fundamental difficulty in algorithm design.

An introduction to graphical models

This paper introduces the basic ideas of and the recent progress in graphical models. Graphical models have gained more and more attention across the computer science community in recent years, partly because they provide an intuitive way to encode our thoughts about an algorithm's behaviour. Another important advantage of graphical models is that a model successful on one problem can often be transplanted to many other problems with a few problem-specific modifications. This greatly saves researchers' effort.

 

Typically, a graphical model consists of nodes, which represent random variables in a system, and edges, which stand for probabilistic relationships or causality between nodes. Depending on the nature of the problem, the graph structure might contain loops, which make a fundamental difference when operating on the model. Inference and learning are the two most common operations on graphical models: inference tries to solve problems by reasoning through the model, and learning optimizes the model given observations. Exact inference on loopy graphs is NP-hard in general, so approximate inference methods are essential in many applications.

 

Famous models and algorithms related to graphical models include MRFs, HMMs, EM, MCMC, etc. In fact, a brief survey of many fields in computer science shows that most of them are flooded with graphical models. For example, the recent progress in computer vision is heavily based on the success of Bayesian inference and MRFs. I am optimistic that graphical models will become even more popular in the future.

Tuesday, April 29, 2008

Combining Labeled and Unlabeled Data with Co-Training

This paper proposes the idea of co-training: using inexpensive unlabeled data to improve initially weak classifiers. The weak classifiers are first trained on a few labeled examples to achieve coarse discrimination. The procedure then repeatedly selects some instances from the unlabeled pool, labels them, and re-trains the classifiers on the newly labeled data together with everything labeled before. This approach can dramatically improve the classifiers' performance, since they are effectively learnt on a larger set of "training data" (a rough sketch of the loop follows below).
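
Here is my own toy rendering of that loop, assuming two feature views and naive Bayes as a stand-in base learner (the paper's experiments use a different setup, and the selection scheme here is simplified to "most confident per round").

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, U1, U2, rounds=10, per_round=5):
    """Toy co-training: (X1, X2, y) labeled in two views, (U1, U2) unlabeled."""
    X1, X2, y = list(X1), list(X2), list(y)
    U1, U2 = list(U1), list(U2)
    for _ in range(rounds):
        f1 = MultinomialNB().fit(X1, y)
        f2 = MultinomialNB().fit(X2, y)
        if not U1:
            break
        # Indices each classifier is most confident about, on its own view.
        top1 = np.argsort(-f1.predict_proba(U1).max(axis=1))[:per_round]
        top2 = np.argsort(-f2.predict_proba(U2).max(axis=1))[:per_round]
        chosen = sorted(set(top1) | set(top2))
        for i in chosen:
            # Label with whichever classifier claimed the example (f1 wins ties).
            label = f1.predict([U1[i]])[0] if i in set(top1) else f2.predict([U2[i]])[0]
            X1.append(U1[i]); X2.append(U2[i]); y.append(label)
        U1 = [u for j, u in enumerate(U1) if j not in set(chosen)]
        U2 = [u for j, u in enumerate(U2) if j not in set(chosen)]
    return f1, f2
```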

 

For this approach to work, each example must contain two (or more) kinds of independent cues for classification. Furthermore, the authors assume that both cues are consistent in their decisions, i.e., taking f1 and f2 as the classifiers for cues x1 and x2, f1(x1) should equal f2(x2) with probability close to 1. These two conditions allow cues of one kind to be connected (and thus strengthened) through "bridging" on the other kind. For example, assume observation 1 contains features A and B and observation 2 contains features C and D, in the first and second kind respectively. Given a new observation with features C and B, we may assign all three observations to the same class, since we believe the two cues are in agreement. This forms the basis of co-training.

 

The experiments on real data in the paper do support the idea of co-training. Nevertheless, I think one question here is whether we should put all the work on the learning algorithm. If we do get two kinds of features for each example, why not first perform dimension reduction such as LDA so that examples of the same class cluster together? In this way, maybe the so-called connected-component substructure could be discovered before any learning algorithm is run. I would like to know if anyone has done research on such topics.

Monday, April 14, 2008

[Paper Review] Improved Boosting Algorithms Using Confidence-rated Predictions

This paper is a thorough treatment of the original AdaBoost algorithm. It discusses the algorithm's behavior from many aspects, gives theoretical bounds and proposes several improvements. The authors also validate the new ideas with experiments on real data. The new, improved AdaBoost is more convenient in that one can assign real-valued (confidence-rated) responses to the weak learners.

 

The most attractive feature of AdaBoost is its simplicity. Given an almost arbitrary configurable classifier, one can enhance its performance considerably by plainly following the simple steps of AdaBoost. The original classifier can be very crude, perhaps only slightly better than random guessing. During training, AdaBoost forces the weak classifier to "learn" more from the examples it failed on before, thus increasing the accuracy of the ensemble. The training error of AdaBoost approaches zero at a rate exponential in the number of iterations. The final classifier is a linear combination of all the classifiers used in the process. This voting structure is a compromise among classifiers, each looking at a small portion of the high-dimensional instance space. For multi-label or multi-class problems, one can map them to the binary case by forming "instance-label" or "instance-class" tuples and asking whether each tuple is valid.
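
To remind myself of those "simple steps", here is a sketch of plain (discrete) AdaBoost with decision stumps. The confidence-rated variant in the paper generalizes the weak-learner outputs and the alpha formula; the stump learner and parameters here are my own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=50):
    """Discrete AdaBoost; labels y must be in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # example weights
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                           # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)           # upweight misclassified examples
        w /= w.sum()
        stumps.append(stump); alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```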

 

One obvious weakness of AdaBoost is that it neglects interactions among individual classifiers. Quite possibly, a prediction could be made with high confidence only when two or three specific classifiers say "yes" at the same time. The linear combination rule of AdaBoost puts this burden on the adaptability of the weak learner, which may not always be a good idea. Another extension of AdaBoost might be to modify it so that inherently different kinds of classifiers can work together.

[Paper Review] Rapid Object Detection using a Boosted Cascade of Simple Features

This paper shows how to apply machine learning techniques to high-speed object detection (or, more precisely, template matching). The idea is to choose a kind of coarse feature that can be computed very efficiently and to improve its performance by AdaBoost learning and filter cascading. The proposed system runs in real time for face detection with accuracy comparable to the state of the art.

 

Apparently, the most important contribution of this work is extending AdaBoost so that one can choose, from a large pool of available features, a small yet equally discriminative subset. The extension is not complex, however: in each iteration you simply train all available weak classifiers (features) and select the most promising one. The other two ideas of the work are more straightforward. The integral image is essentially an old dynamic-programming trick, and filter cascading is something everyone would probably do in a high-speed application. Nevertheless, the resulting system demonstrates a great improvement over existing methods (in face detection).
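
As a small aside, this is the integral-image trick in a few lines (my own sketch, assuming a 2D numpy array as input); Haar-feature values are then just differences of such box sums.

```python
import numpy as np

def integral_image(img):
    """Summed-area table padded with zeros: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) using four lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```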

 

One fairly vital problem ignored in the paper is object orientation. The proposed framework does not seem able to deal with rotated objects through simple modifications. Although one may blindly repeat the same detection algorithm at every possible angle, this would severely slow down the system. I think this is why I have not seen anyone apply the same idea to objects other than frontal faces.

Monday, April 7, 2008

[Paper Review] Names and Faces in the News

This is an astonishing work. The paper proposes a system framework that can perform face identification and classification on large-scale databases collected from the Internet, which may contain ambiguous name labels, a broad range of individuals and extremely varied photo conditions. Four main parts constitute the system: a simple yet effective data acquisition method, robust face rectification, a scalable approximation of PCA and a reliable clustering algorithm.

 

I think the most important contribution of this work is applying the Nystrom approximation to the PCA problem. Although this may not be entirely novel (as seen from the references), it does show that such an approximation is practical and works well on real data; much follow-up research could be based on this result. The rectification step, even if not highlighted in the paper, is I think also vital to the success: the proposed algorithm runs in a short time (important for such a large database) and provides results of sufficient quality. The clustering algorithm is somewhat straightforward, but is fine for the purpose.

 

One weakness of the work is that it does not provide complete retrieval experiments on the clustered data. The proposed information-theoretic measurement appears reasonable, but it does not give an intuitive feeling for the data quality (I did not fully understand it either :S). I am currently working on a problem that also deals with Internet photo collections, and the earlier parts of the paper gave me much inspiration.

[Paper Review] Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

This paper addresses the problem of image annotation by estimating a probabilistic relationship between image segments and text words. The main advantage of this approach is the ability to establish a direct correspondence between the two worlds, so that one knows which kinds of segments contribute to which words. In this way the user can retrieve images with conventional text search, without manual pre-annotation of the database.

 

The probability table is estimated with an EM approach. As usual, we first extract features from image segments and perform vector quantization to represent the high-dimensional continuous pixel-domain information in a discrete space. The EM algorithm then computes the probability of a word given a blob (the quantized feature) by repeatedly computing soft assignments from the probabilities and re-estimating the probabilities from the assignments. Once we have this probability table, we can annotate images with the highest-probability words of the segments they contain.
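
Below is a rough sketch of that alternation in the spirit of the lexicon/translation-table EM used in the paper; the data format, smoothing and names are my own, and details like the NULL word are omitted.

```python
from collections import defaultdict

def train_translation_table(pairs, words, blobs, iters=20):
    """EM estimation of p(word | blob) from (blob list, word list) pairs per image.

    E-step: softly assign each word of an image to that image's blobs.
    M-step: re-estimate p(word | blob) from those soft assignments.
    """
    p = {b: {w: 1.0 / len(words) for w in words} for b in blobs}   # uniform init
    for _ in range(iters):
        counts = defaultdict(lambda: defaultdict(float))
        for img_blobs, img_words in pairs:
            for w in img_words:
                z = sum(p[b][w] for b in img_blobs)                # normalizer
                for b in img_blobs:
                    counts[b][w] += p[b][w] / z                    # soft assignment
        for b in blobs:                                            # M-step
            total = sum(counts[b].values()) or 1.0
            for w in words:
                p[b][w] = counts[b][w] / total
    return p
```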

 

Although the approach seems promising, it is obvious that the experimental performance is not very good. In the annotation test, almost all the image retrieval queries return a precision lower than 0.4, and only a few words can be queried successfully if the threshold is increased. The situation is better in the correspondence test, but the prediction rates are still below 0.5 in general. Finally, neither of the two refinements the authors propose (thresholding and merging) improves performance much. I believe the problem comes from faulty segmentation and imperfect features. General image segmentation is still an active research area in which no ideal solution has been found. On the other hand, some erroneous labelings look like the result of bad feature representations (classifying sky with clouds as water). The visual-word approach would not suffer from this problem (or would to a lesser extent), since it requires only two levels of abstraction (features, clustering), not three (segmentation, features, clustering).

....

Windows Live Writer ate one of my posts... @@"

 

Please be careful if you use it too... You have to open a new file before starting a new post, or WLW will replace the last one with the new post... xxx

Monday, March 31, 2008

[Paper Review] Similarity Search in High Dimensions via Hashing

This paper presents the Locality-Sensitive Hashing (LSH) algorithm for approximate nearest-neighbor (ANN) search and improves over former results to guarantee sub-linear query time. LSH is popular in the ANN field because it provides tight theoretical estimates and upper bounds on many aspects of its behaviour, which is useful when such assurance is appreciated or critical (e.g. real-time applications).

 

The general concept of LSH is simple: if we can design a hash function that tends to hash close points into the same bins, then we only need to look at a few bins to find the nearest neighbor of a query. The design of the hash function differs for different distance metrics. In this paper, the authors concentrate on the L1 norm and argue that the L2 case can be approximated through appropriate scaling and rotation of the data. This is interesting to me, but I did not find time to read the original material. Please tell me if you can explain it :).

 

The hashing for L1-norm LSH goes as follows. We first scale the point coordinates so that the smallest separation along each dimension exceeds one. The coordinates of each point are then discretized, "unarized" and concatenated into a single binary vector. The LSH function consists of several sub-functions, each sampling a number of bits from this binary vector at pre-determined, randomly chosen positions. Each point is put into one hash bin according to the value of each such sub-function. To deal with the huge number of possible bins, the bins are further hashed using standard hashing techniques. When one needs the nearest neighbor of a query point, one runs the query through the same process and looks only at the points in the bins indicated by the hash functions. By carefully choosing the number of sub-functions and sampling positions, LSH can guarantee a high probability of finding the true nearest neighbor. In fact, LSH resembles histograms: the sub-functions of LSH are simply a collection of low-dimensional histograms, with the hope that the true NN falls in the same cell as the query in at least one of them.
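
Here is a small bit-sampling sketch of that scheme for non-negative integer coordinates; it follows the general recipe described above, but the parameter choices, class name and the absence of the secondary standard hashing are my own simplifications.

```python
import numpy as np
from collections import defaultdict

class BitSamplingLSH:
    """Bit-sampling LSH for the L1 norm over non-negative integer coordinates.

    Each point is expanded into a unary bit vector (C bits per dimension,
    where C bounds the coordinate values); each of the `tables` hash functions
    samples `bits` random positions of that vector.
    """
    def __init__(self, dim, C, tables=10, bits=24, seed=0):
        rng = np.random.default_rng(seed)
        self.C = C
        self.samples = [rng.integers(0, dim * C, size=bits) for _ in range(tables)]
        self.tables = [defaultdict(list) for _ in range(tables)]

    def _unary(self, p):
        # Unary code: for coordinate value v, the first v of C bits are 1.
        bits = np.zeros(len(p) * self.C, dtype=np.uint8)
        for d, v in enumerate(p):
            bits[d * self.C: d * self.C + int(v)] = 1
        return bits

    def insert(self, idx, p):
        u = self._unary(p)
        for s, table in zip(self.samples, self.tables):
            table[u[s].tobytes()].append(idx)

    def query(self, q):
        u = self._unary(q)
        cand = set()
        for s, table in zip(self.samples, self.tables):
            cand.update(table[u[s].tobytes()])
        return cand   # candidates only; compute exact L1 distances on these
```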

 

The experiments in the paper are appropriate. In particular, the choice of disk accesses as a measurement is brilliant, because many large-scale problems generate feature pools of millions to tens of billions of data points, which cannot fit in main memory. My question is that the error seems to climb considerably from the first dataset to the second; I wonder if this is a problem for LSH when dealing with large data. After all, an average error of 15% does not sound attractive. Another problem with LSH is the need to determine the number of sampling positions (judging from the experiments), which is troublesome, as this is not required in tree-based ANN search.

Monday, March 24, 2008

[Paper Review] Algorithms and Applications for Approximate Nonnegative Matrix Factorization

This paper is a general survey of recent advances in the NMF area. Given a non-negative matrix A, NMF finds a pair of likewise non-negative matrices W and H such that the Frobenius norm of A - WH is minimized. The non-negativity has the advantage of properly reflecting physical constraints where only non-subtractive combinations of features are acceptable (for example, topics in textual data). In this respect NMF differs from the classical SVD, where no such non-negativity restriction is imposed.

 

The benefits of non-negativity do not come without a price. In contrast to SVD, NMF is nonconvex in W and H jointly and requires numerical methods for non-linear optimization. According to the paper, this seems to be the core problem researchers work on in this area. The original multiplicative update algorithm, although simple, can converge to saddle points and is very slow. General methods such as gradient descent also apply to NMF, but require a good mechanism for choosing the step size. The alternating least squares approach seems to be the fastest method for now and enjoys the flexibility of "escaping" from zero entries thanks to the projection step. Nevertheless, it may still converge to a saddle point. Thus, up to the present, no excellent algorithm exists for NMF in general. Users who want good local minima should initialize the optimization from different starting points and check the results for optimality.
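
For reference, the classic multiplicative updates look like this (my own minimal sketch; as the survey notes, this scheme is simple but slow and may stall at saddle points).

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9):
    """Lee-Seung multiplicative updates for min ||A - WH||_F with W, H >= 0."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```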

 

In general, a basic mathematical formulation often does not fully suit a specific problem, and prior knowledge of the problem should be brought in to obtain a good solution. This is even more critical in NMF because the optimal solution is not unique. The paper addresses several common extensions to the NMF cost function and validates the most basic one (regularization on the magnitudes of H or W) through experiments. In my personal experience, regularization (smoothness) terms are usually important factors that affect solution quality and require special care.

 

The two experiments in the paper are text mining and spectral data analysis. The results seem quite good, but the authors do not include comparisons with other methods, so I cannot judge the power of the NMF approach properly. Indeed, the result on text mining looks visually similar to the one in the pLSA paper. It is shown in [1] that NMF with the Frobenius norm replaced by the KL divergence is essentially the same as pLSA, and both kinds of problems are instances of the more general multinomial PCA problem. The connection between these methods, and a performance comparison on different problems, would make a good research topic.

[Paper Review] Probabilistic Latent Semantic Indexing

The author proposes a probabilistic formulation of the latent semantic model for count data. Assuming we have a corpus of text documents, the formulation consists of a set of factors, each a combination of word-to-factor probabilities, document-to-factor probabilities and the mixing proportion of that factor. Given the number of desired factors, the whole model is fitted using tempered expectation maximization (TEM), with some held-out data to avoid over-fitting. The resulting mixture model can then be used in various information retrieval problems.
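
To keep the model straight in my head, here is a sketch of plain (untempered) EM for pLSA on a document-word count matrix. The paper's tempered EM additionally raises the E-step posterior to a power beta and uses early stopping on held-out data; the array shapes and names here are mine.

```python
import numpy as np

def plsa(N, k, iters=100, seed=0):
    """Plain EM for pLSA on a D x W count matrix N; returns P(z), P(d|z), P(w|z)."""
    D, W = N.shape
    rng = np.random.default_rng(seed)
    Pz = np.full(k, 1.0 / k)
    Pd_z = rng.random((D, k)); Pd_z /= Pd_z.sum(axis=0)
    Pw_z = rng.random((W, k)); Pw_z /= Pw_z.sum(axis=0)
    for _ in range(iters):
        # E-step: posterior P(z | d, w) for every (d, w), shape (D, W, k).
        post = Pz[None, None, :] * Pd_z[:, None, :] * Pw_z[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts n(d, w) * P(z | d, w).
        weighted = N[:, :, None] * post
        Pw_z = weighted.sum(axis=0); Pw_z /= Pw_z.sum(axis=0, keepdims=True)
        Pd_z = weighted.sum(axis=1); Pd_z /= Pd_z.sum(axis=0, keepdims=True)
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()
    return Pz, Pd_z, Pw_z
```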


The most important value I see here is the ability to extend the model with smaller models by linearly combining individual probability estimates. This directly avoids having to run an entire new optimization for the extended model, as in the case of LSA, which is certainly time-consuming. Another benefit is that one may not need to know the optimal number of factors before applying the algorithm. Furthermore, the extended model gains additional robustness to noise thanks to model averaging. In this respect, I think pLSA is a solid improvement over LSA.


To validate the usefulness of pLSA, the author also conducts several experiments on document retrieval. Although the performance is indeed superior to the original LSA and term-matching methods, I have to say I did not see great improvements here. In most cases the pLSA variants lead LSA only by a small margin (less than 8 percent), and the use of idf seems to be equally important. I wonder whether there are still significant properties of pLSA left undiscovered in this paper.


Another shortcoming of pLSA that I found on Wikipedia is its tendency to over-fit. This is somewhat reasonable, as pLSA is non-deterministic (sensitive to the choice of held-out data) and the solution space it explores is highly non-linear. Whether a good solution to this problem exists requires further reading.

Thursday, March 13, 2008

Windows Live Writer

test.

a d g
b e h
c f i

 

yaya

Monday, March 10, 2008

[Paper Review] Nonlinear Dimensionality Reduction by Locally Linear Embedding

As the article was published in a popular science venue, we obviously cannot expect to find implementation details in it. Unfortunately, I did not find enough time to read the extended readings on the list, so I will briefly give my opinions on this algorithm.

The concept of LLE is to capture the global underlying structure (i.e., the manifold) of the data by inspecting the local neighborhood of each observation. The algorithm first tries to explain each point as a weighted sum of its neighbors, and then finds a low-dimensional coordinate system that best preserves this weighting relationship. At first glance this might seem strange: how can one get a global picture by looking only at local neighbors? The fact is that in most cases the distances between points that are far apart are not very relevant to the internal mechanism of data generation; we just have to know that they are distant from each other. This is well illustrated by the swiss-roll example in the paper. Linear methods like PCA would incorrectly project the data onto a plane because they treat the distance between every pair of points as "equally important".
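
Running the swiss-roll example today is a few lines with scikit-learn's implementation rather than the authors' code; the neighbor count and dataset parameters below are my own choices.

```python
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.datasets import make_swiss_roll

# Unroll the swiss roll: each point is reconstructed from its 12 nearest
# neighbors, then a 2-D embedding preserving those weights is computed.
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)
print(Y.shape)   # (1500, 2)
```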

This idea may not always work. If the distribution of data points is not well behaved (e.g. densely clustered) or if every distance measurement matters, LLE will fail. I have read one work in texture synthesis that falls into exactly this category. The idea there is to perform dimension reduction on the space of texture patches while preserving their appearance differences, so that synthesis can be sped up by working only with the low-dimensional counterparts. Because patches of one type of texture look quite similar, LLE wrongly mixes up the data, and traditional methods like PCA win by preserving the total scatter. The applicability of manifold methods certainly depends on the data type.

[Paper Review] Eigenfaces vs. Fisherfaces - Recognition Using Class Specific Linear Projection

The main contribution of this work is the application of the Fisher Linear Discriminant to face recognition. The idea of FLD (or, more commonly, LDA) is quite simple: since our objective is to classify the input into several categories and we are going to compress the dimensions anyway, why not choose the dimension-reducing mapping that best serves this purpose? This is in contrast to methods like SVM, where the data are projected into a higher-dimensional space for better separation.

Since FLD is designed for classification, it is no wonder it outperforms pure PCA, as the latter has no such intention. As shown in the paper, PCA aims only to retain the maximum scatter of the data points, so the resulting axes may mix up originally well-separated groups. Because FLD maximizes the ratio of between-class scatter to within-class scatter during optimization, it does not suffer from this kind of confusion. However, because the data dimension is much larger than the number of observations, one can find projection directions along which the within-class scatter becomes zero, so a PCA pre-processing step is still required.
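
The PCA-then-LDA recipe looks roughly like this when pieced together from scikit-learn; this is only my sketch of the Fisherface idea (project to N - c dimensions so the within-class scatter is non-singular, then to c - 1 discriminant directions), with the function names and interface being mine.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fisherfaces(X, y, n_classes):
    """X: (N, D) flattened face images, y: class labels. Returns a projection fn."""
    n_samples = X.shape[0]
    pca = PCA(n_components=n_samples - n_classes).fit(X)
    lda = LinearDiscriminantAnalysis(n_components=n_classes - 1).fit(pca.transform(X), y)
    return lambda Z: lda.transform(pca.transform(Z))

# Usage: project = fisherfaces(train, labels, c); then nearest-neighbor matching
# between project(test) and project(train).
```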

Another observation in the paper is based on the fact that the reflectance of a Lambertian surface is determined solely by the light source direction, the surface normal and the albedo of the object. As the latter two are fixed, the appearance of each pixel (ignoring shadows, self-reflection, etc.) lies in a 3D space spanned by the light source directions. Although this may be common sense in the rendering field, it seems rather interesting to me.

One thing that also interests me is the glasses recognition by FLD. In fact, I have read one work that attacks a similar problem by explicitly building a subspace that accounts for the appearance variation of layered objects. The result of their method on glasses removal looks like this (images from the paper):



This picture perfectly underlines the value of class-aware subspace projection.

Thursday, March 6, 2008

[Paper Review] Eigenfaces for Recognition

This paper discusses the application of principal component analysis to face detection and identification. It seems to be the first complete work applying PCA to face recognition, although such techniques have become quite standard (and somewhat old and out-of-date) by now. Anyway, we should bear in mind that it is a 1991 paper; at that time, even acquiring digital color video was impressive to the public.

The system comprises two main components: face detection and face classification. The authors propose to use PCA to extract the key directions that best explain the variation of face images in the high-dimensional image space. The eigenvectors obtained are supposed to be the most important factors constituting a face. Thus, by projecting an unknown image into this reduced "face space", one can judge whether the image is a face from the projection lengths along each direction. Furthermore, in case the image is indeed a face, we can compare it in face space with a pre-built database of individual faces to identify the person. These are the general ideas of the paper.

Admittedly, many parts of the system are obsolete from today's perspective. For example, AdaBoost now seems to be the most successful method for face detection: using integral images to accelerate Haar-feature computation, plus filter cascading, [Viola 04] shows that accurate face detection at various sizes can be done in real time on a conventional PC. It is unlikely that correlation-based methods like subspace projection can reach the same performance. On the other hand, many methods better suited to face identification were proposed later (for example, the Fisherface). The value of this paper lies in the idea of using principal components to deal with unknown variations of high-dimensional data, which certainly benefits subsequent research such as Active Appearance Models. Another vital contribution, I think, is a practical method for performing eigenvector decomposition with a relatively small number of observations from a high-dimensional space. This is critical even today.
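
That small-sample trick is short enough to write down here as a sketch (my own notation, assuming images are stored as rows of a numpy array):

```python
import numpy as np

def eigenfaces(images, k):
    """Top-k eigenfaces via the small-sample trick.

    For N centered images of dimension D with N << D, eigendecompose the small
    N x N matrix A A^T instead of the D x D scatter A^T A, then map its
    eigenvectors back through A^T.
    """
    A = images - images.mean(axis=0)        # rows are centered images, shape (N, D)
    small = A @ A.T                          # N x N
    vals, vecs = np.linalg.eigh(small)
    idx = np.argsort(vals)[::-1][:k]
    U = A.T @ vecs[:, idx]                   # lift back to image space, shape (D, k)
    U /= np.linalg.norm(U, axis=0)           # normalize each eigenface
    return U
```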

Sunday, March 2, 2008

[Paper Review] Scale & Affine Invariant Interest Point Detectors

This paper proposes the Harris-Affine detector, which is intended to work under substantial affine image transformations (e.g. > 40 degree viewpoint changes). We should note that even though it is claimed to be fully affine invariant, the detector still contains a multi-scale Harris part that is only similarity invariant. This is common to much related work due to computational cost.

The Harris-Affine detector first uses the Harris detector to locate possible interest points in scale space. The found points then undergo an iterative refinement process to compute accurate shape, scale and location information. At each iteration, the algorithm selects a new scale according to the newly refined point location; the scale is chosen with respect to the LoG response and to the resulting isotropy of the second moment matrix. It then searches for the maximum Harris response at this scale to obtain the new point location. The patch around the found location at the found scale is then used to compute the next second moment matrix and to refine the local shape-normalizing transformation. These steps form one complete iteration, and the process stops when nearly perfect isotropy is achieved, often in fewer than 10 iterations.

The experiments show a great improvement over isotropic detectors such as LoG, DoG or Harris-Laplace under large viewing-angle differences. However, they also expose one problem: with better resistance to affine transformations, the detector's regions become less discriminative once most of the distortion has been removed. Another problem is runtime: in the experiments, the Harris-Affine detector takes tens of times longer than the fastest detectors, which raises the question of whether the extra time is worth it. On this point, I think a good option is to combine several detectors so that we can take the merits of all of them. For example, we could use SIFT to find reliable matching images in a database and then apply the Harris-Affine detector to estimate more precise geometric relations. Such a hybrid approach should be more useful than applying any single detector alone.

[Paper Review] Distinctive Image Features from Scale-Invariant Keypoints

This paper introduces the celebrated SIFT feature in great detail and has received nearly 2000 citations on Google Scholar. SIFT is successful because of its general applicability to almost every kind of image distortion with good enough performance. Therefore, in many cases when people need features to match, they choose SIFT.

The name "SIFT" indeed consists of two parts: the feature detector and the descriptor. The SIFT detector checks local extrema over the scale space in the DoG pyramid, while rejecting points with low contrasts or high edge responses. The remaining points' locations are then refined using parabola approximation. The SIFT descriptor then computes the principal gradient orientations over the nearby patches of the points to obtain a local rotation-invariant coordinate system. Finally, a grid-shape gradient orientation histogram is overlaid to describe each feature point as a 128-dim vector.

Although the concept is not much harder than in comparable work, I have to say that a reliable implementation of SIFT is still complicated. I wrote my own copy of SIFT a year ago for my own needs, and just recently I still found a bug in it. Here is a snapshot of SIFT feature matching; you can see that SIFT is really robust to a variety of distortions:



SIFT is not perfect, of course. Theoretically, DoG extrema are not invariant to affine transformations, so we cannot expect SIFT to work under large viewpoint changes. Nevertheless, it has been shown that in most cases SIFT still outperforms other (nearly) affine-invariant feature detectors. I suspect the reasons are that uniformly scaled patches are more distinctive than stretched ones, at the cost of some robustness to affine transformation, and that the design of the SIFT descriptor tolerates some mis-registration of features. This deserves further study.

The SIFT descriptor is also being challenged by some recent research. It is now known that GLOH (Gradient Location and Orientation Histogram) and steerable filter responses are better descriptors for large-database matching. Automatic learning of feature descriptors will certainly be the next mainstream direction in feature matching.

Monday, February 25, 2008

[Paper Review] Image Retrieval: Ideas, Influences, and Trends of the New Age

This paper is quite long and I have to admit that I did not spend enough time on it. Perhaps it is my lack of background knowledge that makes it hard for me to get through every sentence. I tried to apply the first pass described in "How to Read a Paper?", but it did not help too much... I will talk about what I did read in the following anyway.

This comprehensive survey covers the progress of the last decade in CBIR (Content-Based Image Retrieval). It talks about 1) the three design aspects from both the user's and the system's perspective (skipped here), 2) key techniques developed over this period, 3) problems derived from CBIR and 4) evaluation metrics for CBIR systems. One remark I would like to quote from the paper is that text is man's creation, while images are simply there by themselves. No wonder we find images much harder to handle than text.

The core obstacles of CBIR revolve around two main problems, namely how to express an image in a more useful representation and how to establish similarity relationships between images. In fact, we can say that every scientific research field faces the first problem and every abstract domain is subject to the second. Researchers in CBIR have tackled the first problem through feature extraction at different levels, in which the authors observe a shift from global features to local ones. Global features, while capturing the overall characteristics of the entire image, are often too unwieldy to deal with the local transformations exhibited by general images; thus, local descriptors have received more and more attention over the decade. Beyond local features comes summarization or clustering, which aims to produce image signatures that aid later similarity assessment and can be viewed as treating "spaces" of images as a whole.

Similarity estimation seems to me very much like a machine learning problem. Here, researchers look for reasonable energy functions that, after optimization against ground-truth data, can be used to compute a similarity measure between image signatures. At this point one can resort to pre-processing methods (clustering, categorization) or to user relevance feedback to ease the burden of similarity measurement and speed things up. Among these, relevance feedback is distinctive, not only because it is the most direct communication with users but also because it requires careful design to "catch" the user.
Finally, fusing information from other media for better content retrieval is a novel direction.

Along its long path, CBIR has also spawned many branch fields and research topics. These new problems include image annotation, which aims to give conceptual textual descriptions of images; pictures for a story, which finds the pictures best depicting a concept (the reverse of the first); aesthetics measurement, which measures the aesthetic perception of images; imagery security, which exploits the limitations of CBIR systems to help verify identity; and many more. CBIR system evaluation is also briefly mentioned at the end of the paper, showing the lack of related research.

From the authors' perspective, we should pay more attention to application-oriented aspects, since they directly influence the success of any system and deserve to be "considered equally important" in their own right. The growth in the number and size of images also poses a hard barrier for the future of CBIR. In any case, we can anticipate continuing progress in this field.

[Paper Review] How to Give a Good Research Talk?

This paper aims to help people give good talks about their research, especially in the computer science domain. I found many points in this paper valuable, and they should be kept in mind every time we prepare a presentation. In the following I will summarize those points.

One thing so important that I have to repeat it here is the pair of questions at the beginning:
  • Who is my primary audience?
  • If someone remembers only one thing from my talk, what would I like it to be?
We should ask ourselves both questions before starting on the slides, especially the second one. In my experience, we quite often regard the presentation as an artwork or a performance, but forget that the most vital thing is to build the intended comprehension of our ideas in the audience. To this end, we should evaluate everything in our slides: does it help to make things clear?

The other three points proposed in section 2 - in order: using examples, pruning, and being honest - are all important for presentations too. In the past I have tended to conceal problems in my presentations; this is certainly something I should avoid in the future.

Section 3 is somewhat out of date given the recent popularization of PowerPoint. Nevertheless, we can still find many shared pitfalls. For example, people are often inclined to put too many things on a single slide, or to type exactly the words they are going to say. The most important advice here, I think, is not to start writing slides too early; this effectively prevents you from putting in too much content...

Finally, I found two points in the last section important: being careful about visual tricks, and timing. Beginning presenters usually use many animations on their slides, since they are simple to add in PowerPoint, but most of these "visual aids" are in fact annoying and distracting. Over-running is also something we do quite often; as the paper says, it is selfish and rude, and we should all try to finish on time.

Thursday, February 21, 2008

[Paper Review] How to Read a Paper?

In this paper, the author mainly presents his "three-pass" guidelines for efficient paper reading. The audience he targets is graduate students who have just started doing research, but not only them. I will summarize the three passes here, along with my own opinions:

The first pass is to skim the whole paper, somewhat as an ordinary person without background knowledge would do. In this pass, one looks at the abstract, introduction, figures and tables to get a rough understanding of the paper. The author suggests this fast pass as a good indicator for the next "action": either taking one step further or putting the paper aside. One important thing I learned from this section is that one should put more emphasis on those parts when writing a paper; otherwise they will keep readers away from your clever ideas.

In the second pass, the author expects us to read the paper "as we usually do". This includes a complete pass through the paper and thinking over the main ideas proposed. A common mistake I often make is to read every paper I find to this depth, which is very expensive in time if the paper is not really useful. Nevertheless, I think a careful reading is still necessary (and usually proves worthwhile) for many of the papers we decide to click "download" on.

If we really find something beneficial to our work and the second pass does not feel like enough, the third pass is the perfect conclusion. The author recommends virtually re-implementing the paper in your mind, so that you could re-create the work from the same assumptions. I totally agree with him on this point: until we reach for pen and paper, or for code, we seldom truly understand the deeper meaning hidden in a paper.

Finally, the author also talks about a general workflow for surveying an unfamiliar field. One thing I would like to highlight is browsing key researchers' websites to get a whole picture of how the field is going. It is often much easier to do it this way than to scan the proceedings' websites...

Tuesday, February 19, 2008

First shot!

Cheers to the opening of my first blog!

It is amusing that I started blogging only for academic reasons...