Monday, March 31, 2008

[Paper Review] Similarity Search in High Dimensions via Hashing

This paper presents the Locality-Sensitive Hashing (LSH) algorithm for approximate nearest neighbor (ANN) search and improves on earlier results to guarantee sub-linear query time. LSH is a popular algorithm in the ANN field because it comes with tight theoretical estimates and upper bounds on many aspects of its behavior. This is useful when such assurance is appreciated or critical (e.g., real-time applications).

 

The general concept of LSH is simple: if we can somehow design a hash function that reliably hashes close points into nearby bins, then we only need to look at a few bins when finding the nearest neighbor of a query. The design of the hash function differs for different distance metrics. In this paper, the authors concentrate on the L1 norm and argue that the L2 norm can be approximated through appropriate scaling and rotation of the data. I find this interesting, but I haven't had time to read the original material. Please tell me if you can explain this :).

 

The hashing for the L1-norm LSH goes like this. We first scale the point coordinates so that the smallest separation in each dimension exceeds one. The coordinates of each point are then discretized, "unarized", and concatenated to form a single binary vector. The LSH function consists of several sub-functions, each sampling a number of bits from this binary vector at pre-determined, randomly chosen positions. Each point is put into a hash bin according to the value of each such sub-function. To deal with the huge number of possible bins, the bins are further hashed using standard hashing techniques. To find the nearest neighbor of a query point, one runs the query through the same process and looks only at the points in the bins indicated by the hash functions. By carefully choosing the number of sub-functions and sampling positions, LSH can guarantee a high probability of finding the true nearest neighbor. In fact, LSH resembles histograms: its sub-functions are simply a collection of low-dimensional histograms, built with the hope that the true NN falls in the same cell as the query in at least one of them.
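
To make the bit-sampling idea concrete, here is a minimal Python sketch of my own (not the paper's actual implementation; the function and variable names are mine, and points are assumed to already be scaled to integer coordinates in [0, C]):

```python
import numpy as np

rng = np.random.default_rng(0)

def unarize(points, C):
    """Encode each integer coordinate in [0, C] as C unary bits
    (value v -> v ones followed by C - v zeros), then concatenate."""
    n, d = points.shape
    bits = np.zeros((n, d * C), dtype=np.uint8)
    for j in range(d):
        for v in range(C):
            bits[:, j * C + v] = points[:, j] > v
    return bits

def build_tables(bits, k, L):
    """Build L hash tables, each sampling k random bit positions."""
    tables = []
    for _ in range(L):
        positions = rng.choice(bits.shape[1], size=k, replace=False)
        table = {}
        for idx, b in enumerate(bits):
            key = tuple(b[positions])      # further hashed by the standard dict hashing
            table.setdefault(key, []).append(idx)
        tables.append((positions, table))
    return tables

def query(q_bits, tables):
    """Return the union of candidate indices found in the L buckets of the query."""
    candidates = set()
    for positions, table in tables:
        candidates.update(table.get(tuple(q_bits[positions]), []))
    return candidates
```

One then ranks only the returned candidates by L1 distance; the paper's analysis prescribes how to pick k and L for a target success probability.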

 

The experiments in the paper are appropriate. In particular, the choice of measuring disk accesses is brilliant, because many large-scale problems generate feature pools of millions to tens of billions of data points, which cannot fit in main memory. One concern is that the error seems to climb considerably from the first dataset to the second. I wonder whether this is a problem of LSH when dealing with large data; after all, an average error of 15% does not sound attractive. Another drawback of LSH is the need to determine the number of sampling positions (judging from the experiments), which is troublesome because tree-based ANN search does not require this.

Monday, March 24, 2008

[Paper Review] Algorithms and Applications for Approximate Nonnegative Matrix Factorization

This paper is a general survey of recent advances in the NMF area. Given a non-negative matrix A, NMF finds a pair of likewise non-negative matrices W and H such that the Frobenius norm of A - WH is minimized. The non-negativity constraint has the advantage of properly reflecting physical settings in which only non-subtractive combinations of features are acceptable (e.g., topics of textual data). In this way, NMF differs from the classical SVD decomposition, where no such non-negativity restriction is imposed.

 

The benefits of non-negativity do not come without a price. In contrast to SVD, NMF is nonconvex in both W and H and requires numerical methods for non-linear optimization. According to the paper, this seems to be the core problem researchers work on in this area. The original multiplicative update algorithm, although simple, can converge to saddle points and is very slow. General methods such as gradient descent apply to NMF, but require a good mechanism to determine the step size. The alternating least squares approach seems to be the fastest method for now and enjoys the flexibility of "escaping" from zero entries thanks to the projection step. Nevertheless, it may still converge to a saddle point. Thus, at present there is no excellent algorithm for NMF in general. Users aiming for local minima should initialize the optimization from different starting points and check the results for optimality.
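
As a concrete illustration of the alternating least squares idea, here is a rough NumPy sketch of my own (not the authors' exact algorithm): each step solves an unconstrained least squares problem for one factor and then projects negative entries back to zero.

```python
import numpy as np

def nmf_als(A, r, iters=200, seed=0):
    """Crude projected ALS: approximate a nonnegative A (m x n) as W @ H."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, r))
    for _ in range(iters):
        # Least-squares solve for H with W fixed, then project onto H >= 0.
        H = np.maximum(np.linalg.lstsq(W, A, rcond=None)[0], 0)
        # Least-squares solve for W with H fixed, then project onto W >= 0.
        W = np.maximum(np.linalg.lstsq(H.T, A.T, rcond=None)[0].T, 0)
    return W, H
```

In practice one would monitor ||A - WH||_F across iterations, try several random restarts, and possibly add the regularization terms discussed below.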

 

In general, a bare mathematical formulation often does not suit a specific problem, and prior knowledge of the problem should be brought in to obtain a good solution. This is even more critical in NMF because the optimal solution is not unique. The paper addresses several common extensions to the NMF cost function and validates the value of the most basic one (regularization on the magnitudes of H or W) through experiments. In my personal experience, regularization (smoothness) terms are usually important factors that affect solution quality and require special care.
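
For reference, this kind of basic regularized cost can be written as follows (the notation is mine; α and β weight the penalties on the magnitudes of W and H):

```latex
\min_{W \ge 0,\; H \ge 0}\ \tfrac{1}{2}\,\lVert A - WH \rVert_F^2
  \;+\; \alpha\,\lVert W \rVert_F^2 \;+\; \beta\,\lVert H \rVert_F^2
```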

 

The two experiments in the paper are text mining and spectral data analysis. The results seem quite good, but the authors did not include comparisons with other methods, so I can't properly judge the power of the NMF approach. Indeed, the result on text mining is visually similar to the one in the pLSA paper. It is shown in [1] that NMF with the Frobenius norm replaced by KL divergence is essentially the same as pLSA, and that both kinds of problems are instances of the more general multinomial PCA problem. The connections between these methods and their relative performance on different problems would make a good research topic.

[Paper Review] Probabilistic Latent Semantic Indexing

The author proposes a probabilistic formulation of latent semantic models on count data. Assuming we have a corpus of text documents, the formulation consists of a set of factors, each described by word-to-factor probabilities, document-to-factor probabilities, and the mixing proportion of that factor. Given the number of desired factors, the whole model is optimized using tempered expectation maximization (TEM) with some held-out data to avoid over-fitting. The resulting mixture model can then be used in various information retrieval problems.
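
Below is a minimal (untempered) EM sketch of this aspect model, written by me in NumPy with my own variable names; the tempered variant of the paper additionally dampens the E-step posteriors with an exponent β < 1 chosen on held-out data.

```python
import numpy as np

def plsa(N, K, iters=100, seed=0):
    """N: (D x W) document-word count matrix, K: number of latent factors z."""
    D, W = N.shape
    rng = np.random.default_rng(seed)
    Pz = np.full(K, 1.0 / K)                                              # P(z)
    Pd_z = rng.random((K, D)); Pd_z /= Pd_z.sum(axis=1, keepdims=True)    # P(d|z)
    Pw_z = rng.random((K, W)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)    # P(w|z)
    for _ in range(iters):
        # E-step: posterior P(z|d,w) proportional to P(z) P(d|z) P(w|z).
        joint = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]   # (K, D, W)
        post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w).
        weighted = N[None, :, :] * post
        Pw_z = weighted.sum(axis=1) + 1e-12
        Pw_z /= Pw_z.sum(axis=1, keepdims=True)
        Pd_z = weighted.sum(axis=2) + 1e-12
        Pd_z /= Pd_z.sum(axis=1, keepdims=True)
        Pz = weighted.sum(axis=(1, 2)); Pz /= Pz.sum()
    return Pz, Pd_z, Pw_z
```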


The most important value I see here is the ability to extend the model with smaller models by linearly combining their individual probability estimates. This directly avoids the need to redo an entire optimization for the new model, as is the case in LSA, which is certainly time-consuming. Another benefit is that one may not need to know the optimal number of factors before applying the algorithm. Furthermore, the extended model gains additional robustness to noise due to model averaging. In this respect, I think pLSA is a solid improvement over LSA.
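
Concretely, the combined model is just a convex combination of factor models trained with different numbers of factors (my notation):

```latex
P(w \mid d) \;=\; \sum_{k} \lambda_k\, P_k(w \mid d),
\qquad \lambda_k \ge 0,\ \ \sum_k \lambda_k = 1
```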


To validate the usefulness of pLSA, the author also conducted several experiments on document retrieval. Although the performance is indeed superior to the original LSA and to term-matching methods, I have to say I did not see great improvements here. In most cases, the pLSA variants lead LSA only by a small margin (less than 8 percent), and the use of idf seems to be equally important. I wonder whether some significant properties of pLSA remain undemonstrated in this paper.


Another shortcoming of pLSA that I found on Wikipedia is over-fitting. This is somewhat understandable, as pLSA is non-deterministic (with respect to the choice of held-out data) and the solution space it explores is highly non-linear. Whether a good solution to this problem exists requires further reading.

Thursday, March 13, 2008

Windows Live Writer

test.

a d g
b e h
c f i

 

yaya

Monday, March 10, 2008

[Paper Review] Nonlinear Dimensionality Reduction by Locally Linear Embedding

As the article was published in a popular-science venue, we obviously can't expect to find implementation details in it. Unfortunately, I didn't find enough time to go through the extended readings on the list, so I will briefly give my opinions on this algorithm.

The concept of LLE is to capture the global underlying structure (i.e., the manifold) of the data by inspecting the local neighborhood of each observation. The algorithm first tries to explain each point as a weighted sum of its neighbors, and then finds a low-dimensional coordinate system that best preserves these weights. At first glance this might seem strange: how can one get a global picture by looking only at local neighbors? The point is that in most cases the distances between far-apart points are not very relevant to the internal mechanism of data generation; we just have to know they are distant from each other. This is well illustrated by the Swiss-roll example in the paper. Linear methods like PCA would incorrectly project the data onto a plane because they treat the distance between every pair of points as "equally important".
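
A quick way to reproduce something like the Swiss-roll example, assuming scikit-learn is available (this uses its built-in LLE rather than the authors' original code):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

# Sample points from a 3-D Swiss roll; `color` encodes position along the roll.
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# LLE: reconstruct each point from its 12 nearest neighbors, then embed in 2-D
# while preserving the reconstruction weights.
Y_lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                               random_state=0).fit_transform(X)

# PCA for comparison: a linear projection that flattens the roll incorrectly.
Y_pca = PCA(n_components=2).fit_transform(X)
```

Plotting Y_lle colored by `color` shows the roll unrolled into a flat sheet, while Y_pca stacks distant layers of the roll on top of each other.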

This idea may not always work. If the distribution of data points is not well-behaved (e.g., densely clustered) or all distance measurements are important, LLE will fail. I have read one work in texture synthesis that falls into exactly this category. The idea there is to perform dimensionality reduction on the space of texture patches while preserving their appearance differences, so that synthesis can be sped up by operating on the low-dimensional counterparts. Because patches of one type of texture look quite similar, LLE wrongly mixes up the data, and traditional methods like PCA win by maintaining the total scatter. The applicability of manifold methods certainly depends on the type of data.

[Paper Review] Eigenfaces vs. Fisherfaces - Recognition Using Class Specific Linear Projection

The main contribution of this work is the application of the Fisher Linear Discriminant to face recognition. The idea of FLD (or, more commonly, LDA) is quite simple: since our objective is to classify the input into several categories and we are going to compress the dimensions anyway, why not choose a dimension-reducing mapping that best serves this purpose? This is in contrast to methods like SVM, where the data are projected into a higher-dimensional space for better separation.

Since FLD is designed for classification, it is no wonder it can outperform pure PCA, as the latter has no such special intention. As shown in the paper, PCA aims only to retain the most scatter in the data, so the resulting axes may mix up originally well-separated groups. Because FLD maximizes the ratio of between-class scatter to within-class scatter during optimization, it does not suffer from this kind of confusion. However, because the dimensionality of the observations is much larger than their number, one can always find a projection subspace in which the within-class scatter becomes zero, so a pre-processing PCA step is still required.
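
A compact sketch of this PCA-then-FLD pipeline with scikit-learn (my own code and names; X is an n_samples x n_pixels matrix of vectorized face images and y their identity labels):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fisherfaces(X, y):
    n, c = X.shape[0], len(np.unique(y))
    # Step 1: PCA down to at most n - c dimensions so that the within-class
    # scatter matrix in the reduced space is no longer singular.
    pca = PCA(n_components=min(n - c, X.shape[1]))
    X_pca = pca.fit_transform(X)
    # Step 2: FLD/LDA finds at most c - 1 directions maximizing the ratio of
    # between-class to within-class scatter.
    lda = LinearDiscriminantAnalysis(n_components=c - 1)
    X_fld = lda.fit_transform(X_pca, y)
    return pca, lda, X_fld
```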

Another observation in the paper is based on the fact that the reflectance of a Lambertian surface is determined solely by the light source direction, the surface normal, and the albedo of the object. Since the latter two are fixed, the appearance of each pixel (ignoring shadows, self-reflection, etc.) lies in a 3D space spanned by the light source directions. Although this might be common knowledge in the rendering field, it was rather interesting to me.
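
In symbols (my notation; shadows and interreflections ignored, as in the paper's assumption), a pixel with albedo ρ(x) and unit normal n(x), lit by a distant source described by the 3-vector l, has intensity

```latex
I(x) \;=\; \rho(x)\,\mathbf{n}(x)^{\top}\mathbf{l}
```

which is linear in l, so all such images of a fixed object lie in a 3-D linear subspace.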

One thing that also interests me is the recognition of glasses by FLD. In fact, I have read one work that attacks a similar problem by explicitly building a subspace that accounts for the appearance variation caused by layered objects. The result of their method on glasses removal looks like this (images from the paper):



This picture perfectly underlines the value of class-aware subspace projection.

Thursday, March 6, 2008

[Paper Review] Eigenfaces for Recognition

This paper describes the application of principal component analysis (PCA) to face detection and identification. It seems to be the first complete work applying PCA to face recognition, although such techniques have become quite standard (and somewhat old and out-of-date) by now. Anyway, we should bear in mind that it is a 1991 paper; at that time, even acquiring digital color video was impressive for the general public.

The system comprises two main components: face detection and face classification. The author proposes using PCA to extract the key directions that best explain the variation of face images in the high-dimensional image space. The eigenvectors obtained are taken to be the most important factors that constitute a face. Thus, by projecting an unknown image onto this reduced "face space", one can judge whether the image is a face from how well the projection represents it. Furthermore, if the image is indeed a face, we can compare it in face space with a pre-built database of individual faces to identify the person. These are the general ideas of the paper.
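
A sketch of how I understand the detection/identification logic, once the mean face and the eigenfaces U (pixels x k) have been computed; the thresholds and variable names are hypothetical:

```python
import numpy as np

def project(img_vec, mean_face, U):
    """Project a vectorized image into face space and measure how well it is represented."""
    w = U.T @ (img_vec - mean_face)           # face-space coordinates
    recon = mean_face + U @ w                 # best reconstruction from the eigenfaces
    return w, np.linalg.norm(img_vec - recon) # residual = distance to face space

def classify(img_vec, mean_face, U, known_w, labels, t_face, t_id):
    w, dist_to_face_space = project(img_vec, mean_face, U)
    if dist_to_face_space > t_face:
        return "not a face"                   # poorly represented by face space
    d = np.linalg.norm(known_w - w, axis=1)   # distance to each known individual in face space
    return labels[np.argmin(d)] if d.min() < t_id else "unknown face"
```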

Admittedly, many parts of the system are obsolete from today's perspective. For example, AdaBoost now seems to be the most successful approach to face detection: by using integral images to accelerate Haar-feature computation and by cascading filters, it is shown in [Viola 04] that accurate detection of faces at various sizes can be done in real time on a conventional PC. It is unlikely that correlation-based methods like subspace projection can achieve the same performance. On the other hand, many methods better suited to face identification were proposed later (e.g., Fisherfaces). The value of this paper lies in the idea of using principal components to deal with unknown variations of high-dimensional data, which certainly benefited subsequent research such as Active Appearance Models. Another vital contribution, I think, is a practical method to perform the eigenvector decomposition from a relatively small number of observations in a high-dimensional space, which is critical even today.
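
That trick amounts to eigendecomposing the small M x M matrix instead of the huge pixel-by-pixel covariance; a NumPy sketch of my own under that assumption:

```python
import numpy as np

def eigenfaces(images, k):
    """images: (M, P) matrix of M vectorized face images with P pixels, M << P."""
    mean_face = images.mean(axis=0)
    A = (images - mean_face).T               # P x M matrix of centered faces
    # Eigenvectors of the small M x M matrix A^T A ...
    vals, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(vals)[::-1][:k]       # keep the k largest eigenvalues
    # ... map back to eigenvectors of the huge P x P covariance A A^T:
    # if A^T A v = lambda v, then A A^T (A v) = lambda (A v).
    U = A @ V[:, order]
    U /= np.linalg.norm(U, axis=0)           # normalize each eigenface
    return mean_face, U
```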

Sunday, March 2, 2008

[Paper Review] Scale & Affine Invariant Interest Point Detectors

This paper proposes the Harris-Affine detector, which is intended to work under substantial affine image transformations (e.g., > 40 degrees of viewpoint change). We should note that even though it is claimed to be fully affine invariant, the detector still contains a multi-scale Harris part that is only similarity invariant. This is common to much related work because of the computational cost.

The Harris-Affine detector first uses the Harris detector to locate candidate interest points in scale space. The points found then undergo an iterative refinement process to compute accurate shape, scale, and location. At each iteration, the algorithm selects a new scale according to the newly refined point location; the scale is chosen with respect to the LoG response and the resulting isotropy of the second moment matrix. It then searches for the maximum Harris response at this scale to get the new point location. The patch around the new location at the new scale is then used to compute the next second moment matrix and to refine the local straightening transformation. These steps make up one complete iteration, and the process stops when nearly perfect isotropy is achieved, often within 10 iterations.
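
My rough reading of this loop, in pseudocode-like Python; every helper function here is a hypothetical placeholder for a step described in the paper, not a real library API:

```python
import numpy as np

def harris_affine_adapt(image, x, scale, max_iter=10, isotropy_eps=0.05):
    """Iteratively refine the location, scale, and shape of one Harris point."""
    U = np.eye(2)                                   # accumulated shape-normalizing (straightening) transform
    for _ in range(max_iter):
        patch = warp_patch(image, x, scale, U)      # hypothetical: straighten the local neighborhood with U
        scale = select_scale(patch, scale)          # hypothetical: scale from LoG response / isotropy criterion
        x = max_harris_response(patch, x, scale)    # hypothetical: relocate point to the Harris maximum
        mu = second_moment_matrix(patch, x, scale)  # hypothetical: second moment matrix at the refined point
        U = normalize_shape(mu) @ U                 # hypothetical: update U with mu^(-1/2)
        if isotropy(mu) > 1.0 - isotropy_eps:       # eigenvalue ratio of mu close to 1: converged
            break
    return x, scale, U
```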

The experiments show a great improvement over isotropic detectors such as LoG, DoG, or Harris-Laplace under large viewpoint differences. However, they also expose one problem: with better resistance to affine transformations, the detector ends up with less discriminative regions once most of the distortion has been normalized away. Another problem is the runtime: in the experiments the Harris-Affine detector takes tens of times longer than the fastest detectors, which is certainly an issue if the extra time is not worth it. On this point, I think a good option is to combine several detectors so that we can take the merits of all of them. For example, we could use SIFT to find reliable matching images in a database and then apply the Harris-Affine detector to estimate more precise geometric relations. Such a hybrid approach should be more useful than applying any single detector alone.

[Paper Review] Distinctive Image Features from Scale-Invariant Keypoints

This paper introduces the celebrated SIFT feature in great detail and has received nearly 2000 citations on Google Scholar. SIFT is successful because of its general applicability to almost every kind of image distortion with good enough performance. Therefore, in many cases when people need features to match, they choose SIFT.

The name "SIFT" indeed consists of two parts: the feature detector and the descriptor. The SIFT detector checks local extrema over the scale space in the DoG pyramid, while rejecting points with low contrasts or high edge responses. The remaining points' locations are then refined using parabola approximation. The SIFT descriptor then computes the principal gradient orientations over the nearby patches of the points to obtain a local rotation-invariant coordinate system. Finally, a grid-shape gradient orientation histogram is overlaid to describe each feature point as a 128-dim vector.

Although the concept is not much harder than that of comparable methods, I have to say that a reliable implementation of SIFT is still a complicated piece of work. I wrote my own copy of SIFT a year ago for my own needs, and even recently I could still find a bug in it. Here is a snapshot of SIFT feature matching; you can see that SIFT is really robust to a variety of distortions:



SIFT is not perfect, though. Theoretically, DoG extrema are not invariant to affine transformations, so we cannot expect SIFT to work under large viewpoint changes. Nevertheless, it has been shown that in most cases SIFT still outperforms other (nearly) affine-invariant feature detectors. I suspect the reasons are that uniformly scaled patches are more distinctive than stretched ones (at the cost of some robustness to affine transformations), and that the design of the SIFT descriptor tolerates some mis-registration of features. This deserves further study.

The SIFT descriptor has also been challenged by recent research. It is now known that GLOH (Gradient Location and Orientation Histogram) and steerable filter responses are better descriptors for matching against large databases. Automatic learning of feature descriptors will certainly be the next mainstream direction in feature matching.