Tuesday, April 29, 2008

[Paper Review] Combining Labeled and Unlabeled Data with Co-Training

This paper proposed the idea of co-training: use inexpensive unlabeled data to improve initially weak classifiers. The weak classifiers are first trained on a few labeled examples to achieve coarse discrimination. The algorithm then repeatedly selects instances from the unlabeled data pool, labels them, and re-trains itself on these newly labeled examples together with the ones it labeled before. This approach can dramatically improve classifier performance, since the classifier is learned on a much larger set of "training data".

 

For this approach to work, each example must contain two kinds of (but not limited to two) independent cues for classification. Furthermore, the authors assume that both cues are consistent in their decisions: taking f1 and f2 as the classifiers for cues x1 and x2, f1(x1) should equal f2(x2) with probability close to 1. These two conditions allow cues of one kind to be connected (and thus strengthened) by "bridging" through the other kind. For example, suppose observation 1 contains features A and B, and observation 2 contains features C and D, in the first and second kind respectively. Given a new observation with features C and B, we may assign all three observations to the same class, since we believe both cues are in agreement. This forms the basis of co-training.
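The loop described above fits in a few lines. The following is a minimal toy sketch, not the paper's exact algorithm: the two "views" are single real-valued features, the weak learner is a plain threshold, and pool examples are moved to the labeled set only when the two view-classifiers agree, which is a crude stand-in for the confidence-based selection used in the paper. All names and data here are illustrative:

```python
def train_threshold(xs, ys):
    """Weak learner: pick the 1-D threshold minimizing training error
    (predict class 1 above the threshold, class 0 below)."""
    best_t, best_err = xs[0], len(xs) + 1
    for t in xs:
        err = sum(int((x > t) != y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return lambda x: int(x > best_t)

def co_train(labeled, unlabeled, rounds=5):
    """labeled: [((x1, x2), y)]; unlabeled: [(x1, x2)].
    Returns the two per-view classifiers (f1, f2)."""
    labeled, pool = list(labeled), list(unlabeled)
    f1 = f2 = None
    for _ in range(rounds):
        # Train one weak classifier per view on everything labeled so far.
        f1 = train_threshold([v[0] for v, _ in labeled],
                             [y for _, y in labeled])
        f2 = train_threshold([v[1] for v, _ in labeled],
                             [y for _, y in labeled])
        # Move pool examples on which the two views agree into the
        # labeled set; the "agreement" is the consistency assumption.
        agreed = [(v, f1(v[0])) for v in pool if f1(v[0]) == f2(v[1])]
        pool = [v for v in pool if f1(v[0]) != f2(v[1])]
        if not agreed:
            break
        labeled.extend(agreed)
    return f1, f2
```

In the paper the weak learners are naive Bayes classifiers over two disjoint word sets of a web page; the threshold stumps here just keep the sketch self-contained.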

 

The experiments on real data in the paper did support the idea of co-training. Nevertheless, I think one question here is whether we should leave all the work to the learning algorithm. If we do get two kinds of features for each example, why don't we first perform dimension reduction, such as LDA, so that examples of the same class cluster together? In this way, the so-called connected-component sub-structure might be discovered before any learning algorithm is run. I would like to know if anyone has done research on such topics.

Monday, April 14, 2008

[Paper Review] Improved Boosting Algorithms Using Confidence-rated Predictions

This paper is a thorough re-examination of the original AdaBoost algorithm. It discusses the algorithm's behavior from many angles, gives theoretical bounds, and proposes several improvements. The authors also validate these new ideas with experiments on real data. The new, improved AdaBoost is more convenient because weak learners can output real-valued (confidence-rated) predictions.

 

The most attractive feature of AdaBoost is its simplicity. Given an almost arbitrary configurable classifier, one can enhance its performance considerably by plainly following AdaBoost's simple steps. The original classifier can be very crude, perhaps only slightly better than random guessing. During training, AdaBoost forces the weak classifier to "learn" more from the examples it failed on before, thus increasing the accuracy of the combined classifier. The training error of AdaBoost approaches zero at a rate exponential in the number of iterations. The final classifier is a linear combination of all the classifiers produced in the process. This voting structure is a compromise among classifiers, each of which looks at a small portion of the high-dimensional instance space. For multi-label or multi-class problems, one can map them to the binary case by forming "instance-label" or "instance-class" tuples and testing their validity.
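Those simple steps can be written down directly. Below is a minimal sketch of the original (discrete) AdaBoost with 1-D threshold stumps as the weak learner; the stump learner, the data, and the round count are illustrative choices of mine, not anything from the paper:

```python
import math

def stump(xs, ys, ws):
    """Weak learner: weighted-error-minimizing threshold stump on 1-D data."""
    best = None
    for t in xs:
        for s in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if s * (1 if x > t else -1) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    err, t, s = best
    return (lambda x: s * (1 if x > t else -1)), err

def adaboost(xs, ys, rounds=10):
    """ys in {-1, +1}; returns H(x) = sign(sum_t alpha_t * h_t(x))."""
    ws = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        h, err = stump(xs, ys, ws)
        err = max(err, 1e-10)   # guard against log(0) on a perfect stump
        if err >= 0.5:          # weak learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Re-weight: mistakes get heavier, so the next stump must
        # "learn" more from the examples this one failed on.
        ws = [w * math.exp(-alpha * y * h(x)) for x, y, w in zip(xs, ys, ws)]
        z = sum(ws)
        ws = [w / z for w in ws]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

Note how the voted ensemble can represent concepts no single stump can, e.g. an interval on the line, after only a few rounds.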

 

One obvious weakness of AdaBoost is that it neglects the interactions among individual classifiers. It may well be the case that a prediction can be made with high confidence when two or three specific classifiers say "yes" at the same time. The linear combination rule of AdaBoost puts this burden on the adaptability of the weak learner, which may not always be a good idea. Another possible extension of AdaBoost would be to modify it so that inherently different kinds of classifiers can work together.

[Paper Review] Rapid Object Detection using a Boosted Cascade of Simple Features

This paper showed how to apply machine learning techniques to high-speed object detection (or, more precisely, template matching). The idea is to choose a kind of coarse feature that can be computed very efficiently, and to improve its performance through AdaBoost learning and filter cascading. The proposed system runs in real time on face detection, with accuracy comparable to the state of the art.

 

Apparently, the most important contribution of this work is extending AdaBoost so that one can choose, from a large pool of available features, a small yet equally discriminative subset. The extension is not complex, however: in each iteration you simply repeat the training for all available classifiers (features) and select the most promising one. The other two ideas of this work are more straightforward. The integral image is a synonym for an old trick in dynamic programming, and filter cascading is something anyone would likely do in a high-speed application. Nevertheless, the resulting system does demonstrate a great improvement over existing methods (in face detection).
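That dynamic-programming trick is easy to state in code: precompute cumulative sums once, and any rectangle sum afterwards costs four array lookups, which is what makes the Haar-like features so cheap. A minimal sketch on plain Python lists (a real detector would of course use actual image arrays):

```python
def integral_image(img):
    """ii[y][x] = sum of img[0..y-1][0..x-1]; padded with a zero row/col
    so no boundary cases are needed."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            # Each entry reuses the three already-computed neighbors.
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the w-by-h rectangle with top-left corner (x, y):
    exactly four lookups, independent of the rectangle's size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```

A two-rectangle Haar feature is then just `rect_sum(...) - rect_sum(...)` on adjacent rectangles, at any scale, at constant cost.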

 

One fairly vital problem ignored in the paper is object orientation. The proposed framework does not seem able to deal with rotated objects through any simple modification. Although one may blindly repeat the same detection algorithm at every possible angle, this would severely slow down the system. I think this is probably why I haven't seen anyone apply the same idea to objects other than frontal faces.

Monday, April 7, 2008

[Paper Review] Names and Faces in the News

This is an astonishing work. The paper proposed a system framework that can perform face identification and classification on large-scale databases collected from the Internet, which may contain ambiguous name labeling, a broad range of individuals, and extremely varied photo conditions. Four main parts constitute the system: a simple yet effective data acquisition method, robust face rectification, a scalable approximation of PCA, and a reliable clustering algorithm.

 

I think the most important contribution of this work is applying the Nyström approximation to the PCA problem. Although this may not be entirely novel (as seen from the references), it does show that such an approximation is practical and can work well on real data. Much follow-up research could then build on this result. The rectification step, even if not highlighted in the paper, is, I think, also vital to the success. The proposed algorithm runs quickly (important for such a large database) and provides results of sufficient quality. The clustering algorithm is somewhat straightforward, but adequate for the purpose.
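The core of the Nyström idea is simple to sketch: eigendecompose only a small sampled block of the big similarity matrix, then extend those eigenvectors to the unsampled points with one matrix product, never forming the full matrix. This is a generic sketch of the approximation, not the paper's exact pipeline; the variable names and the test matrix are mine:

```python
import numpy as np

def nystrom(A11, A21, k):
    """Approximate the top-k eigenpairs of the full similarity matrix
    [[A11, A12], [A21, A22]] using only the sampled blocks:
    A11 (n x n, sampled-vs-sampled) and A21 (m x n, rest-vs-sampled)."""
    vals, U = np.linalg.eigh(A11)        # A11 = U diag(vals) U^T
    idx = np.argsort(vals)[::-1][:k]     # keep the k largest eigenpairs
    vals, U = vals[idx], U[:, idx]
    U_rest = A21 @ U / vals              # extend eigenvectors to the rest
    return vals, np.vstack([U, U_rest])
```

When the full matrix has low rank that the sample already captures, the reconstruction is exact; in general it is only an approximation whose quality depends on the sample.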

 

One weakness of the work is that it does not provide complete retrieval experiments on the clustered data. The proposed information-theoretic measurement appears reasonable, but it does not give an intuitive feel for the data quality (I didn't fully understand it either :S). I am currently working on a problem that also deals with Internet photo collections, and the earlier parts of the paper gave me much inspiration.

[Paper Review] Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

This paper addressed the problem of image annotation by estimating a probabilistic relationship between image segments and text words. The main advantage of this approach is the ability to establish a direct correspondence between the two worlds, so that one can know which kinds of segments contribute to which words. In this way, users can retrieve images with conventional text search, without manually pre-annotating the database.

 

The probability table is estimated with an EM approach. As usual, we first extract features from image segments and perform vector quantization to represent the high-dimensional continuous pixel-domain information in a discrete space. The EM algorithm then computes the probability of a word given a blob (the quantized feature) by repeatedly deriving assignments from probabilities and re-estimating probabilities from assignments. Once we have this probability table, we can annotate images with the highest-probability words for the segments in them.
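That E/M alternation is essentially IBM Model 1 estimation from machine translation. Here is a minimal sketch under that reading, with made-up blob and word tokens (the real system works on vector-quantized segment features, and its exact model details may differ):

```python
from collections import defaultdict

def train_lexicon(pairs, iters=10):
    """pairs: list of (blobs, words), one per image.
    Returns the table p[blob][word] ~ p(word | blob)."""
    words = {w for _, ws in pairs for w in ws}
    blobs = {b for bs, _ in pairs for b in bs}
    p = {b: {w: 1.0 / len(words) for w in words} for b in blobs}
    for _ in range(iters):
        counts = {b: defaultdict(float) for b in blobs}
        for bs, ws in pairs:
            for w in ws:
                # E-step: soft-assign one count for word w across the
                # blobs of its own image, in proportion to current p.
                z = sum(p[b][w] for b in bs)
                for b in bs:
                    counts[b][w] += p[b][w] / z
        # M-step: renormalize the expected counts into probabilities.
        for b in blobs:
            total = sum(counts[b].values()) or 1.0
            p[b] = {w: counts[b][w] / total for w in words}
    return p

def annotate(p, blobs, n=2):
    """Annotate an image: top-n words by best p(word | blob) over its blobs."""
    score = {w: max(p[b][w] for b in blobs) for w in p[blobs[0]]}
    return sorted(score, key=score.get, reverse=True)[:n]
```

On co-occurrence data like this, the ambiguity resolves exactly as in the "bridging" argument: a blob that appears with several words gets pulled toward the one word it shares across images.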

 

Although the approach seems promising, it is obvious that the experimental performance is not very good. In the annotation test, almost all image retrieval queries return with a precision smaller than 0.4, and only a few words can be successfully queried when the threshold is increased. The situation is better in the correspondence test, but the prediction rates are still generally lower than 0.5. Finally, neither of the two refinements the authors proposed (thresholding and merging) improves the performance much. I believe the problem comes from faulty segmentation and imperfect features. General image segmentation is still an active research area in which no ideal solution has been found. On the other hand, some erroneous labelings appear to result from bad feature representations (e.g., classifying sky with clouds as water). The visual-word approach would not suffer from this problem (or would to a lesser extent), since it requires only two levels of abstraction (features, clustering), not three (segmentation, features, clustering).

....

Windows Live Writer ate one of my posts... @@"

 

Please be careful if you use it too... You have to open a new file before starting a new post, or WLW will replace the last one with the new post... xxx