Monday, April 7, 2008

[Paper Review] Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary

This paper addresses the problem of image annotation by estimating a probabilistic relationship between image segments and text words. The main advantage of this approach is that it establishes a direct correspondence between the two modalities, so one can tell which kinds of segments contribute to which words. In this way, users can retrieve images with conventional text search, without manual pre-annotation of the database.


The probability table is estimated with the EM algorithm. As usual, we first extract features from the image segments and perform vector quantization to represent the high-dimensional, continuous pixel-domain information in a discrete space. The EM algorithm then computes the probability of a word given a blob (a quantized feature) by alternately deriving soft assignments from the current probabilities and re-estimating the probabilities from those assignments. Once we have this probability table, we can annotate an image with the highest-probability words of the segments it contains.
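As a rough sketch of this alternation (not the authors' exact implementation), the EM loop for estimating p(word | blob) can be written in the style of IBM Model 1 machine translation. The blob token names and the toy data below are hypothetical stand-ins for the quantized segment features:

```python
from collections import defaultdict

def train_lexicon(images, n_iters=20):
    """Estimate p(word | blob) by EM, IBM Model 1 style.

    `images` is a list of (blobs, words) pairs, where `blobs` are the
    vector-quantized segment tokens of an image and `words` are its
    annotation keywords.
    """
    vocab = {w for _, words in images for w in words}
    # Start from a uniform table p(w | b) = 1 / |vocab|
    prob = defaultdict(lambda: 1.0 / len(vocab))

    for _ in range(n_iters):
        counts = defaultdict(float)   # expected co-occurrence counts c(w, b)
        totals = defaultdict(float)   # per-blob normalizers sum_w c(w, b)
        # E-step: softly assign each word to the blobs in its image,
        # in proportion to the current translation probabilities.
        for blobs, words in images:
            for w in words:
                z = sum(prob[(w, b)] for b in blobs)
                for b in blobs:
                    share = prob[(w, b)] / z
                    counts[(w, b)] += share
                    totals[b] += share
        # M-step: renormalize the expected counts into probabilities.
        prob = defaultdict(float)
        for (w, b), c in counts.items():
            prob[(w, b)] = c / totals[b]
    return prob

def annotate(blobs, prob, top_k=3):
    """Annotate an image by the highest-probability words of its blobs."""
    scores = defaultdict(float)
    for (w, b), p in prob.items():
        if b in blobs:
            scores[w] += p
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

On overlapping toy data (e.g., images tagged {sky, grass}, {sky, water}, {grass, water} with matching blob tokens), the soft assignments disambiguate quickly and each blob's distribution concentrates on its true word within a few iterations.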


Although the approach seems promising, the experimental performance is clearly not very good. In the annotation test, almost all image retrieval queries return a precision below 0.4, and only a few words can be queried successfully if the threshold is raised. The situation is better in the correspondence test, but the prediction rates are still generally below 0.5. Finally, the two refinements the authors propose (thresholding and merging) do not improve performance much either. I believe the problem comes from faulty segmentation and imperfect features. General image segmentation is still an active research area in which no ideal solution has been found. On the other hand, some erroneous labelings appear to result from poor feature representations (e.g., classifying sky with clouds as water). A visual-word approach would not suffer from this problem, or would suffer to a lesser extent, since it requires only two levels of abstraction (features, clustering) rather than three (segmentation, features, clustering).
