miércoles, 4 de agosto de 2010

Why Lucene doesn't work properly with structured documents?

Most ranking functions, use a non-linear method to saturate the computation of the frequencies. This is due to the fact that the information gained on observing a term the first time is greater than the information gained on subsequently seeing the same term.

The non-linear method can be as simple as a logarithmic or a square-root function or more complex parameter-based approaches like BM25 k1 parameter. S. Robertson* has described the dangers to combine scores from different document fields and what are the most tipical errors when ranking functions are modified to consider the structure of the documents.

A good example of these errors is Lucene library**. Apache Lucene is a high-performance and full-featured text search engine library written entirely in Java. It is a scalable technology suitable for nearly any application that requires fulltext search. Lucene offers high-performance indexing and has become one of the most used search engine libraries in both academia and industry.

The ranking function is the core of any search engine applied to determine how relevant a document is to a given query. Lucene ranking function is built on a combination of the Vector Space Model (VSM) and the Boolean model of Information
Retrieval. The main idea behind Lucene approach is the more times a query term appears in a document relative to the number of times the term appears in the whole
collection, the more relevant that document will be to the query. Lucene uses also the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the query specification. Lucene is able to deal with document structure and actually implements the necessary functionalities to index structured documents based on fields. To rank these structured documents, Lucene
combines the scores from document fields. The method used by Lucene to compute the score of an structured document is based on the linear combination of the scores for each field of the document:



where



and



As we can see, Lucene’s ranking function uses the square-root of the frequency to implement the non-linear method to saturate the computation of the frequencies, but the linear combination of the tfc(tl, d) values that Lucene implements breaks the saturation effect, since field’s boost factors wc are applied after of non-linear methods are used. The consequence is that a document matching a single query term over several fields could score much higher than a document matching several query terms in one field only, which is not a good way to compute relevance and use to hurt dramatically ranking function performance.

This problem hurt search engine performance where two or more fields are used to represent a document.

Fortunately, Rober Muir, from Lucene Commiters Team, is working on this problem, and hopefully it will be fixed soon. In the meantime, if you need index and retrieve structured documents using Lucene, you can use the BM25F implementation for Lucene that is available at http://nlp.uned.es/~jperezi/Lucene-BM25/

* Robertson, S., Zaragoza, H., and Taylor, M. 2004. Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM international Conference on information and Knowledge Management (Washington, D.C., USA, November 08 - 13, 2004). CIKM '04. ACM, New York, NY, 42-49. DOI= http://doi.acm.org/10.1145/1031171.1031181
** http://lucene.apache.org/java/docs/index.html

No hay comentarios:

Publicar un comentario