Speaker
Description
Modern day full text search relies on a simple yet robust inverted index data structure to find documents matching a text query. However, this approach is limited by the required spelling checker, which often has memory requirements comparable to the index itself. In this work, a novel approach to text indexing is introduced. Based on the navigable small world data structure, it offers a two-in-one solution for search with built-in spelling correction, sacrificing some performance for a substantial memory economy. The proposed index, just like the conventional inverted index, uses separate tokens extracted from a document as the main building blocks. The tokens are organized in a graph, each node representing one token. The nodes list all the document identifiers for the documents where the token is found. The edges of the graph connect tokens that are similar in terms of edit distance. The proposed approach utilizes an arbitrary ranking function to find relevant documents, the state-of-the-art BM25 ranking function is recommended as a reasonable default. The function values are weighed to take the edit distance into account. The ability to process misspelled words allows avoiding maintaining a separate structure for spelling correction while still generating search results which account for possible misspellings in the query phrase. The search time losses are attributed to the neighbor exploration while traversing the navigable small world data structure, and thus can be mitigated by configuring the data structure to maintain a fairly low number of connections. Such a configuration also provides a hard limit to the number of edits allowed in a single query term. The paper discusses in detail the ways of picking the optimal parameters for the algorithm, as well as the limitations of the proposed approach.
Keywords: full text search, search, navigable small world, memory efficiency, text ranking, spelling correction.