MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS

Witten, Ian H.; Nevill, Craig G.; Bell, Timothy C.

MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS

Files

1990-403-27.pdf (3.67 MB)

Date

1990-08-01

Authors

Witten, Ian H.

Nevill, Craig G.

Bell, Timothy C.

Abstract

Text compression systems operate in a stream-oriented fashion which is inappropriate for databases that need to be accessed through a variety of retrieval mechanisms. This paper develops models for full-text retrieval systems which (a) compress the main text so that it can be randomly accessed via synchronization points; (b) store the text's lexicon in a compressed form that can be efficiently searched for concordancing and decoding purposes; (c) include a lexicon of word fragments that can be used to implement retrieval based on partial word matches; and (d) store the text's concordance in highly compressed form. All compression is based on the method of arithmetic coding, in conjunction with static models, derived from the text itself. This contrasts with contemporary stream-oriented compression techniques that use adaptive models, and with database compression techniques that use ad hoc codes rather than principled models. A number of design trade-offs are identified and investigated on a 2.7 million word sample of English text. The paper is intended to assist designers of full-text retrieval systems by defining, documenting and evaluating pertinent design decisions.

Keywords

Computer Science

URI

http://hdl.handle.net/1880/46181

Collections

Science Research & Publications

Full item page