MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS
Date
1990-08-01
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Text compression systems operate in a stream-oriented fashion which is
inappropriate for databases that need to be accessed through a variety
of retrieval mechanisms. This paper develops models for full-text
retrieval systems which (a) compress the main text so that it can be
randomly accessed via synchronization points; (b) store the
text's lexicon in a compressed form that can be efficiently searched for
concordancing and decoding purposes; (c) include a lexicon of word fragments
that can be used to implement retrieval based on partial word matches; and
(d) store the text's concordance in highly compressed form. All
compression is based on the method of arithmetic coding, in conjunction with
static models, derived from the text itself. This contrasts with
contemporary stream-oriented compression techniques that use adaptive
models, and with database compression techniques that use ad hoc
codes rather than principled models.
A number of design trade-offs are identified and investigated on a 2.7
million word sample of English text. The paper is intended to assist
designers of full-text retrieval systems by defining, documenting and
evaluating pertinent design decisions.
Description
Keywords
Computer Science