MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS

Witten, Ian H.; Nevill, Craig G.; Bell, Timothy C.

dc.contributor.author	Witten, Ian H.	eng
dc.contributor.author	Nevill, Craig G.	eng
dc.contributor.author	Bell, Timothy C.	eng
dc.date.accessioned	2008-02-27T22:29:02Z
dc.date.available	2008-02-27T22:29:02Z
dc.date.computerscience	1999-05-27	eng
dc.date.issued	1990-08-01	eng
dc.description.abstract	Text compression systems operate in a stream-oriented fashion that is inappropriate for databases that need to be accessed through a variety of retrieval mechanisms. This paper develops models for full-text retrieval systems that (a) compress the main text so that it can be randomly accessed via synchronization points; (b) store the text's lexicon in a compressed form that can be efficiently searched for concordancing and decoding purposes; (c) include a lexicon of word fragments that can be used to implement retrieval based on partial word matches; and (d) store the text's concordance in highly compressed form. All compression is based on the method of arithmetic coding, in conjunction with static models derived from the text itself. This contrasts with contemporary stream-oriented compression techniques that use adaptive models, and with database compression techniques that use ad hoc codes rather than principled models. A number of design trade-offs are identified and investigated on a 2.7-million-word sample of English text. The paper is intended to assist designers of full-text retrieval systems by defining, documenting, and evaluating pertinent design decisions.	eng
dc.description.notes	We are currently acquiring citations for the work deposited into this collection. We recognize the distribution rights of this item may have been assigned to another entity, other than the author(s) of the work. If you can provide the citation for this work, or you think you own the distribution rights to this work, please contact the Institutional Repository Administrator at digitize@ucalgary.ca	eng
dc.identifier.department	1990-403-27	eng
dc.identifier.doi	http://dx.doi.org/10.11575/PRISM/31172
dc.identifier.uri	http://hdl.handle.net/1880/46181
dc.language.iso	Eng	eng
dc.publisher.corporate	University of Calgary	eng
dc.publisher.faculty	Science	eng
dc.subject	Computer Science	eng
dc.title	MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS	eng
dc.type	unknown
thesis.degree.discipline	Computer Science	eng