Browsing by Author "Bell, Timothy C."
Item Open Access: COMPRESSING THE DIGITAL LIBRARY (1994-03-01)
Bell, Timothy C.; Moffat, Alistair; Witten, Ian H.
The prospect of digital libraries presents the challenge of storing vast amounts of information efficiently and in a way that facilitates rapid search and retrieval. Storage space can be reduced by appropriate compression techniques, and searching can be enabled by constructing a full-text index. But these two requirements are in conflict: the need for decompression increases access time, and the need for an index increases space requirements. This paper resolves the conflict by showing how (a) large bodies of text can be compressed and indexed into less than half the space required by the original text alone, (b) full-text queries (Boolean or ranked) can be answered in small fractions of a second, and (c) documents can be decoded at the rate of approximately one megabyte a second. Moreover, a document database can be compressed and indexed at the rate of several hundred megabytes an hour.

Item Open Access: INDEXING AND COMPRESSING FULL-TEXT DATABASES FOR CD-ROM (1990-11-01)
Witten, Ian H.; Bell, Timothy C.; Nevill, Craig G.
CD-ROM is an attractive delivery vehicle for full-text databases. Because of its large storage capacity and low access speed, carefully designed indexing structures--including a concordance--are necessary to enable the text to be retrieved efficiently. However, the indexes are sufficiently large that they tax the ability of main store to hold them when processing queries. The use of compression techniques can substantially increase the volume of text that a disk can accommodate, and substantially decrease the amount of primary storage needed to hold the indexes. This paper describes a suitable indexing mechanism and its compression potential using modern compression methods. It is possible to double the amount of text that can be stored on a CD-ROM disk and include a full concordance and indexes as well. A single disk can accommodate around 180 million words of text--equivalent to a library of 1000-1500 books--and provide rapid response to a variety of queries involving multiple search terms and word fragments.

Item Open Access: MODELING FOR TEXT COMPRESSION (1988-10-25)
Bell, Timothy C.; Witten, Ian H.; Cleary, John G.
The best schemes for text compression employ large models to help them predict which characters will come next. The actual next characters are coded with respect to the prediction, resulting in compression of information. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling which are suitable for use in practical text compression systems. The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.
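To make the finite-context idea above concrete, here is a minimal sketch of an adaptive order-2 character model in Python. It is illustrative only, not the implementation surveyed in the report; the class name, the fixed 256-symbol alphabet, and the add-one smoothing are assumptions made for this example.

from collections import Counter, defaultdict

class Order2ContextModel:
    """Toy adaptive finite-context model: the previous two characters
    condition a frequency distribution over the next character."""

    def __init__(self):
        # context (two characters) -> Counter of next-character frequencies
        self.counts = defaultdict(Counter)

    def train(self, text):
        for i in range(2, len(text)):
            self.counts[text[i - 2:i]][text[i]] += 1

    def probability(self, context, next_char):
        seen = self.counts[context]
        total = sum(seen.values())
        # Add-one smoothing over an assumed 256-symbol alphabet so that
        # unseen characters never receive zero probability (the
        # "zero-frequency problem" discussed in the last item below).
        return (seen[next_char] + 1) / (total + 256)

model = Order2ContextModel()
model.train("the theory of the thing")
print(model.probability("th", "e"))  # relatively high: "the" is common here
print(model.probability("th", "z"))  # low: never seen after "th"

An arithmetic coder driven by such conditional probabilities assigns short codes to likely characters and longer codes to unlikely ones, which is where the compression comes from.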
Item Open Access: MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS (1990-08-01)
Witten, Ian H.; Nevill, Craig G.; Bell, Timothy C.
Text compression systems operate in a stream-oriented fashion which is inappropriate for databases that need to be accessed through a variety of retrieval mechanisms. This paper develops models for full-text retrieval systems which (a) compress the main text so that it can be randomly accessed via synchronization points; (b) store the text's lexicon in a compressed form that can be efficiently searched for concordancing and decoding purposes; (c) include a lexicon of word fragments that can be used to implement retrieval based on partial word matches; and (d) store the text's concordance in highly compressed form. All compression is based on the method of arithmetic coding, in conjunction with static models derived from the text itself. This contrasts with contemporary stream-oriented compression techniques that use adaptive models, and with database compression techniques that use ad hoc codes rather than principled models. A number of design trade-offs are identified and investigated on a 2.7-million-word sample of English text. The paper is intended to assist designers of full-text retrieval systems by defining, documenting and evaluating pertinent design decisions.

Item Open Access: SOURCE MODELS FOR NATURAL LANGUAGE (1988-10-01)
Bell, Timothy C.; Witten, Ian H.
A model of natural language is a collection of information that approximates the statistics and structure of the language being modeled. The purpose of the model may be to give insight into the rules which govern how text is generated, or to predict properties of future samples of the language. This paper studies models of natural language from three different, but related, viewpoints. First, we examine the statistical regularities that are found empirically, based on the natural units of words and letters. Second, we study theoretical models of language, including simple random generative models of letters and words whose output, like genuine natural language, obeys Zipf's law. Innovation in text is also considered by modeling the appearance of previously unseen words as a Poisson process. Finally, we review experiments that estimate the information content inherent in natural text.
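As a companion to the Zipf's-law and novel-word points above, the following sketch tabulates rank-frequency statistics for a word list (under Zipf's law the product of rank and frequency stays roughly constant) and records the positions at which new words first appear. This is an illustration, not the paper's experiment; the file name sample.txt and the simple tokenizer are assumptions.

import re
from collections import Counter

def rank_frequency(text, top=15):
    """Return (rank, word, frequency, rank*frequency) for the most common
    words; under Zipf's law the last column is roughly constant."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]

def novel_word_positions(text):
    """Positions at which a previously unseen word appears -- the kind of
    arrival sequence the abstract models as a Poisson process."""
    seen, positions = set(), []
    for i, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        if word not in seen:
            seen.add(word)
            positions.append(i)
    return positions

# sample.txt is a placeholder for any reasonably large English text.
text = open("sample.txt", encoding="utf-8").read()
for rank, word, freq, product in rank_frequency(text):
    print(f"{rank:4d}  {word:12s}  {freq:8d}  rank*freq={product}")
print("first few novel-word positions:", novel_word_positions(text)[:10])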
Item Open Access: TEXTUAL IMAGE COMPRESSION (1991-11-01)
Witten, Ian H.; Bell, Timothy C.; Harrison, Mary-Ellen; James, Mark L.; Moffat, Alistair
We describe a method for lossless compression of images that contain predominantly typed or typeset text--we call these textual images. They are commonly found in facsimile documents, where a typed page is scanned and transmitted as an image. Another increasingly popular application is document archiving, where documents are scanned by a computer and stored electronically for later retrieval. Our project was motivated by such an application: Trinity College in Dublin, Ireland, is archiving its 1872 printed library catalogues onto disk, and in order to preserve the exact form of the original document, pages are being stored as scanned images rather than being converted to text. Our test images are taken from this catalogue (one is shown in Figure 1). These beautifully typeset documents have a rather old-fashioned look, and contain a wide variety of symbols from several different typefaces--the five test images we used contain text in English, Flemish, Latin and Greek, and include italics and small capitals as well as roman letters. The catalogue also contains Hebrew, Syriac, and Russian text. The best lossless compression methods for both text and images base their coding on "contexts"--a symbol is coded with regard to adjacent ones. However, the contexts used for coding text usually extend over significantly more characters than those used in images. In text compression, the best methods make predictions based on up to three or four characters, while with black-and-white images, the most effective contexts tend to have a radius of just a few pixels. One possibility for textual image compression is to perform optical character recognition (OCR) on the text, and only transmit (or store) the ASCII (or equivalent) codes for the characters, along with some information about their position on the page. There are several problems with this. Considerable computing power is required to recognize characters accurately, and even then it is not completely reliable, particularly if unusual fonts, foreign languages or mathematical expressions are being scanned. OCR systems can require "training" to learn a new font, and an operator may have to adjust parameters such as the contrast of the scan to ensure that errors are corrected and small marks are removed from the page. Ironically, although the image may look better, it is actually noisier, because it does not faithfully represent the original image. Smudged or badly printed characters are replaced with what the OCR system has interpreted them as, rather than leaving human viewers to make their own interpretation. Dirt or ink-stains, which may have given valuable clues to a researcher, are lost. Even the typeface may not be reproduced accurately, affecting the look of the document. For typed business letters, this sort of "noise" may be acceptable, even desirable, but for archives where the interests of future readers are unknown, there is a strong motivation to record the document as faithfully as possible. The compression methods investigated here are noiseless, so the original document can be reproduced exactly from its compressed form. This is done by attempting to separate the text and noise in the document. The two components are then compressed independently using a method appropriate for each.
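The remark that image contexts have a radius of only a few pixels can be illustrated with a small sketch: each pixel is predicted from a causal template of already-coded neighbours, and the per-context counts would drive an arithmetic coder. The template, the count seeding, and the function names below are assumptions made for the example; they are not the templates or the text/noise separation used in the report.

from collections import defaultdict

# Illustrative causal template: pixels to the left of and above the coded pixel.
TEMPLATE = [(-1, 0), (-2, 0), (-1, -1), (0, -1), (1, -1), (0, -2)]

def context_of(image, x, y):
    """Pack the template pixels around (x, y) into a small integer context.
    Out-of-bounds positions are treated as white (0)."""
    ctx = 0
    for dx, dy in TEMPLATE:
        px, py = x + dx, y + dy
        inside = 0 <= py < len(image) and 0 <= px < len(image[0])
        ctx = (ctx << 1) | (image[py][px] if inside else 0)
    return ctx

def gather_counts(image):
    """For every context, count how often the coded pixel is white or black.
    An arithmetic coder would turn these counts into code lengths."""
    counts = defaultdict(lambda: [1, 1])  # [white, black], seeded to avoid zeros
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            counts[context_of(image, x, y)][pixel] += 1
    return counts

# Tiny bilevel image: 0 = white, 1 = black.
img = [
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
for ctx, (white, black) in sorted(gather_counts(img).items()):
    print(f"context {ctx:06b}: P(black) ~ {black / (white + black):.2f}")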
Item Open Access: THE ZERO FREQUENCY PROBLEM: ESTIMATING THE PROBABILITIES OF NOVEL EVENTS IN ADAPTIVE TEXT COMPRESSION (1989-04-01)
Witten, Ian H.; Bell, Timothy C.
The zero-frequency problem is the problem of estimating the likelihood of a novel event occurring. It is important in adaptive statistical text compression because it is almost always necessary to reserve a small part of the code space for the unexpected (say, the appearance of a new word); the alternative of allocating code space to every possible event (say, a code for each ASCII character) invariably impairs coding efficiency since not all possible events actually occur. This paper reviews approaches that have been taken to the problem in adaptive text compression. Although several methods have been used, their suitability has been based on empirical evaluation rather than a well-founded model. We propose the application of a Poisson process model of novelty. Its ability to predict novel tokens is evaluated, and it consistently outperforms existing methods. It is also applied to a practical statistical coding scheme, where a slight modification is required to avoid divergence. The result is a well-founded zero-frequency model that explains observed differences in the performance of existing methods, and offers a small improvement in the coding efficiency of text compression over the best method previously known.
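To ground the idea of reserving code space for the unexpected, the sketch below compares two simple escape-probability estimators with the novelty rate actually observed in a token stream. The estimators shown are the ones commonly labelled methods A and C in the text-compression literature; treat the formulas and names as assumptions of this example rather than a statement of the report's proposed Poisson model.

from collections import Counter

def escape_prob_method_a(counts):
    """Method A: reserve one extra count for a novel symbol, P_esc = 1 / (n + 1)."""
    n = sum(counts.values())
    return 1 / (n + 1)

def escape_prob_method_c(counts):
    """Method C: weight novelty by the number t of distinct symbols seen so far,
    P_esc = t / (n + t)."""
    n = sum(counts.values())
    t = len(counts)
    return t / (n + t) if n else 1.0

def observed_novelty_rate(tokens):
    """Fraction of positions at which a previously unseen token appears --
    the quantity an escape estimator is trying to predict."""
    seen, novel = set(), 0
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            novel += 1
    return novel / len(tokens)

tokens = "to be or not to be that is the question".split()
counts = Counter(tokens)
print("method A escape probability:", escape_prob_method_a(counts))
print("method C escape probability:", escape_prob_method_c(counts))
print("observed novelty rate      :", observed_novelty_rate(tokens))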