Browsing by Author "Moffat, Alistair"
Item Open Access
COMPRESSING THE DIGITAL LIBRARY (1994-03-01)
Bell, Timothy C.; Moffat, Alistair; Witten, Ian H.
The prospect of digital libraries presents the challenge of storing vast amounts of information efficiently and in a way that facilitates rapid search and retrieval. Storage space can be reduced by appropriate compression techniques, and searching can be enabled by constructing a full-text index. But these two requirements are in conflict: the need for decompression increases access time, and the need for an index increases space requirements. This paper resolves the conflict by showing how (a) large bodies of text can be compressed and indexed into less than half the space required by the original text alone, (b) full-text queries (Boolean or ranked) can be answered in small fractions of a second, and (c) documents can be decoded at the rate of approximately one megabyte a second. Moreover, a document database can be compressed and indexed at the rate of several hundred megabytes an hour.

Item Open Access
TEXTUAL IMAGE COMPRESSION (1991-11-01)
Witten, Ian H.; Bell, Timothy C.; Harrison, Mary-Ellen; James, Mark L.; Moffat, Alistair
We describe a method for lossless compression of images that contain predominantly typed or typeset text--we call these textual images. They are commonly found in facsimile documents, where a typed page is scanned and transmitted as an image. Another increasingly popular application is document archiving, where documents are scanned by a computer and stored electronically for later retrieval. Our project was motivated by such an application: Trinity College in Dublin, Ireland, are archiving their 1872 printed library catalogues onto disk, and in order to preserve the exact form of the original document, pages are being stored as scanned images rather than being converted to text. Our test images are taken from this catalogue (one is shown in Figure 1).
These beautifully typeset documents have a rather old-fashioned look, and contain a wide variety of symbols from several different typefaces--the five test images we used contain text in English, Flemish, Latin and Greek, and include italics and small capitals as well as roman letters. The catalogue also contains Hebrew, Syriac, and Russian text.
The best lossless compression methods for both text and images base their coding on "contexts"--a symbol is coded with regard to adjacent ones. However, the contexts used for coding text usually extend over significantly more characters than those used in images. In text compression, the best methods make predictions based on up to three or four characters, while with black-white images, the most effective contexts tend to have a radius of just a few pixels.
One possibility for textual image compression is to perform optical character recognition (OCR) on the text, and only transmit (or store) the ASCII (or equivalent) codes for the characters, along with some information about their position on the page. There are several problems with this. Considerable computing power is required to recognize characters accurately, and even then it is not completely reliable, particularly if unusual fonts, foreign languages or mathematical expressions are being scanned. OCR systems can require "training" to learn a new font, and an operator may have to adjust parameters such as the contrast of the scan to ensure that errors are corrected and small marks are removed from the page. Ironically, although the image may look better, it is actually noisier, because it does not faithfully represent the original image. Smudged or badly printed characters are replaced with what the OCR system has interpreted them as, rather than leaving human viewers to make their own interpretation. Dirt or ink-stains, which may have given valuable clues to a researcher, are lost.
Even the typeface may not be reproduced accurately, affecting the look of the document. For typed business letters, this sort of "noise" may be acceptable, even desirable, but for archives where the interests of future readers are unknown, there is a strong motivation to record the document as faithfully as possible. The compression methods investigated here are noiseless, so the original document can be reproduced exactly from its compressed form. This is done by attempting to separate the text and noise in the document. The two components are then compressed independently using a method appropriate for each.
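The abstract does not say how the text and noise are separated, so the following is only a minimal sketch of one way such a split can stay lossless. It assumes the "text" component is a list of mark placements drawn from a small library of bitmaps, and the "noise" component is the pixel-wise XOR between the original page and the page rendered from those marks. The names `render_marks`, `split_page`, `reconstruct_page`, and the sample data are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]            # rows of 0/1 pixels, 1 = black
Placement = Tuple[str, int, int]    # (mark id, top row, left column)


def render_marks(height: int, width: int,
                 placements: List[Placement],
                 library: Dict[str, Bitmap]) -> Bitmap:
    """Draw every placed mark onto a blank page (the assumed text layer)."""
    page = [[0] * width for _ in range(height)]
    for mark_id, top, left in placements:
        for r, row in enumerate(library[mark_id]):
            for c, pixel in enumerate(row):
                inside = 0 <= top + r < height and 0 <= left + c < width
                if pixel and inside:
                    page[top + r][left + c] = 1
    return page


def xor_pages(a: Bitmap, b: Bitmap) -> Bitmap:
    """Pixel-wise XOR of two equally sized bitmaps."""
    return [[x ^ y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_page(original: Bitmap, placements: List[Placement],
               library: Dict[str, Bitmap]) -> Bitmap:
    """Return the residue: every pixel the text layer fails to explain."""
    text_layer = render_marks(len(original), len(original[0]),
                              placements, library)
    return xor_pages(original, text_layer)


def reconstruct_page(residue: Bitmap, placements: List[Placement],
                     library: Dict[str, Bitmap]) -> Bitmap:
    """Re-render the text layer and XOR the residue back in."""
    text_layer = render_marks(len(residue), len(residue[0]),
                              placements, library)
    return xor_pages(residue, text_layer)


# Hypothetical data: two copies of a 3-pixel vertical stroke plus one smudge.
library = {"I": [[1], [1], [1]]}
placements = [("I", 0, 1), ("I", 0, 3)]
original = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],   # the pixel at column 4 is the smudge
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]

residue = split_page(original, placements, library)
restored = reconstruct_page(residue, placements, library)
```

Because XOR is its own inverse, the page is recovered exactly from the placements plus the residue, and each of the two parts could then be passed to whichever coder suits it. In this toy case the residue holds only the single smudge pixel.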