TEXTUAL IMAGE COMPRESSION
Date
1991-11-01
Abstract
We describe a method for lossless compression of images
that contain predominantly typed or typeset text--we call these
textual images. They are commonly found in facsimile documents,
where a typed page is scanned and transmitted as an image.
Another increasingly popular application is document archiving,
where documents are scanned by a computer and stored electronically
for later retrieval.
Our project was motivated by such an application: Trinity College in
Dublin, Ireland, are archiving their 1872 printed library catalogues
onto disk, and in order to preserve the exact form of the original
document, pages are being stored as scanned images rather than being
converted to text. Our test images are taken from this catalogue (one
is shown in Figure 1). These beautifully typeset documents have a rather
old-fashioned look, and contain a wide variety of symbols from several
different typefaces--the five test images we used contain text in English,
Flemish, Latin and Greek, and include italics and small capitals as well
as roman letters. The catalogue also contains Hebrew, Syriac, and Russian
text.
The best lossless compression methods for both text and images base their
coding on "contexts"--a symbol is coded with regard to adjacent ones.
However, the contexts used for coding text usually extend over significantly
more characters than those used in images. In text compression, the best
methods make predictions based on up to three or four characters, while
with black-and-white images, the most effective contexts tend to have a radius
of just a few pixels.
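To make the idea of a pixel context concrete, the following is a minimal sketch, not the coder used in this work. It assumes a hypothetical five-pixel template of already-seen neighbours (TEMPLATE), simple Laplace-smoothed adaptive counts, and an ideal arithmetic coder, and it reports the number of bits such a model would need for a bi-level image stored as lists of 0/1 values.

```python
import math
from collections import defaultdict

# Hypothetical template (an assumption, not taken from the paper):
# offsets (dy, dx) of neighbours that precede the current pixel in raster order.
TEMPLATE = [(0, -1), (0, -2), (-1, -1), (-1, 0), (-1, 1)]


def context_of(image, y, x):
    """Pack the template pixels around (y, x) into one integer context."""
    ctx = 0
    for dy, dx in TEMPLATE:
        yy, xx = y + dy, x + dx
        inside = 0 <= yy < len(image) and 0 <= xx < len(image[0])
        # Pixels outside the image are treated as white (0).
        ctx = (ctx << 1) | (image[yy][xx] if inside else 0)
    return ctx


def adaptive_code_length(image, use_context=True):
    """Ideal code length in bits when each pixel is predicted from its context."""
    counts = defaultdict(lambda: [1, 1])  # smoothed [zeros, ones] per context
    bits = 0.0
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            key = context_of(image, y, x) if use_context else 0
            c = counts[key]
            bits -= math.log2(c[pixel] / (c[0] + c[1]))
            c[pixel] += 1  # update only after coding, so a decoder can follow
    return bits


# A regular striped pattern: easy to predict from nearby pixels.
stripes = [[(x // 2) % 2 for x in range(32)] for _ in range(32)]
with_context = adaptive_code_length(stripes, use_context=True)
without_context = adaptive_code_length(stripes, use_context=False)
```

On the striped test pattern, the context model needs far fewer bits than a model that ignores neighbouring pixels, which is the effect the paragraph above describes. A text model works the same way, except that the context is the previous few characters rather than a neighbourhood of pixels.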
One possibility for textual image compression is to perform optical character
recognition (OCR) on the text, and only transmit (or store) the ASCII (or
equivalent) codes for the characters, along with some information about their
position on the page. There are several problems with this. Considerable
computing power is required to recognize characters accurately, and even then
it is not completely reliable, particularly if unusual fonts, foreign languages
or mathematical expressions are being scanned. OCR systems can require
"training" to learn a new font, and an operator may have to adjust parameters
such as the contrast of the scan to ensure that errors are corrected and small
marks are removed from the page. Ironically, although the image may look
better, it is actually noisier, because it does not faithfully represent
the original image. Smudged or badly printed characters are replaced with
what the OCR system has interpreted them as, rather than leaving human viewers
to make their own interpretation. Dirt or ink-stains, which may have given
valuable clues to a researcher, are lost. Even the typeface may not be
reproduced accurately, affecting the look of the document. For typed business
letters, this sort of "noise" may be acceptable, even desirable, but for
archives where the interests of future readers are unknown, there is a strong
motivation to record the document as faithfully as possible.
The compression methods investigated here are noiseless, so the original
document can be reproduced exactly from its compressed form. This is done
by attempting to separate the text and noise in the document. The two
components are then compressed independently using a method appropriate
for each.
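The abstract does not say how the two components are represented, so the sketch below is only an illustration of why such a split can remain lossless. It assumes, hypothetically, that the text component can be rendered as a clean bi-level image (approximation) and that the noise is kept as the pixelwise exclusive-or between that image and the original; the function names are ours.

```python
def split_residue(original, approximation):
    """Return the pixels where the clean text image differs from the scan."""
    return [
        [o ^ a for o, a in zip(orig_row, approx_row)]
        for orig_row, approx_row in zip(original, approximation)
    ]


def recombine(approximation, residue):
    """Rebuild the scan exactly from the clean text image and the residue."""
    return [
        [a ^ r for a, r in zip(approx_row, res_row)]
        for approx_row, res_row in zip(approximation, residue)
    ]


# Toy example: the scan has one stray black pixel that the clean image lacks.
scan = [[0, 1, 1, 0],
        [0, 1, 0, 1],
        [0, 1, 1, 0]]
clean = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 1, 1, 0]]
noise = split_residue(scan, clean)
restored = recombine(clean, noise)
```

Because exclusive-or is its own inverse, restored equals scan bit for bit, whatever the quality of the clean image. A better text component simply leaves a sparser residue, which is cheaper to code.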
Keywords
Computer Science