How to Improve the Quality of Files Containing Scanned Text, Before Conversion to ASCII Text

I am offering these tips because people often send me JPG files of newspaper clippings.  Such files are often time-consuming to handle or process.  I virtually never pass JPG or other graphics files on to anyone else, for the simple reason that they use up time and money in proportion to how many people receive them.
    I don't like receiving them, because they tie up my telephone line, sometimes for a half hour or longer, when I download them.
    However, if the information contained in them is worth passing on, I'll attempt to convert the graphics files to text, and pass on the information in text format.  
   Always keep in mind that work you don't want to do will have to be done by others to whom you send those files.  If you want to save yourself an hour of work because you don't want to convert a JPG file containing text to text format, then everyone you send that file to will have to do an hour of work in consequence of that, or he'll junk the file.  Most likely he'll not pass the file on to anyone else.
    The problem is that just about any newspaper clipping has to be sent as a JPG file of about 700kB if the recipient is to avoid a heck of a lot of typing to turn the graphics file into text.  Anything below that size will not enable the optical character recognition (OCR) software used for the conversion to do its job.  At best, it will do its job only badly.  However, to pass on such large files is an imposition on the recipients.  You should not do it.  You should do the conversion to text yourself.  Text files are vastly smaller than graphics files.
    Don't send large JPG files to people without asking them first whether they want to receive them.
    Here are some more detailed pointers.

Don't make the files too big.  Many people have only dial-up modems, and sometimes it takes a terribly long time to download large JPG files because of their excessive size.
    On the other hand, don't make the resolution for scanning too low.   For graphic-image-to-text conversion, the quality of the image must determine the file size; the file size must not dictate the quality of the image.  The reason is that JPG files can be imported into optical character recognition software.  If the resolution is too low because the file size was kept too small, the OCR software can't interpret the shapes and translate them into characters.
    And, by the way, the reason I mention JPG files is that that is what your scanner will produce if it is set up right.  Don't send TIF, bitmap (BMP) or GIF files.   The first two will be too large because they are not compressed, and the last will be indigestible by the OCR software.
    Of course, the reason why you want to send the clippings is that you find it to be too much work to translate them to text yourself.  That may be because you don't know how to use OCR software (it usually comes with your scanner), or because you don't have any OCR software (it is available for free on the Net). 
    Still, if you don't give somebody else enough to work with, they can't have their software do the translating, and, given that few people have the time to spend on re-typing newspaper articles, they might just junk the files you sent, because they don't want to be so inconsiderate as to pass sub-standard files on to other people.
   JPG files are not the best means of transmitting text; text files are.   Zipped text files are even better.  Zipping (compressing) a document that contains a large amount of formatting will reduce its size substantially, by a factor of about three.  However, a text file of a page or even a few pages is usually so small already that it is hardly worth the effort to zip it.   For files that contain documents of more than 50 kB in size, you might consider zipping them before you send them out, especially word processor files.  Those will be substantially reduced in size by zipping.
    ZIP utilities are available for free.  It used to be easy to get them.  They were relatively easy to handle when people still had some DOS skills, but now everybody uses graphical user interfaces (e.g., Windows) and few people know how to use DOS.   I'm not sure whether ZIP utilities for Windows are available for free.  I haven't looked for a while.  The time spent finding a free one may cost you more than buying WinZip for US$29 or so.
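    If you happen to have Python installed, its standard library can stand in for a separate ZIP utility.  The following is only a minimal sketch: the function name `zip_text` and the sample text are made up for illustration, and repeated sample text like this compresses far better than a real article would.

```python
import io
import zipfile


def zip_text(name: str, text: str) -> bytes:
    """Return a ZIP archive (as bytes) that holds one text file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, text.encode("utf-8"))
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample: one sentence repeated to imitate a longer document.
    sample = "The quick brown fox jumps over the lazy dog. " * 400
    archive_bytes = zip_text("article.txt", sample)
    print("plain text:", len(sample.encode("utf-8")), "bytes")
    print("zipped:    ", len(archive_bytes), "bytes")
```

    To write the archive to disk instead, save the returned bytes to a file whose name ends in .zip.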
    About five full pages of text will require no more than a single file 9kB in size, whereas if you scan the five pages and send them off as individual JPG files, there'll be five of them, each about 60 to 100kB in size and therefore of a quality too poor for OCR software to translate them to text.   Keep the JPG files to yourself, convert their contents to text, and you'll create a lot of good will instead of aggravation.
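    To put rough numbers on the difference, here is a small sketch.  The file sizes are the ones quoted in this article; the 56 kbit/s modem speed is my assumption for illustration only, and real dial-up connections usually run slower than their rated speed, so treat the results as best cases.

```python
def download_seconds(size_kb: float, modem_kbit_per_s: float = 56.0) -> float:
    """Idealized transfer time for a file of size_kb kilobytes.

    Assumes 1 kB = 8 kbit and a line running at its full rated speed
    (an assumption; real connections are slower).
    """
    return size_kb * 8.0 / modem_kbit_per_s


if __name__ == "__main__":
    files = {
        "five pages as one text file (9 kB)": 9,
        "five low-quality JPG scans (5 x 100 kB)": 5 * 100,
        "one OCR-quality JPG scan (700 kB)": 700,
    }
    for label, size_kb in files.items():
        print(f"{label}: about {download_seconds(size_kb):.0f} s")
```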

Here are some tips on how to get the best trade-off between file size and the quality of the text the files contain.  The tips apply whether you want to create files for yourself or for others:

  • Make sure to use colour scanning, not black/white scanning. The files may then be a bit larger (not very likely with newspaper articles), but they will be easier for the OCR software to interpret.

  • Try to adjust the contrast. Black or dark-grey on grey background doesn't work well for OCR software. Your scanner should have some features that allow you to adjust the quality of scanning in that regard.

  • Some scanners will have an option that allows you to turn on the OCR software for a scan, before you make the scan.  The results of that will most likely not be as good as they are when you make a JPG colour scan first and then turn on the OCR software to convert the image to text afterwards.

  • Very Important: Make sure the articles are vertically aligned, precisely!
        If necessary, use a little bit of masking tape to fix them in place before closing the lid on the scanner.
        Better yet, with single and crumpled pieces of newspaper clippings, it is a good idea to lay a book on top of the pieces, so as to weigh them down and to largely eliminate the crinkles.
        If the document you are scanning is not correctly aligned, vertically, the OCR software will not be very successful in interpreting the shapes of the characters.  Some scanning software has an option that will straighten out the scan before the OCR conversion, but by doing so some of the detail of the scan will be lost.   The edges of the characters will become a bit distorted, and the results of translating the shapes of the characters into ASCII code will be less than optimal.

By following those tips you'll find that the quality of the scans, especially as far as interpretability by OCR software is concerned, will improve by more than a factor of two.
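
The contrast tip can be pictured with a little arithmetic.  The sketch below is not what any particular scanner does; it shows a plain linear contrast stretch on a handful of made-up grey values (0 = black, 255 = white), and the function name `stretch_contrast` is invented for the example.

```python
from typing import List


def stretch_contrast(pixels: List[int]) -> List[int]:
    """Linearly rescale grey values so the darkest becomes 0 and the lightest 255."""
    low, high = min(pixels), max(pixels)
    if high == low:
        # A completely flat image has no contrast to stretch.
        return list(pixels)
    scale = 255 / (high - low)
    return [round((value - low) * scale) for value in pixels]


if __name__ == "__main__":
    # Made-up sample: dark-grey print (about 90) on a grey background (about 160).
    row = [160, 158, 92, 90, 95, 159, 160]
    print(stretch_contrast(row))
```

After stretching, the print sits near black and the background near white, which is the kind of separation OCR software needs.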

The table shown farther down contains the results of some tests I ran with a newspaper article, using different scanning options.  As you can see from that, big file-size advantages will be gained by converting scans to plain text files.   Unless you do a lot of scanning, it will not be worth your effort to experiment with different combinations of settings. 
    The results vary with the quality of the original document, and the format of a document doesn't need to be reproduced exactly.  It's a lot of work to do that and not worth your time.  It's the information content of an article, not the format of an article, that counts.
    As a rule, the best results are probably obtained if you set your scan for colour (RGB or True Colour) and the resolution at 200 dpi or 300 dpi (dots per inch).   Turn on your OCR software to interpret the scan and have the image converted to ASCII text, and then save the results as a plain text file (use something like Notepad).
    The RTF files in the examples are mostly so big because the header of the article I used could not be recognized as text.  The scanner defaulted that portion of the document to a graphic, which is shown in the resulting RTF document.   Little is to be gained by that.
    In a plain text file, those portions of the document will be missing.   Therefore make sure that you type in the headlines as required.  In addition, you must always make sure to identify the name of the paper, the section, the date of the edition, and the page the article was on.  You should keep a JPG file of the article (or the original clipping), just in case somebody comes back to you for proof that the article was actually published.
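    One way to make sure those details travel with the text is to put them at the top of the file.  The sketch below is only an illustration: the class name `ClippingSource`, its field names, the layout of the header, and the sample values are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ClippingSource:
    """Where a converted article came from (all field names are hypothetical)."""
    headline: str
    paper: str
    section: str
    edition_date: str
    page: str


def with_source_header(source: ClippingSource, body: str) -> str:
    """Put the headline and the source details above the OCR text."""
    header = [
        source.headline,
        f"{source.paper}, {source.section}, {source.edition_date}, page {source.page}",
        "",
    ]
    return "\n".join(header) + "\n" + body.strip() + "\n"


if __name__ == "__main__":
    # Placeholder values only; substitute the real details of your clipping.
    source = ClippingSource(
        headline="EXAMPLE HEADLINE",
        paper="Example Daily News",
        section="Section A",
        edition_date="2002 03 31",
        page="A7",
    )
    print(with_source_header(source, "Text produced by the OCR software goes here."))
```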

Most of all, read the help files, tutorial or manual that came with your scanner.

Newspaper Article (6cm by 7cm; no enlargement)
File Sizes in kB after scanning
(or after translation to text)
Resolution  File Type *      Graphics  Zipped  ASCII Text  Zipped     RTF  Zipped  Character
75 dpi      JPG, B/W, C           212     206                                      none
75 dpi      JPG, B/W, NC          214     208                                      none
75 dpi      JPG, Colour, C        170     159           3       2     679     300  5.0%
75 dpi      JPG, Colour, NC       170     159           3       2     683     304  5.0%
100 dpi     JPG, B/W, C           348     337                                      2.0%
100 dpi     JPG, B/W, NC          346     336                                      2.0%
100 dpi     JPG, Colour, C        274     256           3       2   1,162     507  75.0%
100 dpi     JPG, Colour, NC       273     256           3       2   1,167     506  75.0%
150 dpi     JPG, B/W, C           708     687                                      none
150 dpi     JPG, B/W, NC          715     693                                      none
150 dpi     JPG, Colour, C        658     617           3       2   2,654   1,208  95.0%
150 dpi     JPG, Colour, NC       660     620           3       2   1,167     506  95.0%
200 dpi     JPG, B/W, C         1,154   1,120                                      none
200 dpi     JPG, B/W, NC        1,165   1,131                                      none
200 dpi     JPG, Colour, C        873     815           3       2      12       3  99.5%
200 dpi     JPG, Colour, NC       877     819           3       2      11       3  99.5%
300 dpi     JPG, B/W, C         2,413   2,336                                      none
300 dpi     JPG, B/W, NC        2,319   2,313                                      none
300 dpi     JPG, Colour, C      2,105   1,975           3       2  10,289   4,682  99.9%
300 dpi     JPG, Colour, NC     2,131   2,001           3       2  10,365   4,744  99.8%

(Each "Zipped" column gives the size of the file type to its left after zipping.)
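
The figures in the table can be turned into ratios with a few lines of code.  The sketch below uses three pairs of numbers copied from the table (in kB); the function name `ratio` is just a label chosen for the example.

```python
def ratio(before_kb: float, after_kb: float) -> float:
    """How many times smaller the second size is than the first."""
    return before_kb / after_kb


if __name__ == "__main__":
    # Values in kB, copied from the table rows named in each label.
    comparisons = [
        ("JPG vs zipped JPG (100 dpi, Colour, C)", 274, 256),
        ("RTF vs zipped RTF (100 dpi, Colour, C)", 1162, 507),
        ("JPG vs ASCII text (200 dpi, Colour, C)", 873, 3),
    ]
    for label, before_kb, after_kb in comparisons:
        print(f"{label}: {ratio(before_kb, after_kb):.1f} times smaller")
```

In these rows, zipping the JPG file saves only about 7 percent, zipping the RTF file shrinks it to less than half, and the plain text file is roughly 290 times smaller than the 200 dpi scan.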
More tips on scanning pages with text

The White Rose
Thoughts are Free

Posted 2002 03 31
Updated 2004 09 06 (to add a link to more tips on scanning text)