How to Transcribe documents

Scanning documents, journals and books to create PDFs on the Marxists Internet Archive [v2.0]

Introduction

When the MIA started out in the mid-1990s, bandwidth was generally limited to a few thousand “baud” over dial-up modems. For this reason the larger files that the now ubiquitous PDFs presented were basically banned from the MIA (and from many other web sites of the time). With broadband now almost universal, file size is no longer as much of a concern as it once was.

“PDF” stands for “Portable Document Format” and is now widely used for everything from legal contracts (the reason it was actually developed) to all manner of text and graphics. Unlike a “Web page”, designed using HTML, PDFs are not “text based”. All HTML pages are essentially the same as a simple unformatted text page. PDFs are basically graphic images of text, though creating PDFs “as text” is certainly possible and is often done. PDFs have in many respects become the basic word processing format for many entities, equal to that of “Word” (docx) and other word processing formats.

The MIA is still “web based” and there are a lot of reasons for this to continue, as all landing pages on the Internet are still composed of HTML-type text files. It will likely remain this way for some time, even as HTML and “Markup Language” evolve.

Many thousands of documents on the MIA for various writers are now in PDF format, and most periodicals are presented in PDF format. It is likely this will continue. There are many advantages to using HTML over PDF in developing writers’ libraries, the core of the Marxists Internet Archive. This will also continue, as it should, for some time. But we are no longer an HTML-only site. We incorporate PDFs extensively, especially, as noted above, in our newspaper/journal pages, of which there are many. So this tutorial focuses on how to scan documents and papers for creating PDFs.

The meat of the issue

Creating PDFs can be done in different ways. Almost all of the documents on the MIA’s massive Eugene V. Debs Internet Archive (https://www.marxists.org/archive/debs/) are in PDF format. They were not scanned in; they were keyed in by Tim Davenport on his computer. This created excellently formatted documents that are small relative to graphic-based, scanned-in PDFs. We do not expect volunteers for the MIA to do this, though if that is their choice, it is up to them. An argument can be made that if one were willing to key in a book or magazine article one keystroke at a time, one could just as easily make an HTML web page. However, that is not how most PDFs on the MIA, especially those made from scanning newspapers and other periodicals, are created. To explain this, one has to accept some generalities about scanning. This is because scanning adjustments depend on the software of whatever scanner the volunteer is using. And there are many, many scanner models and manufacturers, each with their own “drivers” or scanning software.

So…let us begin with these generalities.

A tabloid size newspaper requires a tabloid sized flatbed scanner. That is, one that is 11 inches/28 cm by 17 inches/43 cm. Such scanners are of course more expensive than their letter sized counterparts. But excellent ones, both used and new, can be had for under $1000 USD. It obviously requires a level of dedication to spend this much money. But if one is going to scan the huge number of such papers in the world of Marxism and socialism, then one should be quite dedicated from the get-go! What applies to this size scanner also applies to creating PDFs with letter sized ones. A note here: try to acquire, of whatever size, single-purpose scanners. Many scanners are marketed as “all in one” fax, copy, and scan machines. Avoid these, as their software is vastly inferior to the software one gets with even cheap dedicated flatbed scanners.

Black & White, Grayscale, or Color?

Most newspapers were printed in black and white. Thus, reproducing them requires that you set the scanner software to black and white (“B&W”). The default in many scanners’ software, even the best ones, is color. Many volunteers believe that the aged, “toned” (browned through oxidation) pages of old papers should be reproduced. No, this aging should NOT be reproduced. Scans should reflect the original format and color as closely as possible. A normal newspaper page that comes to about 100 KB when scanned with the software set to B&W will be approx. 2.4 MB if scanned in color. The same would hold true for pamphlets and books if scanning these for PDFs. Always use B&W unless the situation calls for some color scanning.

If a periodical has lots of photographs, experiment with switching the software between B&W and grayscale. Grayscale (GS) can deliver cleaner, less fuzzy looking photographs than B&W. Pages within the same document can be a mix of B&W and GS; this is not a problem. But only use GS if there are many photographs that justify it. Most books going up on the MIA have no photos at all and should remain scanned as B&W.

Color and Spot Color

Many post-1940 tabloid socialist journals, and many or most “digest” sized magazines, use “spot color”. Spot color is when some of the text is printed in something other than black, or a block of color is used with “white” letters or something similar. Often spot color is used only on the front and inside cover of the magazine or newspaper, and perhaps the rear cover as well, since the front and back pages are often printed together as one sheet. It is OK to scan these pages in color. This is the volunteer’s choice and we make no recommendation here. Just remember to switch back to B&W after scanning the color pages.

In a few cases newspapers or magazines are published with full color photographs. This is very exceptional but it does happen. It makes sense at that point to consider scanning in color to turn these papers into full color PDFs. But even here we do not recommend it. The MIA’s emphasis has always been on text, not format (layout, styles, etc.). Remember that scanning anything in “24 bit color” will result in PDFs that are 24 times the size of the same scan done in B&W.
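
The size difference between scanning modes follows from the number of bits stored for each pixel. The following is a rough sketch of that arithmetic, not a description of any scanner's software. It computes uncompressed sizes only; real PDFs are compressed, so actual files will be much smaller, and the function name is made up for illustration.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page (illustrative only)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# A tabloid page (11 x 17 inches) scanned at 400 dpi in each mode.
# B&W stores 1 bit per pixel, grayscale 8 bits, full color 24 bits.
for name, bits in [("B&W", 1), ("grayscale", 8), ("24-bit color", 24)]:
    size_mb = raw_scan_bytes(11, 17, 400, bits) / 1_000_000
    print(f"{name}: {size_mb:.1f} MB uncompressed")
```

Whatever the page size or resolution, the color figure comes out at 24 times the B&W figure, because only the bits per pixel change.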

Paper size

This is very important. In some software the default setting is letter size. But not all magazines are in fact letter sized. Some are digest size, which can vary from 5x7 inches to 7x10 inches. If the software allows it, please set the “paper size” as close as possible to the actual size of the material you are scanning. The same holds true for tabloid or larger journals you choose to scan. If you are lucky enough to acquire a tabloid size scanner, such as the Epson GT-2000, the software for these scanners often defaults to letter size material. Please check. Tabloid is a general term for paper that is approximately 11” x 17”, though it can vary by as much as an inch in either direction. As noted above, these figures equate to 28cm x 43cm [279mm x 432mm] in metric. In the UK this often equates to “A3” size paper. Again, these are not exact measurements.
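
If your scanning software asks for the paper size in millimetres or shows the scan area in pixels, the conversion from inches is simple. This is a small sketch using the nominal sizes mentioned above; the function name and the choice of 400 dpi are only for illustration.

```python
MM_PER_INCH = 25.4

def scan_area(width_in, height_in, dpi):
    """Return the page size in whole millimetres and in pixels at the given dpi."""
    size_mm = (round(width_in * MM_PER_INCH), round(height_in * MM_PER_INCH))
    size_px = (round(width_in * dpi), round(height_in * dpi))
    return size_mm, size_px

# Nominal sizes in inches; real publications vary, so measure your material.
sizes = {
    "digest (small)": (5, 7),
    "digest (large)": (7, 10),
    "letter": (8.5, 11),
    "tabloid": (11, 17),
}
for name, (w, h) in sizes.items():
    mm, px = scan_area(w, h, 400)
    print(f"{name}: {mm[0]} x {mm[1]} mm, {px[0]} x {px[1]} px at 400 dpi")
```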

DPI/Dots per inch

DPI refers simply to the resolution of a scan, measured as the number of “dots” per inch of the page. A scanner has a kind of “base” DPI, which software magic can push to a higher number. The lowest one will find on very cheap scanners is about 300dpi. Most newer ones have a base scan capability of 400dpi to 600dpi. Again, software can increase this, but we recommend you don’t do this. Stay at the maximum base resolution of the scanner you acquire.

People will view the text of any book, pamphlet, or journal on their computer screens, which are never more than 72dpi to about 100dpi on some models. (One can learn a lot more by reading this: https://largeprinting.com/resources/image-resolution-and-dpi.html) So why should we be scanning at 400dpi and higher? There are four basic reasons:

First, while likely most users of MIA material will view the material on the computer, much of this text is very small, often 9pt Times-Roman, which is very common in older material. This means that often enough people will increase their viewing size by “zooming in” from 100% to, say, 150% or more to read the small text. The higher the DPI resolution the less “fuzzy” or blurred the text will appear on the computer screen when a user zooms in.

Secondly, printing. When PDF was developed, its main purpose was to let the user print exactly what is seen on the screen. This is why the legal profession was the first to adopt PDFs as the standard for legal documentation. The higher the resolution, the smoother the text and photos will be when printed. That said, we fully expect that 99.9% of MIA users will view material on the screen. Please also see the fourth point below, on art work.

Thirdly, Optical Character Recognition or OCR. This is a software tool that can “read text” in a PDF document and embed it in the PDF. OCR is an extremely valuable tool. Instead of just a flat but readable and printable image on the screen, a PDF with OCR allows the user to select and copy all or part of the visible text and paste it into a word processor. An OCR program has to read each individual letter in a PDF image, or, at an even smaller scale, the individual pixels making up that character. The higher the resolution, the fewer errors the OCR will make when converting the image of the letters into actual text. On the MIA, some volunteers will “harvest” the text from PDFs and create web pages for individual writers from a journal. This is how most of the text on the MIA is produced: via OCRing of various texts, often directly from PDFs.

Some scanning software does this for you. Often people open up the PDF in Acrobat and use this tool to OCR the recently created PDF and embed that text in the document.

Fourthly, art work. Drawings, cartoons, and other forms of illustration look better on both the screen and when printed when scanned at the highest possible resolution. Some of the art work on the MIA is scanned at 1200dpi. Not whole newspapers or books, but at least individual pages where there is a cartoon or other illustrations. Please consider this when creating art-heavy PDFs for the MIA.
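
To make the first point above (zooming) concrete, here is a rough sketch of how scan resolution and zoom interact. It assumes a screen of about 96 pixels per inch and a viewer that shows one inch of page as one inch of screen at 100% zoom. Both are assumptions for illustration, real screens and viewers differ, and the function name is hypothetical.

```python
def scan_pixels_per_screen_pixel(scan_dpi, zoom_percent, screen_ppi=96):
    """How many scanned pixels map onto one screen pixel at a given zoom.

    screen_ppi=96 is an assumed, typical value. A result of 1.0 or more
    means the scan holds at least as much detail as the screen can show;
    below 1.0 the viewer has to stretch the scan, and text looks blurred.
    """
    screen_pixels_per_page_inch = screen_ppi * zoom_percent / 100
    return scan_dpi / screen_pixels_per_page_inch

for dpi in (150, 400):
    for zoom in (100, 150, 300):
        ratio = scan_pixels_per_screen_pixel(dpi, zoom)
        print(f"{dpi} dpi at {zoom}% zoom: {ratio:.2f} scan pixels per screen pixel")
```

Under these assumptions a low-resolution scan runs out of detail well before a 400dpi scan does as the reader zooms in.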

Saving as PDF

There are different kinds of PDF format. The standard, default one (plain, ordinary "pdf") is the one preferred by MIA. When using scanning software, or a word or text processor, this setting should be the one used. Adobe Acrobat, the premier and original PDF creation tool, does give one options in this area. Please avoid them. In particular, please do not use "PDF/A" format. Some of the software, after scanning several pages, lets one save the document as a PDF. Please do so and make no other adjustments.

Three Additional short takes on tools, utilities, operating systems...

1. Creating PDFs of books using Linux: a mini-tutorial

You can scan most books or pamphlets using a scanner built into a multi-purpose printer, and the resolution is normally high enough to be usable for OCR. Be aware that if your book is longer than a couple of hundred pages you are very likely to have to break the spine while flattening the book to scan the centre of the pages. OCR is very sensitive to orientation; it is important to keep the book at right angles to the scanner all the way through.

The scanning driver software with the largest coverage of different scanners is XSane, which should be available on any Linux distribution. Once scanned, check the quality of the output: if the centre of the pages is poorly scanned, it may help the OCR to split each pair of pages into two using graphical software such as ImageMagick. Use the 'convert' utility from ImageMagick to join the page images together to make the initial PDF. A PDF created like this may be very large; you can shrink it to a reasonable size using ghostscript's 'ebook' quality setting (this is good enough for OCR).
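
The two steps of joining the page images and shrinking the result can be sketched as follows. This is one possible way to script it, not a required procedure. The file names are hypothetical, and on the command line Ghostscript's quality preset is spelled /ebook. Newer ImageMagick releases may expect the command 'magick' in place of 'convert'.

```python
import glob
import subprocess

def images_to_pdf_cmd(image_files, output_pdf):
    """ImageMagick: join page images, in the order given, into a single PDF."""
    return ["convert", *image_files, output_pdf]

def shrink_pdf_cmd(input_pdf, output_pdf):
    """Ghostscript: rewrite the PDF using the 'ebook' quality preset."""
    return [
        "gs", "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={output_pdf}", input_pdf,
    ]

if __name__ == "__main__":
    # Hypothetical file names. Zero-pad the page numbers (page-001.png,
    # page-002.png, ...) so that sorting by name gives the reading order.
    pages = sorted(glob.glob("page-*.png"))
    if pages:
        subprocess.run(images_to_pdf_cmd(pages, "book-large.pdf"), check=True)
        subprocess.run(shrink_pdf_cmd("book-large.pdf", "book.pdf"), check=True)
```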

The best OCR software available for Linux is Tesseract: its output is now similar in quality to that of the commercial ABBYY software, but make sure you have a recent version and are using the right language data for your text. Tesseract can be used to generate text which will be used for HTML (after proofreading); but once it is installed, you should also use OCRmyPDF to add OCR-ed text from Tesseract to the PDF document you have created, making it searchable.
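
A minimal sketch of the OCRmyPDF step, assuming the 'ocrmypdf' command and the matching Tesseract language data are installed. The file names are hypothetical and the helper function is only for illustration.

```python
import os
import shutil
import subprocess

def ocr_cmd(input_pdf, output_pdf, language="eng"):
    """OCRmyPDF: add a Tesseract text layer to a scanned PDF.

    `language` is a Tesseract language code such as "eng" or "deu";
    several can be combined with "+", for example "eng+deu".
    """
    return ["ocrmypdf", "-l", language, input_pdf, output_pdf]

if __name__ == "__main__":
    src, dst = "book.pdf", "book-ocr.pdf"  # hypothetical file names
    cmd = ocr_cmd(src, dst, language="eng")
    if shutil.which("ocrmypdf") and os.path.exists(src):
        subprocess.run(cmd, check=True)
    else:
        # Nothing to process here; just show the command that would be run.
        print("Would run:", " ".join(cmd))
```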

The final stage in creating the PDF is adding the metadata, which should include the correct author and title, as well as a link to MIA in the keywords field. MIA does not claim copyright over PDFs created this way; this is just for information about the source.
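
The text above does not name a tool for adding metadata. One possibility, offered only as a sketch, is pdftk's 'update_info' operation, which reads the fields from a small text file. The helper names and the placeholder title and author below are made up for illustration.

```python
def pdftk_info(title, author, keywords):
    """Build the text of a pdftk 'update_info' file for three Info fields."""
    fields = {"Title": title, "Author": author, "Keywords": keywords}
    lines = []
    for key, value in fields.items():
        lines += ["InfoBegin", f"InfoKey: {key}", f"InfoValue: {value}"]
    return "\n".join(lines) + "\n"

def update_info_cmd(input_pdf, info_file, output_pdf):
    """pdftk: write a copy of input_pdf with the Info fields from info_file."""
    return ["pdftk", input_pdf, "update_info", info_file, "output", output_pdf]

# Placeholder values; substitute the real author and title of the work.
print(pdftk_info("Example Title", "Example Author", "https://www.marxists.org/"))
print(" ".join(update_info_cmd("book-ocr.pdf", "info.txt", "book-final.pdf")))
```

Save the generated text to a file (here the hypothetical info.txt) before running the pdftk command. For titles with non-ASCII characters, pdftk's 'update_info_utf8' variant may be the better choice.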

If a PDF created this way is of reasonable quality, link to the PDF from the front page of the HTML version; there is no reason not to have both.

2. Short takes on bringing over PDFs from archive.org

PDFs from archive.org are usually good quality with embedded OCR already. But they also often have many unwanted front and back pages with library cards, scribbles by previous owners, etc. These should be trimmed before the PDF is added to MIA (on Linux this is easily done using the pdftk utility). The original metadata from archive.org should be left intact, but you can add a MIA URL to the keywords.
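
Trimming with pdftk comes down to its 'cat' operation with a page range. A small sketch follows; the file names and page numbers are hypothetical and the helper is only for illustration.

```python
def trim_cmd(input_pdf, output_pdf, first_page, last_page):
    """pdftk: keep only pages first_page..last_page (1-based, inclusive)."""
    if not 1 <= first_page <= last_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return ["pdftk", input_pdf, "cat", f"{first_page}-{last_page}", "output", output_pdf]

# Example: drop four unwanted front pages and everything after page 212.
print(" ".join(trim_cmd("download.pdf", "trimmed.pdf", 5, 212)))
```

After trimming, it is worth checking that the original archive.org metadata is still present in the output file.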

3. PDFs from 'the internet'

There are many PDFs of scanned books available on the internet which don't have OCR already. Most of these are in copyright, but if not, they may be usable for MIA. Use OCRmyPDF to add text, or the utility pdfimages to split the pdf into separate image pages, which can then be treated as if you had scanned them yourself. The metadata will normally need adding from scratch.

Contact the Marxists Internet Archive Admin Committee for further information