How to make searchable PDFs

A "searchable PDF" looks like a scanned document, showing all images and text, and its text can also be searched. A non-searchable (or standard) PDF shows all images and text, but the text is not in a format that can be searched.

To make a searchable PDF:

1. Scan your document to produce image files

  1. Take your original document, and use a scanner to produce image scans in JPG format.
  2. Make sure your scans are at least 300dpi (greyscale, or colour if needed). Use 400-600dpi if the text is unusually small.
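To get a feel for what these resolutions mean, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names and the A4 and 6-point examples are assumptions chosen for the sketch, not settings taken from any particular scanner software.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page of the given size in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def text_height_px(points, dpi):
    """Approximate height in pixels of type at the given point size (72 points per inch)."""
    return round(points * dpi / 72)


# A4 is 210 x 297 mm, and there are 25.4 mm in an inch.
A4_IN = (210 / 25.4, 297 / 25.4)

for dpi in (300, 400, 600):
    width_px, height_px = scan_pixels(*A4_IN, dpi)
    print(f"{dpi}dpi: A4 page is {width_px} x {height_px} px, "
          f"6pt text is about {text_height_px(6, dpi)} px tall")
```

Doubling the resolution doubles the number of pixels available for each character, which is why small print tends to benefit from 400-600dpi, at the cost of larger image files.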

2. Clean up the scanned page images

Note that your PDF software may carry out this step; otherwise you will have to use image-editing software.

  1. The scanned page images will not look perfect
  2. De-skew the pages (straighten)
  3. Clean the pages: remove edges, artifacts and shadowing, and boost the contrast if needed
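As an illustration of what "boosting the contrast" means, here is a minimal sketch of a linear contrast stretch on a tiny greyscale image held as a list of rows. It is an assumption-laden toy (the function name and sample values are made up for the example); real image-editing or PDF software will use its own, more sophisticated methods.

```python
def stretch_contrast(pixels):
    """Rescale greyscale values (0-255) so the darkest pixel becomes 0
    and the lightest becomes 255. `pixels` is a list of rows of integers."""
    flat = [value for row in pixels for value in row]
    darkest, lightest = min(flat), max(flat)
    span = lightest - darkest
    if span == 0:
        # A completely flat image has no contrast to stretch.
        return [row[:] for row in pixels]
    # Integer arithmetic (rounding down) keeps the results exact.
    return [[(value - darkest) * 255 // span for value in row] for row in pixels]


# A washed-out scan: "black" ink is only 40 and "white" paper is only 210.
page = [
    [40, 60, 200],
    [50, 190, 210],
]
print(stretch_contrast(page))
```

After stretching, the ink is pushed towards true black and the paper towards true white, which generally makes the characters easier for the text-recognition step to pick out.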

3. Convert your image files to text + images

Your PDF creation software will do this.

  1. Since the page image is not editable and not searchable, you must process it to convert the image of the text into editable text.
  2. In Acrobat PDF software (not Acrobat Reader), from the Tools menu, select "Recognise Text". In other software, it may be called something else, e.g. OCR (Optical Character Recognition), or something to do with "searchable/editable text".
  3. It is important to be able to retain formatting, especially columns. You may be able to specify areas on a page where there is no text (i.e. photos/diagrams).
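To show why retaining columns matters, the sketch below compares two ways of ordering recognised lines of text on a two-column page. It assumes, purely for illustration, that the software reports each line as (x, y, text) with the position in pixels, and that the columns are of equal width; the function names are hypothetical and real OCR software performs far more elaborate layout analysis.

```python
def naive_order(lines):
    """Read strictly top to bottom, then left to right, ignoring columns."""
    return [text for _, _, text in sorted(lines, key=lambda line: (line[1], line[0]))]


def column_order(lines, page_width, columns=2):
    """Read each column top to bottom before moving on to the next column."""
    column_width = page_width / columns

    def sort_key(line):
        x, y, _ = line
        column = min(int(x // column_width), columns - 1)
        return (column, y)

    return [text for _, _, text in sorted(lines, key=sort_key)]


# Four recognised lines on a page 2400 pixels wide, set in two columns.
lines = [
    (100, 100, "Left column, line 1"),
    (1300, 100, "Right column, line 1"),
    (100, 150, "Left column, line 2"),
    (1300, 150, "Right column, line 2"),
]

print(naive_order(lines))               # lines from both columns interleaved
print(column_order(lines, page_width=2400))  # left column first, then right
```

If the column structure is lost, copied text comes out in the interleaved order, which is rarely what a reader wants.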

4. Spellcheck and edit your text

After you have produced editable text from the images, you will have the option to edit and correct it, in particular:

  • Spelling. There will be lots of errors.
  • Remove extraneous text, especially where the software has tried to turn images into text
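If your software has no built-in spellchecker, likely recognition errors can be flagged with a short script. The following is a rough sketch using Python's standard difflib module; the word list, the sample words and the function name are invented for the example, and a real check would need a full dictionary and a human to confirm each change.

```python
import difflib


def suggest_corrections(words, vocabulary, cutoff=0.75):
    """Map each word that is not in the vocabulary to its closest known word, if any."""
    known = set(vocabulary)
    suggestions = {}
    for word in words:
        lowered = word.lower()
        if lowered in known:
            continue
        matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
        if matches:
            suggestions[word] = matches[0]
    return suggestions


# Typical recognition slips: the digit "1" for the letter "l", and "rn" for "m".
vocabulary = ["searchable", "document", "scanned", "text"]
ocr_words = ["searchab1e", "docurnent", "text"]

print(suggest_corrections(ocr_words, vocabulary))
```

Words with no close match are left out of the result, so anything the software invented from a photo or diagram still has to be removed by hand.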

5. Produce PDFs

When you are done, save your text and images as "searchable PDFs". If this is done correctly, the editable text will be an invisible transparent layer on top of the scanned page image:

  1. The resulting PDF pages will look like your scanned page images
  2. The text will be searchable, and you will be able to copy and paste the transparent text layer

There is often an option to save your PDF as:

  1. Text only, in which case it will not look as good as the scanned image, and you may lose any pictures.
  2. Images only, in which case the PDF will not be searchable, and you will not be able to copy any text.
  3. Text + image, sometimes called "searchable PDF". This is the option you want.

 

Checklist

  1. Make sure your scans are at least 300dpi greyscale. If the text on the page is small, and you are not getting good text extraction, use 400-600dpi. If there are colour images, then scan as colour, otherwise stick with greyscale.
  2. Clean up the scanned images, as these will be displayed in the final PDF
  3. When you convert the scanned images to images + text, make sure you retain formatting such as columns. Otherwise, when you try to copy text, the top line of each column will be concatenated into a single line, and so on down the page.
  4. Make sure you save as "searchable PDF" (sometimes called text + images).

Notes

Some software manages all stages of the process, and you will be able to save the "project" before completing all the steps. Save regularly!

Some software will let you do multiple steps at a time. That is, you won't have to scan one page, clean it up, extract the editable/searchable text, edit it, save it, and repeat. You'll be able to scan all the pages first, then clean them all up, then extract text from them all, edit them all, and save all the editable pages as one searchable PDF.

You may need separate software for (a) scanning, (b) cleaning the images, and (c) extracting the text from the images and saving as searchable PDFs.