Reconstruct PDF from hOCR

This guide explains how to use UniPDF's OCR capabilities to get hOCR output and reconstruct a PDF with selectable text.

This is an advanced use case of UniPDF’s OCR feature. The example extracts the page images from a scanned PDF, sends each image to the OCR server to get hOCR output, and then creates a new PDF with the recognized text placed at the positions reported in the hOCR data.

hOCR is an open standard for representing OCR output. It embeds the recognized text in HTML/XHTML markup together with layout information such as bounding boxes.

Before you begin

You should get your API key from your UniCloud account.

If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.

You also need to have the UniDoc OCR server running. First, clone the ocrserver repository:

git clone https://github.com/unidoc/ocrserver.git

Then, navigate to the ocrserver directory and run the server using Docker Compose:

cd ocrserver
docker-compose up

Clone the project repository

In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.

git clone https://github.com/unidoc/unipdf-examples.git

Navigate to the ocr folder in the unipdf-examples directory.

cd unipdf-examples/ocr

How it works

The reconstruct_pdf_from_hocr.go example shows how to reconstruct a PDF from hOCR output.

Data Structures for Parsing hOCR

The code defines several structs to represent the hierarchical structure of an hOCR document. This makes it easier to parse the XML-based hOCR output.

  • BBox: Represents a bounding box with X0, Y0, X1, Y1 coordinates.
  • TitleAttributes: Contains various attributes parsed from the title field of an hOCR element, such as the bounding box, baseline, font size, and word confidence.
  • OCRWord, OCRLine, OCRPar, OCRCArea, OCRPage: These structs represent the different levels of the hOCR hierarchy, from a single word to a full page. They are annotated with XML tags to allow for easy unmarshalling from the hOCR XML.
  • HOCRDocument: The root of the hOCR document.
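The hierarchy above could be modeled roughly as follows. This is a minimal sketch, not the example's actual definitions: only the struct names come from the guide, while the field names, XML tags, and the sample markup are assumptions made for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Field names and XML tags are illustrative assumptions; only the struct
// names are taken from the guide.

// OCRWord is a single recognized word (an ocrx_word span).
type OCRWord struct {
	Title string `xml:"title,attr"`
	Text  string `xml:",chardata"`
}

// OCRLine is a line of words (an ocr_line span).
type OCRLine struct {
	Title string    `xml:"title,attr"`
	Words []OCRWord `xml:"span"`
}

// OCRPar is a paragraph (an ocr_par p element).
type OCRPar struct {
	Title string    `xml:"title,attr"`
	Lines []OCRLine `xml:"span"`
}

// OCRCArea is a content area (an ocr_carea div).
type OCRCArea struct {
	Title string   `xml:"title,attr"`
	Pars  []OCRPar `xml:"p"`
}

// OCRPage is a full page (an ocr_page div).
type OCRPage struct {
	Title string     `xml:"title,attr"`
	Areas []OCRCArea `xml:"div"`
}

// HOCRDocument is the root html element.
type HOCRDocument struct {
	XMLName xml.Name  `xml:"html"`
	Pages   []OCRPage `xml:"body>div"`
}

// sampleHOCR is a hand-written, simplified hOCR fragment.
const sampleHOCR = `<html>
 <body>
  <div class="ocr_page" title="bbox 0 0 612 792">
   <div class="ocr_carea" title="bbox 36 92 170 116">
    <p class="ocr_par" title="bbox 36 92 170 116">
     <span class="ocr_line" title="bbox 36 92 170 116; baseline 0 -5">
      <span class="ocrx_word" title="bbox 36 92 96 116; x_wconf 95">Hello</span>
      <span class="ocrx_word" title="bbox 104 92 170 116; x_wconf 93">world</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>`

// parseHOCR unmarshals hOCR markup into the struct hierarchy.
func parseHOCR(data []byte) (*HOCRDocument, error) {
	var doc HOCRDocument
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, fmt.Errorf("unmarshal hOCR: %w", err)
	}
	return &doc, nil
}

func main() {
	doc, err := parseHOCR([]byte(sampleHOCR))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Walk page -> area -> paragraph -> line -> word.
	for _, page := range doc.Pages {
		for _, area := range page.Areas {
			for _, par := range area.Pars {
				for _, line := range par.Lines {
					for _, w := range line.Words {
						fmt.Printf("%q with title %q\n", w.Text, w.Title)
					}
				}
			}
		}
	}
}
```

Keeping the raw title string on every level lets a separate step decode the bounding box and other properties only where they are needed.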

Parsing the “title” Attribute

The ParseTitleAttributes function uses regular expressions to parse the title attribute of an hOCR element. In hOCR, this attribute holds a semicolon-separated list of properties that describe the recognized text, such as its bounding box (bbox), baseline, and font size. The function extracts these values and populates a TitleAttributes struct.
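To make the idea concrete, here is a minimal sketch of regex-based title parsing. It is not the example's ParseTitleAttributes: the function name parseTitle, the struct fields, and the exact patterns are assumptions, and only three properties (bbox, x_size or x_fsize, and x_wconf) are handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// BBox is a bounding box in hOCR pixel coordinates.
type BBox struct {
	X0, Y0, X1, Y1 int
}

// TitleAttributes holds a subset of the properties found in a title
// attribute. The field names are assumptions for this sketch.
type TitleAttributes struct {
	BBox           *BBox   // nil when the title has no bbox property
	FontSize       float64 // from x_size or x_fsize, 0 when absent
	WordConfidence int     // from x_wconf, 0 when absent
}

var (
	bboxRe  = regexp.MustCompile(`bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)`)
	sizeRe  = regexp.MustCompile(`x_f?size\s+([\d.]+)`)
	wconfRe = regexp.MustCompile(`x_wconf\s+(\d+)`)
)

// parseTitle extracts known properties from an hOCR title string.
// Conversion errors are ignored, so a malformed number is left at zero.
func parseTitle(title string) TitleAttributes {
	var attrs TitleAttributes
	if m := bboxRe.FindStringSubmatch(title); m != nil {
		b := BBox{}
		b.X0, _ = strconv.Atoi(m[1])
		b.Y0, _ = strconv.Atoi(m[2])
		b.X1, _ = strconv.Atoi(m[3])
		b.Y1, _ = strconv.Atoi(m[4])
		attrs.BBox = &b
	}
	if m := sizeRe.FindStringSubmatch(title); m != nil {
		attrs.FontSize, _ = strconv.ParseFloat(m[1], 64)
	}
	if m := wconfRe.FindStringSubmatch(title); m != nil {
		attrs.WordConfidence, _ = strconv.Atoi(m[1])
	}
	return attrs
}

func main() {
	attrs := parseTitle("bbox 36 92 96 116; x_wconf 95; x_fsize 12.5")
	if attrs.BBox != nil {
		fmt.Printf("bbox: %+v\n", *attrs.BBox)
	}
	fmt.Println("font size:", attrs.FontSize)
	fmt.Println("confidence:", attrs.WordConfidence)
}
```

Using one small pattern per property, rather than a single large expression, keeps the parser tolerant of properties appearing in any order or being absent.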

Main Program Flow

  1. Loading Images (loadImages function): The program starts by extracting all the images from the input PDF file. The loadImages function uses UniPDF’s extractor package to get the images from each page. It also handles page rotation to ensure the images are correctly oriented.

  2. Processing Each Image (processImage function): Each extracted image is then sent to the OCR server. The processImage function configures the OCR request to ask for hOCR output by setting the format form field to hocr.

  3. Parsing the hOCR Output: The JSON response from the server contains the hOCR data in the result field. This data is first extracted from the JSON and then unmarshalled into the HOCRDocument struct using Go’s encoding/xml package.

  4. Writing the Reconstructed PDF (writeContentAsPDF function): This is where the new PDF is created.

    • A new creator.Creator object is instantiated.
    • The page size is set to match the dimensions of the original page, which are obtained from the title attribute of the ocr_page element in the hOCR data.
    • The code then iterates through the parsed hOCR structure (areas, paragraphs, lines, and words).
    • For each word, a new creator.StyledParagraph is created.
    • The position of the paragraph is set using the bounding box coordinates from the hOCR data. This ensures the text appears in the correct location on the page.
    • The font size is also set based on the information from the hOCR data.
    • The styled paragraph is then added to a creator.Division, which is drawn onto the page.
    • Finally, the creator.Creator writes the complete PDF to a file.
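Steps 2 and 3 above can be sketched with the standard library alone. The guide confirms only that the format form field is set to hocr and that the JSON response carries the hOCR in a result field. The endpoint URL, the file field name, and the helper names buildOCRRequest and extractHOCR are assumptions for illustration, so check them against your OCR server before relying on this.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"mime/multipart"
	"net/http"
)

// ocrResponse models only the field the guide mentions.
type ocrResponse struct {
	Result string `json:"result"`
}

// buildOCRRequest creates a multipart POST that uploads an image and asks
// for hOCR output. The "file" field name is an assumption.
func buildOCRRequest(endpoint, filename string, image []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(image); err != nil {
		return nil, err
	}
	// Request hOCR instead of plain text.
	if err := w.WriteField("format", "hocr"); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; it must happen before sending.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, endpoint, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

// extractHOCR pulls the hOCR markup out of the server's JSON response.
func extractHOCR(body []byte) (string, error) {
	var resp ocrResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode OCR response: %w", err)
	}
	if resp.Result == "" {
		return "", errors.New("OCR response has an empty result field")
	}
	return resp.Result, nil
}

func main() {
	// Hypothetical endpoint and placeholder image bytes.
	req, err := buildOCRRequest("http://localhost:8080/file", "page.png", []byte("fake image bytes"))
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println("request:", req.Method, req.URL)
	// A real run would send req with http.DefaultClient.Do and read the body.

	hocr, err := extractHOCR([]byte(`{"result": "<html><body></body></html>"}`))
	if err != nil {
		fmt.Println("extract hOCR:", err)
		return
	}
	fmt.Println("hOCR payload:", hocr)
}
```

The extracted string can then be passed to encoding/xml for unmarshalling into the hOCR structs.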

The main function orchestrates this whole process, iterating through the pages of the input PDF, processing each image, and writing the reconstructed PDF pages to the output directory.
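The positioning logic in step 4 can be illustrated without UniPDF. The sketch below only computes where each word would go; it does not draw anything. The types, the helper names pageSize and placeWords, and the fallback of using the box height when no font size is available are all assumptions, and hOCR coordinates are used as-is with a top-left origin.

```go
package main

import (
	"fmt"
	"strings"
)

// BBox is a bounding box in hOCR coordinates (origin at the top-left).
type BBox struct {
	X0, Y0, X1, Y1 float64
}

// Word is a recognized word with its parsed attributes.
type Word struct {
	Text     string
	Box      BBox
	FontSize float64 // 0 when the hOCR data carries no font size
}

// Placement describes where and how large to draw a word.
type Placement struct {
	Text     string
	X, Y     float64
	FontSize float64
}

// pageSize derives the page dimensions from the ocr_page bounding box.
func pageSize(page BBox) (width, height float64) {
	return page.X1 - page.X0, page.Y1 - page.Y0
}

// placeWords maps words to placements, skipping blank words.
func placeWords(words []Word) []Placement {
	placements := make([]Placement, 0, len(words))
	for _, w := range words {
		if strings.TrimSpace(w.Text) == "" {
			continue
		}
		size := w.FontSize
		if size <= 0 {
			// Assumed fallback: approximate the font size by box height.
			size = w.Box.Y1 - w.Box.Y0
		}
		placements = append(placements, Placement{
			Text:     w.Text,
			X:        w.Box.X0,
			Y:        w.Box.Y0,
			FontSize: size,
		})
	}
	return placements
}

func main() {
	w, h := pageSize(BBox{0, 0, 612, 792})
	fmt.Printf("page size: %.0f x %.0f\n", w, h)

	words := []Word{
		{Text: "Hello", Box: BBox{36, 92, 96, 116}},
		{Text: "world", Box: BBox{104, 92, 170, 116}, FontSize: 12},
	}
	for _, p := range placeWords(words) {
		fmt.Printf("%q at (%.0f, %.0f), size %.0f\n", p.Text, p.X, p.Y, p.FontSize)
	}
}
```

In the actual example, values like these would be applied to each creator.StyledParagraph before it is added to the creator.Division.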

Run the code

Run the code with a scanned PDF file as input:

go run reconstruct_pdf_from_hocr.go /path/to/your/scanned.pdf

This creates a new PDF file in the output directory for each page of the input PDF. Each new PDF has the recognized text overlaid on the original page image, making the text selectable and searchable.
