Reconstruct PDF from hOCR
This guide demonstrates an advanced use of UniPDF’s OCR feature: reconstructing a PDF with selectable text from a scanned PDF. It extracts the images from the PDF, requests hOCR output for each image from the OCR server, and creates a new PDF with the recognized text positioned to match the original layout.
hOCR is an open format that represents OCR output as HTML/XML markup. It carries the recognized text together with layout information such as bounding boxes.
Before you begin
You should get your API key from your UniCloud account.
If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.
You also need to have the UniDoc OCR server running.
First, clone the ocrserver repository:
git clone https://github.com/unidoc/ocrserver.git
Then, navigate to the ocrserver directory and run the server using Docker Compose:
cd ocrserver
docker-compose up
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.
git clone https://github.com/unidoc/unipdf-examples.git
Navigate to the ocr folder in the unipdf-examples directory.
cd unipdf-examples/ocr
How it works
The reconstruct_pdf_from_hocr.go example shows how to reconstruct a PDF from hOCR output.
Data Structures for Parsing hOCR
The code defines several structs to represent the hierarchical structure of an hOCR document. This makes it easier to parse the XML-based hOCR output.
- BBox: Represents a bounding box with X0, Y0, X1, Y1 coordinates.
- TitleAttributes: Contains various attributes parsed from the title field of an hOCR element, such as the bounding box, baseline, font size, and word confidence.
- OCRWord, OCRLine, OCRPar, OCRCArea, OCRPage: These structs represent the different levels of the hOCR hierarchy, from a single word to a full page. They are annotated with XML tags to allow for easy unmarshalling from the hOCR XML.
- HOCRDocument: The root of the hOCR document.
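As a rough illustration of the hierarchy described above, the sketch below declares a set of XML-tagged structs and unmarshals a small hand-written hOCR fragment into them. The struct names follow the list above, but the field names, the XML tag paths, the parseHOCR helper, and the sample markup are assumptions made for this sketch; the definitions in reconstruct_pdf_from_hocr.go may differ.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Field names and tag paths below are assumptions for illustration only.

// OCRWord maps to <span class="ocrx_word" title="...">text</span>.
type OCRWord struct {
	Title string `xml:"title,attr"`
	Text  string `xml:",chardata"`
}

// OCRLine maps to <span class="ocr_line"> and holds its word spans.
type OCRLine struct {
	Title string    `xml:"title,attr"`
	Words []OCRWord `xml:"span"`
}

// OCRPar maps to <p class="ocr_par"> and holds its line spans.
type OCRPar struct {
	Title string    `xml:"title,attr"`
	Lines []OCRLine `xml:"span"`
}

// OCRCArea maps to <div class="ocr_carea"> and holds its paragraphs.
type OCRCArea struct {
	Title string   `xml:"title,attr"`
	Pars  []OCRPar `xml:"p"`
}

// OCRPage maps to <div class="ocr_page"> and holds its content areas.
type OCRPage struct {
	Title string     `xml:"title,attr"`
	Areas []OCRCArea `xml:"div"`
}

// HOCRDocument is the root <html> element.
type HOCRDocument struct {
	XMLName xml.Name  `xml:"html"`
	Pages   []OCRPage `xml:"body>div"`
}

// sampleHOCR is a hand-written fragment, not real OCR server output.
const sampleHOCR = `<html><body>
<div class="ocr_page" title="bbox 0 0 640 480">
 <div class="ocr_carea" title="bbox 10 10 300 60">
  <p class="ocr_par" title="bbox 10 10 300 60">
   <span class="ocr_line" title="bbox 10 10 300 60; baseline 0 -5">
    <span class="ocrx_word" title="bbox 10 10 120 60; x_wconf 96">Hello</span>
    <span class="ocrx_word" title="bbox 140 10 300 60; x_wconf 91">world</span>
   </span>
  </p>
 </div>
</div>
</body></html>`

// parseHOCR unmarshals hOCR markup into the struct hierarchy.
func parseHOCR(data []byte) (*HOCRDocument, error) {
	var doc HOCRDocument
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, fmt.Errorf("unmarshal hOCR: %w", err)
	}
	return &doc, nil
}

func main() {
	doc, err := parseHOCR([]byte(sampleHOCR))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Walk page -> area -> paragraph -> line -> word.
	for _, page := range doc.Pages {
		for _, area := range page.Areas {
			for _, par := range area.Pars {
				for _, line := range par.Lines {
					for _, word := range line.Words {
						fmt.Printf("%q -> %s\n", word.Text, word.Title)
					}
				}
			}
		}
	}
}
```

Because each level keeps its raw title string, the bounding box and other properties can be parsed from it in a separate step.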
Parsing the “title” Attribute
The ParseTitleAttributes function uses regular expressions to parse the title attribute of an hOCR element. This attribute contains a wealth of information about the recognized text, such as its bounding box, baseline, and font size. The function extracts this information and populates a TitleAttributes struct.
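A minimal sketch of this kind of regex-based parsing is shown below. It is not the example's actual implementation: the regular expressions, the TitleAttributes field names, and the choice of x_size as the font size property are assumptions for illustration. The property names themselves (bbox, baseline, x_size, x_wconf) are ones commonly found in hOCR title attributes.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// BBox is a bounding box in image pixel coordinates.
type BBox struct {
	X0, Y0, X1, Y1 int
}

// TitleAttributes holds the properties this sketch extracts.
// The field names are hypothetical.
type TitleAttributes struct {
	BBox           BBox
	HasBBox        bool
	BaselineSlope  float64
	BaselineOffset float64
	FontSize       float64 // taken from x_size (an assumption)
	WordConf       float64 // taken from x_wconf
}

const num = `(-?\d+(?:\.\d+)?)`

var (
	bboxRe     = regexp.MustCompile(`(?:^|;\s*)bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)`)
	baselineRe = regexp.MustCompile(`baseline\s+` + num + `\s+` + num)
	sizeRe     = regexp.MustCompile(`x_size\s+` + num)
	wconfRe    = regexp.MustCompile(`x_wconf\s+` + num)
)

// atoi and atof ignore conversion errors because the patterns
// above only capture well-formed numbers.
func atoi(s string) int {
	n, _ := strconv.Atoi(s)
	return n
}

func atof(s string) float64 {
	f, _ := strconv.ParseFloat(s, 64)
	return f
}

// ParseTitleAttributes extracts known properties from a title string.
// Properties that are absent keep their zero value.
func ParseTitleAttributes(title string) TitleAttributes {
	var attrs TitleAttributes
	if m := bboxRe.FindStringSubmatch(title); m != nil {
		attrs.BBox = BBox{atoi(m[1]), atoi(m[2]), atoi(m[3]), atoi(m[4])}
		attrs.HasBBox = true
	}
	if m := baselineRe.FindStringSubmatch(title); m != nil {
		attrs.BaselineSlope = atof(m[1])
		attrs.BaselineOffset = atof(m[2])
	}
	if m := sizeRe.FindStringSubmatch(title); m != nil {
		attrs.FontSize = atof(m[1])
	}
	if m := wconfRe.FindStringSubmatch(title); m != nil {
		attrs.WordConf = atof(m[1])
	}
	return attrs
}

func main() {
	line := ParseTitleAttributes("bbox 36 92 580 122; baseline 0.015 -6; x_size 31.5")
	word := ParseTitleAttributes("bbox 36 92 96 116; x_wconf 95")
	fmt.Printf("line: %+v\n", line)
	fmt.Printf("word: %+v\n", word)
}
```

The HasBBox flag is a design choice of this sketch: it distinguishes a missing bbox from a legitimate box at the origin, which a zero-valued BBox alone cannot do.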
Main Program Flow
1. Loading Images (loadImages function): The program starts by extracting all the images from the input PDF file. The loadImages function uses UniPDF’s extractor package to get the images from each page. It also handles page rotation to ensure the images are correctly oriented.
2. Processing Each Image (processImage function): Each extracted image is then sent to the OCR server. The processImage function configures the OCR request to ask for hOCR output by setting the format form field to hocr.
3. Parsing the hOCR Output: The JSON response from the server contains the hOCR data in the result field. This data is first extracted from the JSON and then unmarshalled into the HOCRDocument struct using Go’s encoding/xml package.
4. Writing the Reconstructed PDF (writeContentAsPDF function): This is where the new PDF is created.
   - A new creator.Creator object is instantiated.
   - The page size is set to match the dimensions of the original page, which are obtained from the title attribute of the ocr_page element in the hOCR data.
   - The code then iterates through the parsed hOCR structure (areas, paragraphs, lines, and words).
   - For each word, a new creator.StyledParagraph is created.
   - The position of the paragraph is set using the bounding box coordinates from the hOCR data. This ensures the text appears in the correct location on the page.
   - The font size is also set based on the information from the hOCR data.
   - The styled paragraph is then added to a creator.Division, which is drawn onto the page.
   - Finally, the creator.Creator writes the complete PDF to a file.
The main function orchestrates this whole process, iterating through the pages of the input PDF, processing each image, and writing the reconstructed PDF pages to the output directory.
Run the code
Run the code with a scanned PDF file as input:
go run reconstruct_pdf_from_hocr.go /path/to/your/scanned.pdf
This will create a new PDF file in the output directory for each page of the input PDF. The new PDF will have the extracted text overlaid on the original images, making the text selectable and searchable.