Reconstruct Text from PDFs

This guide demonstrates how accurately UniPDF extracts text from a PDF. The extractor package extracts the text from the input PDF, and the creator package reconstructs it by writing the text for each page to a new PDF.

The Reconstruct words from PDF example displays the position of words in the reconstructed text PDF.

Note: Only text in a PDF will be reconstructed.

Sample input

PDF text to reconstruct

Before you begin

You should get your API key from your UniCloud account.

If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.

Project setup

Clone the project repository

In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.

git clone https://github.com/unidoc/unipdf-examples.git

Navigate to the extract folder in the unipdf-examples directory.

cd unipdf-examples/extract

Configure environment variables

Set the UNIDOC_LICENSE_API_KEY environment variable to your API key from your UniCloud account, replacing PUT_YOUR_API_KEY_HERE in the command for your platform.

Linux/Mac

export UNIDOC_LICENSE_API_KEY=PUT_YOUR_API_KEY_HERE

Windows

set UNIDOC_LICENSE_API_KEY=PUT_YOUR_API_KEY_HERE

How it works

Lines 14-22 import the UniPDF packages and other required dependencies.

Lines 24-31 use the init function to authenticate your requests with your UNIDOC_LICENSE_API_KEY.
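
Because the program depends on the environment variable set earlier, it can help to confirm the key is visible to Go before running the example. The following is a minimal sketch using only the standard library. The helper checkLicenseKey is hypothetical and not part of UniPDF, and the sketch only verifies that the variable is present; the example program itself passes the key to UniPDF in its init function.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// licenseEnvVar is the environment variable name used throughout this guide.
const licenseEnvVar = "UNIDOC_LICENSE_API_KEY"

// checkLicenseKey is a hypothetical helper (not part of UniPDF) that rejects
// an empty or whitespace-only key.
func checkLicenseKey(key string) error {
	if strings.TrimSpace(key) == "" {
		return errors.New(licenseEnvVar + " is not set")
	}
	return nil
}

func main() {
	key := os.Getenv(licenseEnvVar)
	if err := checkLicenseKey(key); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	// Assumption: the real example hands the key to UniPDF at this point.
	// Only the key length is printed so the secret never reaches the terminal.
	fmt.Printf("found API key (%d characters)\n", len(key))
}
```

Printing the length rather than the key itself keeps the credential out of terminal history and logs.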

The main function in lines 33-47 validates your input and passes it as an argument to the reconstruct function.
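
The input validation described above could look roughly like the sketch below. The helper inputPathFromArgs is a hypothetical name, and the .pdf extension check is an assumption added for illustration; the example in the repository may validate its arguments differently.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// inputPathFromArgs is a hypothetical helper. It expects exactly one argument
// after the program name and returns it as the input PDF path.
func inputPathFromArgs(args []string) (string, error) {
	if len(args) != 1 {
		return "", errors.New("usage: go run reconstruct_text.go input.pdf")
	}
	path := args[0]
	// Assumption: rejecting non-.pdf names is an extra check for illustration.
	if !strings.EqualFold(filepath.Ext(path), ".pdf") {
		return "", fmt.Errorf("%q does not have a .pdf extension", path)
	}
	return path, nil
}

func main() {
	inputPath, err := inputPathFromArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The guide's program would pass inputPath to its reconstruct function here.
	fmt.Println("input PDF:", inputPath)
}
```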

Lines 49-103 define the reconstruct function, which takes the inputPath as an argument. The function extracts the text from each page of the PDF with the extractor package, reconstructs it, and writes it page by page to the output PDF, reconst.pdf.
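
The page-by-page flow can be pictured with the sketch below. It does not use UniPDF: the extract and write callbacks are hypothetical stand-ins for the extractor and creator packages, and in-memory strings take the place of PDF pages, so only the shape of the loop is shown.

```go
package main

import "fmt"

// reconstructPages is a hypothetical outline of the per-page loop. extract
// stands in for reading a page's text and write stands in for adding a
// text-only page to the output. Pages are numbered from 1.
func reconstructPages(
	numPages int,
	extract func(page int) (string, error),
	write func(page int, text string) error,
) error {
	for n := 1; n <= numPages; n++ {
		text, err := extract(n)
		if err != nil {
			return fmt.Errorf("extract page %d: %w", n, err)
		}
		if err := write(n, text); err != nil {
			return fmt.Errorf("write page %d: %w", n, err)
		}
	}
	return nil
}

func main() {
	// In-memory stand-ins for the input and output documents.
	source := []string{"Text of page one", "Text of page two"}
	output := make([]string, len(source))

	err := reconstructPages(len(source),
		func(n int) (string, error) { return source[n-1], nil },
		func(n int, text string) error { output[n-1] = text; return nil },
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, text := range output {
		fmt.Printf("page %d: %s\n", i+1, text)
	}
}
```

Stopping at the first failing page keeps the error message tied to a specific page number, which makes a bad page in the input easier to locate.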

Run the code

Run this command to reconstruct the text in a PDF. The command also downloads the dependencies the program requires.

go run reconstruct_text.go input.pdf

Sample output

The program writes the text of each page of the input file to a new PDF. The output resembles the input PDF, except that it contains only text.

Reconstructed Text
