Reconstruct Text from PDFs
This guide demonstrates how accurately text can be extracted from a PDF. The extractor package extracts the text from the input PDF, and the creator package reconstructs it by writing the text for each page to a new PDF.
The Reconstruct words from PDF example displays the position of words in the reconstructed text PDF.
Note: Only text in a PDF will be reconstructed.
Sample input
Before you begin
You should get your API key from your UniCloud account.
If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.
Project setup
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.
git clone https://github.com/unidoc/unipdf-examples.git
Navigate to the extract folder in the unipdf-examples directory.
cd unipdf-examples/extract
Configure environment variables
Set the UNIDOC_LICENSE_API_KEY environment variable to the API key from your UniCloud account.
Linux/Mac
export UNIDOC_LICENSE_API_KEY=PUT_YOUR_API_KEY_HERE
Windows
set UNIDOC_LICENSE_API_KEY=PUT_YOUR_API_KEY_HERE
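Before running the example, it can help to confirm that the variable is visible to Go programs started from the same terminal. The following is a minimal sketch using only the standard library; checkLicenseKey is a hypothetical helper written for this guide and is not part of UniPDF. It only checks that a value is present and is not the placeholder. It does not verify the key with UniCloud.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// licenseKeyEnv is the environment variable name used in this guide.
const licenseKeyEnv = "UNIDOC_LICENSE_API_KEY"

// placeholder is the sample value shown in the export/set commands.
const placeholder = "PUT_YOUR_API_KEY_HERE"

// checkLicenseKey is a hypothetical helper (not a UniPDF function).
// It rejects an empty value and the unedited placeholder.
func checkLicenseKey(value string) error {
	v := strings.TrimSpace(value)
	switch {
	case v == "":
		return errors.New(licenseKeyEnv + " is not set")
	case v == placeholder:
		return errors.New(licenseKeyEnv + " still holds the placeholder value")
	}
	return nil
}

func main() {
	if err := checkLicenseKey(os.Getenv(licenseKeyEnv)); err != nil {
		fmt.Fprintln(os.Stderr, "configuration problem:", err)
		os.Exit(1)
	}
	fmt.Println(licenseKeyEnv, "is set")
}
```

Note that export and set only affect the current terminal session, so run this check in the same session you will use for the example.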
How it works
Lines 14-22 import the UniPDF packages and other required dependencies.
Lines 24-31 use the init function to authenticate your request with your UNIDOC_LICENSE_API_KEY.
The main function in lines 33-47 validates your input and passes it as an argument to the reconstruct function.
Lines 49-103 define the reconstruct function, which takes the inputPath as an argument. The extractor package extracts text from each page of the PDF, reconstructs the text, and writes it page by page to the output PDF, reconst.pdf.
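To illustrate the input validation step performed by the main function, here is a minimal standard-library sketch. It is not the code from reconstruct_text.go: parseArgs is a hypothetical helper, and the file-extension check is an assumption about what reasonable validation might look like. The call to reconstruct is replaced by a print statement so the sketch runs without UniPDF installed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseArgs is a hypothetical helper that expects exactly one
// argument after the program name and returns it as the input path.
func parseArgs(args []string) (string, error) {
	if len(args) != 2 {
		return "", fmt.Errorf("usage: go run reconstruct_text.go input.pdf")
	}
	inputPath := args[1]
	// Assumed check: accept only paths ending in .pdf (case-insensitive).
	if !strings.EqualFold(filepath.Ext(inputPath), ".pdf") {
		return "", fmt.Errorf("%q does not look like a PDF file", inputPath)
	}
	return inputPath, nil
}

func main() {
	inputPath, err := parseArgs(os.Args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The real example would call reconstruct(inputPath) at this point.
	fmt.Println("would reconstruct text from", inputPath)
}
```

Keeping the argument handling in a small function that returns an error, rather than exiting directly, makes the validation easy to test separately from the PDF processing.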
Run the code
Run this command to reconstruct the text in a PDF. This will also get all the required dependencies to run the program.
go run reconstruct_text.go input.pdf
Sample output
The output is a PDF containing the text from each page of the input file. It is similar to the input PDF, except that it contains text only.