Table Extraction
This guide shows you how to extract tables from one or more PDFs and save them to CSV files. Keep in mind that the PDF format has no native table structure: text is stored as individually positioned characters (glyphs), and lines are drawn as separate shapes.
UniPDF extracts tables from PDF documents by detecting lines, gaps, text proximity, and other properties.
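To see why positions matter, consider the following minimal sketch. It groups positioned text fragments into rows by comparing their vertical coordinates, which is one simple form of the text-proximity idea described above. This is not UniPDF's implementation: the fragment type, the groupRows function, and the tolerance value are hypothetical names chosen for this illustration, and the code uses only the Go standard library.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// fragment is a hypothetical stand-in for a piece of text with a page position.
type fragment struct {
	Text string
	X, Y float64 // position in PDF points; Y grows toward the top of the page
}

// groupRows puts fragments into the same row when their Y coordinate lies
// within tolerance of the first fragment in that row, then orders each row
// from left to right.
func groupRows(frags []fragment, tolerance float64) [][]string {
	// Work on a copy so the caller's slice keeps its order.
	sorted := append([]fragment(nil), frags...)
	// Visit fragments from the top of the page downward.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Y > sorted[j].Y
	})

	var rows [][]fragment
	for _, f := range sorted {
		last := len(rows) - 1
		if last >= 0 && math.Abs(rows[last][0].Y-f.Y) <= tolerance {
			rows[last] = append(rows[last], f)
			continue
		}
		rows = append(rows, []fragment{f})
	}

	out := make([][]string, len(rows))
	for i, row := range rows {
		sort.SliceStable(row, func(a, b int) bool {
			return row[a].X < row[b].X
		})
		for _, f := range row {
			out[i] = append(out[i], f.Text)
		}
	}
	return out
}

func main() {
	// Baselines differ slightly within a row, as they often do in real PDFs.
	frags := []fragment{
		{"Qty", 200, 700.4}, {"Item", 72, 700},
		{"3", 200, 684}, {"Widget", 72, 684.3},
	}
	for _, row := range groupRows(frags, 2.0) {
		fmt.Println(row)
	}
}
```

A real extractor also has to find column boundaries, use drawn ruling lines, and handle cells that span rows or columns, which is why a library is the practical choice.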
Sample input
Before you begin
Get your API key from your UniCloud account.
If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.
Project setup
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this example.
git clone https://github.com/unidoc/unipdf-examples.git
Navigate to the extract folder in the unipdf-examples directory.
cd unipdf-examples/extract
Configure environment variables
Set the UNIDOC_LICENSE_API_KEY environment variable to the API key from your UniCloud account, replacing PUT_YOUR_API_KEY_HERE in the commands below.
Linux/Mac
export UNIDOC_LICENSE_API_KEY=PUT_YOUR_API_KEY_HERE
Windows
set UNIDOC_LICENSE_API_KEY=PUT_YOUR_API_KEY_HERE
How it works
Lines 9-32 import the UniPDF packages and other required dependencies.
Lines 34-41 authenticate your request with your UNIDOC_LICENSE_API_KEY in the init function.
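The environment lookup that such an init function depends on can be sketched with the standard library alone. The apiKeyFromEnv helper below is a hypothetical name introduced for this illustration and is not part of the example program; it only checks that the variable is set and no longer holds the placeholder, and it leaves the actual licensing call to UniPDF.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

const licenseEnvVar = "UNIDOC_LICENSE_API_KEY"

// apiKeyFromEnv is a hypothetical helper. It takes the lookup function as a
// parameter so it can be exercised without touching the real environment.
func apiKeyFromEnv(lookup func(string) (string, bool)) (string, error) {
	key, ok := lookup(licenseEnvVar)
	key = strings.TrimSpace(key)
	if !ok || key == "" {
		return "", errors.New(licenseEnvVar + " is not set")
	}
	if key == "PUT_YOUR_API_KEY_HERE" {
		return "", errors.New(licenseEnvVar + " still holds the placeholder value")
	}
	return key, nil
}

func main() {
	key, err := apiKeyFromEnv(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "license setup:", err)
		return
	}
	// The example program would hand this value to UniPDF's license package
	// at this point; that call is omitted to keep the sketch dependency-free.
	fmt.Printf("found an API key of length %d\n", len(key))
}
```

Checking the value up front gives you a clear message when the variable was exported in a different terminal session, which is a common cause of license errors.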
The main function in lines 47-124 validates your input and passes it as arguments to the extractTables function.
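Input validation of this kind can be sketched with Go's flag package. The flag names -first and -last, the options type, and the parseArgs function are assumptions made for this illustration; check the usage text of pdf_tables.go for the options the example program actually accepts.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// options holds hypothetical settings; 0 means "no page limit given".
type options struct {
	FirstPage, LastPage int
	Inputs              []string
}

// parseArgs validates command-line arguments and returns the parsed options.
func parseArgs(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("pdf_tables", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet; a real tool would print usage
	fs.IntVar(&opts.FirstPage, "first", 0, "first page to process (hypothetical flag)")
	fs.IntVar(&opts.LastPage, "last", 0, "last page to process (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}

	// Everything after the flags is treated as an input PDF path.
	opts.Inputs = fs.Args()
	if len(opts.Inputs) == 0 {
		return opts, errors.New("at least one input PDF is required")
	}
	if opts.FirstPage < 0 || opts.LastPage < 0 {
		return opts, errors.New("page numbers must not be negative")
	}
	if opts.LastPage != 0 && opts.FirstPage > opts.LastPage {
		return opts, fmt.Errorf("first page %d is after last page %d",
			opts.FirstPage, opts.LastPage)
	}
	return opts, nil
}

func main() {
	opts, err := parseArgs([]string{"-first", "2", "-last", "5", "input.pdf"})
	if err != nil {
		fmt.Println("invalid arguments:", err)
		return
	}
	fmt.Printf("pages %d-%d of %v\n", opts.FirstPage, opts.LastPage, opts.Inputs)
}
```

Using a dedicated FlagSet instead of the package-level flags keeps the parsing logic in a plain function that takes a slice of strings, so it can be checked without launching the program.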
Lines 126-459 create the extractTables function, which takes inputPath plus the optional firstPage and lastPage arguments. The extractor package extracts the tables from each page in the PDF, and each table is saved to a CSV file.
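The final step, saving a table as CSV, can be sketched with Go's encoding/csv package. The sketch assumes a table has already been reduced to rows of cell text; the tableToCSV and tableFileName functions and the file naming scheme are hypothetical and may differ from what the example program does.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// tableToCSV renders rows of cell text as CSV. The csv.Writer quotes cells
// that contain commas, quotes, or line breaks, so they survive a round trip.
func tableToCSV(rows [][]string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.WriteAll(rows); err != nil { // WriteAll also flushes the writer
		return "", err
	}
	return sb.String(), nil
}

// tableFileName builds a hypothetical output name from the input's base name,
// the page number, and the table's index on that page.
func tableFileName(base string, page, table int) string {
	return fmt.Sprintf("%s.page%d.table%d.csv", base, page, table)
}

func main() {
	table := [][]string{
		{"Item", "Qty", "Note"},
		{"Widget", "3", `marked "fragile", handle with care`},
	}
	out, err := tableToCSV(table)
	if err != nil {
		fmt.Println("could not encode table:", err)
		return
	}
	fmt.Println("would write", tableFileName("input", 1, 1))
	fmt.Print(out)
}
```

Letting the csv package handle quoting matters here, because text pulled from a PDF cell often contains commas or quotation marks that would otherwise shift columns when the file is opened in a spreadsheet.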
Run the code
Run one of the commands below to extract tables from the PDF. Running it also fetches all the dependencies the program requires.
# Run this to extract tables from all the pages in a PDF
go run pdf_tables.go input.pdf
# Run this to customize the output
go run pdf_tables.go [options] <file1.pdf> <file2.pdf> ...
Sample output
The tables extracted from each page of the PDF are saved in the outcsv folder.