Concurrent Text Extraction
This guide shows how to extract text from multiple PDF documents concurrently, using the document-level concurrency that UniPDF supports.
Before you begin
You should get your API key from your UniCloud account.
If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.
git clone https://github.com/unidoc/unipdf-examples.git
Navigate to the concurrent-processing folder in the unipdf-examples directory.
cd unipdf-examples/concurrent-processing
How it works
Lines 11-22 import the necessary packages from unipdf and the standard Go library. Then the init function in lines 24-31 initializes the package by setting the metered license key using license.SetMeteredKey(os.Getenv(`UNIDOC_LICENSE_API_KEY`)).
The main function in lines 33-53 runs the whole extraction process. In lines 34-37, it checks that the minimum number of command line arguments has been provided. Then, in lines 38-47, it populates inputDocuments and outputDir from the command line arguments. Line 49 starts a timer and line 50 calls the runConcurrent function with those values. Lines 51-52 then stop the timer and display the time taken to process the documents.
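The flow of main described above can be sketched with the standard library alone. This is only an illustration, not the code from the repository: the helper name splitArgs is hypothetical, and treating the last argument as the output directory is an assumption based on the usage line at the end of this guide. The call to runConcurrent is replaced by a placeholder print.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// splitArgs (hypothetical helper) separates command line arguments into
// input PDF paths and an output directory. The last argument is assumed
// to be the output directory.
func splitArgs(args []string) (inputs []string, outputDir string, err error) {
	if len(args) < 2 {
		return nil, "", errors.New("need at least one input PDF and an output directory")
	}
	return args[:len(args)-1], args[len(args)-1], nil
}

func main() {
	inputs, outputDir, err := splitArgs(os.Args[1:])
	if err != nil {
		fmt.Println("Usage: go run concurrent_extraction.go <input1.pdf> <input2.pdf>... <output_dir>")
		return
	}

	start := time.Now()
	// The real program would call runConcurrent(inputs, outputDir) here.
	fmt.Printf("would process %d document(s) into %s\n", len(inputs), outputDir)
	fmt.Printf("Time taken: %s\n", time.Since(start))
}
```

Measuring with time.Since around the single call keeps the timing focused on the extraction itself rather than on argument handling.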
The runConcurrent function defined in lines 56-89 accepts a list of PDF documents and an output directory, specified by the parameters documents and outputDir, respectively, and processes all the documents concurrently.
First, it creates a channel of type map[string]string with a buffer length equal to the number of documents. This enables the goroutines to finish processing the documents without waiting for their results to be read on the other end of the channel. The documents and the channel are then passed to the concurrent processor with the concurrentExtraction(documents, res) call in line 59.
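To see why the buffer length matters, the following minimal sketch sends one value per document into a buffered channel while nothing is reading from it. The function name fillBuffer and the file names are made up for the illustration. With an unbuffered channel the first send would block forever, because no receiver exists yet.

```go
package main

import "fmt"

// fillBuffer (hypothetical) creates a channel with one buffer slot per
// document and sends one single-entry map per document into it.
func fillBuffer(documents []string) chan map[string]string {
	res := make(chan map[string]string, len(documents))
	for _, doc := range documents {
		// No receiver is running, yet each send returns immediately
		// because a free buffer slot exists for every document.
		res <- map[string]string{doc: ""}
	}
	return res
}

func main() {
	res := fillBuffer([]string{"a.pdf", "b.pdf", "c.pdf"})
	// len reports how many values are queued, cap the buffer size.
	fmt.Println("queued:", len(res), "capacity:", cap(res))
}
```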
In lines 60-62, any error returned from the call is logged. Lines 63-70 create the output directory if it doesn’t exist. In lines 73-88, the outer loop reads all the results written to the channel. The inner loop, in lines 75-87, unpacks the key and value of each map, which are the file path and the document content written to the channel. The code in lines 76-86 creates a text file inside the output directory, named after the input file, and writes the corresponding content to it.
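The read-and-write step can be sketched as follows. The names outputPath, writeResults, and demo are hypothetical, and replacing the input file's extension with .txt is an assumption about how the output files are named. The sketch reads exactly one result per document, which matches a channel that receives one map per input.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// outputPath maps an input PDF path to a .txt path inside outputDir.
// Assumption: the output keeps the base name and swaps the extension.
func outputPath(outputDir, inputPath string) string {
	base := filepath.Base(inputPath)
	name := strings.TrimSuffix(base, filepath.Ext(base))
	return filepath.Join(outputDir, name+".txt")
}

// writeResults reads count results from res and writes each to a text file.
func writeResults(res <-chan map[string]string, count int, outputDir string) error {
	// MkdirAll succeeds if the directory already exists.
	if err := os.MkdirAll(outputDir, 0o755); err != nil {
		return err
	}
	for i := 0; i < count; i++ { // outer loop: one read per document
		for filePath, text := range <-res { // inner loop: unpack key and value
			if err := os.WriteFile(outputPath(outputDir, filePath), []byte(text), 0o644); err != nil {
				return err
			}
		}
	}
	return nil
}

// demo writes one fake result into a temporary directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "extract-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	res := make(chan map[string]string, 1)
	res <- map[string]string{"docs/report.pdf": "hello"}
	if err := writeResults(res, 1, filepath.Join(dir, "out")); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(dir, "out", "report.txt"))
	return string(data), err
}

func main() {
	text, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```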
The concurrentExtraction function, which is called by the previous function, is where the concurrency is implemented. It iterates through the PDF documents and, for each one, launches a goroutine that extracts the text from the document by calling the extractSingleDoc function. In the closure that begins at line 96, the result is first obtained in line 97. Then a temporary map is created in lines 101-103, with the file path as the key and the extracted text as the value, using:
temp := map[string]string{
filePath: result,
}
Then this temp map is written to the channel in line 104.
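The pattern described above, one goroutine per document with each goroutine sending a single-entry map, might look roughly like this. It is a sketch, not the repository code: extractFunc, fanOut, collect, and fakeExtract are hypothetical names, fakeExtract stands in for extractSingleDoc, and sending an empty string when extraction fails is an assumption made so that the reader still receives one result per document.

```go
package main

import (
	"fmt"
	"strings"
)

// extractFunc stands in for extractSingleDoc in this sketch.
type extractFunc func(filePath string) (string, error)

// fanOut starts one goroutine per document. Each goroutine sends exactly
// one single-entry map, so a caller can read len(documents) results.
func fanOut(documents []string, res chan<- map[string]string, extract extractFunc) {
	for _, doc := range documents {
		// Passing doc as an argument gives each goroutine its own copy.
		go func(filePath string) {
			result, err := extract(filePath)
			if err != nil {
				// Assumption: report the failure and send empty text.
				fmt.Printf("failed to extract %s: %v\n", filePath, err)
				result = ""
			}
			temp := map[string]string{
				filePath: result,
			}
			res <- temp
		}(doc)
	}
}

// collect runs fanOut and merges the per-document maps into one map.
func collect(documents []string, extract extractFunc) map[string]string {
	res := make(chan map[string]string, len(documents))
	fanOut(documents, res, extract)

	merged := make(map[string]string, len(documents))
	for range documents {
		for filePath, text := range <-res {
			merged[filePath] = text
		}
	}
	return merged
}

// fakeExtract pretends the text of a document is its upper-cased path.
func fakeExtract(filePath string) (string, error) {
	return strings.ToUpper(filePath), nil
}

func main() {
	fmt.Println(collect([]string{"a.pdf", "b.pdf"}, fakeExtract))
}
```

Keying each result by file path means the order in which goroutines finish does not matter when the results are read back.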
Finally, the extractSingleDoc function that is called for each document is defined in lines 112-142. This function uses the model.NewPdfReaderFromFile function to get a pointer to a new model.PdfReader object for the provided PDF document. Then the number of pages is obtained in line 117 using pdfReader.GetNumPages().
Lines 122-139 iterate through the range of page numbers and extract the contents using:
page, err := pdfReader.GetPage(pageNum)
ex, err := extractor.New(page)
text, err := ex.ExtractText()
result += text
The error handling has been hidden in the above snippet for simplicity. Finally, the function returns the result and a nil error in line 141.
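For readers curious about the error handling that the snippet omits, here is one way the page loop could be structured. To keep the sketch runnable without UniPDF, the reader and extractor are replaced by a hypothetical pageSource interface with an in-memory fakeDoc implementation. None of these names come from the library. The loop assumes page numbers start at 1, and it uses strings.Builder rather than repeated string concatenation.

```go
package main

import (
	"fmt"
	"strings"
)

// pageSource is a hypothetical stand-in for the PDF reader and extractor.
type pageSource interface {
	NumPages() (int, error)
	PageText(pageNum int) (string, error)
}

// extractAll concatenates the text of every page, stopping at the first error.
func extractAll(src pageSource) (string, error) {
	numPages, err := src.NumPages()
	if err != nil {
		return "", fmt.Errorf("counting pages: %w", err)
	}

	var result strings.Builder
	for pageNum := 1; pageNum <= numPages; pageNum++ {
		text, err := src.PageText(pageNum)
		if err != nil {
			return "", fmt.Errorf("extracting page %d: %w", pageNum, err)
		}
		result.WriteString(text)
	}
	return result.String(), nil
}

// fakeDoc is an in-memory pageSource used only for this demonstration.
type fakeDoc []string

func (d fakeDoc) NumPages() (int, error) { return len(d), nil }

func (d fakeDoc) PageText(pageNum int) (string, error) {
	if pageNum < 1 || pageNum > len(d) {
		return "", fmt.Errorf("page %d out of range", pageNum)
	}
	return d[pageNum-1], nil
}

func main() {
	text, err := extractAll(fakeDoc{"first page\n", "second page\n"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(text)
}
```

Wrapping each error with the page number makes it easier to tell which page of which document failed when many documents are processed at once.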
Run the code
To run the example, use the command shown below, replacing the placeholders with the paths to your input PDF files and your output directory.
go run concurrent_extraction.go <input1.pdf> <input2.pdf>... <output_dir>