Concurrent Text Extraction

This guide will demonstrate how to extract text from multiple documents concurrently by taking advantage of the document-level concurrency provided in UniPDF.

Before you begin

You should get your API key from your UniCloud account.

If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.

Clone the project repository

In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.

git clone https://github.com/unidoc/unipdf-examples.git

Navigate to the concurrent-processing folder in the unipdf-examples directory.

cd unipdf-examples/concurrent-processing

How it works

Lines 11-22 import the necessary packages from unipdf and the standard Go library. Then the init function in lines 24-31 initializes the package by setting the metered license key using license.SetMeteredKey(os.Getenv(`UNIDOC_LICENSE_API_KEY`)).

The main function in lines 33-53 runs the whole extraction process. In lines 34-37, it checks that the minimum number of command-line arguments has been provided. Then, in lines 38-47, it populates inputDocuments and outputDir from the values given on the command line. Line 49 starts a timer and line 50 calls the runConcurrent function with those arguments. Lines 51-52 then stop the timer and display the time taken to process the documents, respectively.
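The argument handling and timing described above can be sketched with the standard library alone. This is a minimal sketch, not the code from the repository: splitArgs is a hypothetical helper, and it assumes the last argument is the output directory, as the usage line at the end of this guide suggests.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// splitArgs is a hypothetical helper (not part of the example repository).
// It treats the last argument as the output directory and everything
// before it as input PDF paths.
func splitArgs(args []string) (inputs []string, outputDir string, err error) {
	if len(args) < 2 {
		return nil, "", errors.New("need at least one input file and an output directory")
	}
	last := len(args) - 1
	return args[:last], args[last], nil
}

func main() {
	inputs, outputDir, err := splitArgs(os.Args[1:])
	if err != nil {
		fmt.Println("Usage: go run concurrent_extraction.go <input1.pdf> <input2.pdf>... <output_dir>")
		os.Exit(1)
	}

	start := time.Now()
	// The real example calls runConcurrent(inputs, outputDir) at this point.
	fmt.Printf("would process %d document(s) into %s\n", len(inputs), outputDir)
	fmt.Printf("Time taken: %s\n", time.Since(start))
}
```

Keeping the argument split in its own function makes the minimum-argument check easy to exercise without running the whole program.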

The runConcurrent function, defined in lines 56-89, accepts a list of PDF documents and an output directory through the parameters documents and outputDir, respectively, and processes all the documents concurrently.

First, it creates a channel of type map[string]string with a buffer length equal to the number of documents. This enables the goroutines to write their results without waiting for them to be read on the other end of the channel. Then the documents and the channel are passed to the concurrent processor with the concurrentExtraction(documents, res) call in line 59.
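To see why the buffer size matters, the following minimal sketch fills such a channel from a single goroutine with no reader running. The function name sendAll and the placeholder text are assumptions made for illustration; they do not appear in the example repository.

```go
package main

import "fmt"

// sendAll (a hypothetical name) writes one single-entry map per document to a
// channel whose buffer is as large as the document count. No reader is
// running while the sends happen.
func sendAll(documents []string) chan map[string]string {
	res := make(chan map[string]string, len(documents))
	for _, doc := range documents {
		// This send does not block because the buffer still has a free slot.
		res <- map[string]string{doc: "text of " + doc}
	}
	return res
}

func main() {
	res := sendAll([]string{"a.pdf", "b.pdf"})
	// len reports the number of queued values, cap the buffer size.
	fmt.Println(len(res), cap(res)) // prints "2 2"
}
```

With an unbuffered channel, the first send in sendAll would block forever because nothing is receiving yet.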

In lines 60-62, any error returned from the call is logged. Lines 63-70 create the output directory if it doesn’t exist. In lines 73-88, the outer loop reads all the results written to the channel. The inner loop, in lines 75-87, unpacks the key and value of each map, which are the file path and the document content written to the channel. The code in lines 76-86 creates text files inside the output directory with the same names as the input files and writes the respective content to them.

The concurrentExtraction function, which is called by the previous function, is where the concurrency is implemented. It iterates over the PDF documents and launches a goroutine for each one, which extracts the text from the document by calling the extractSingleDoc function. In the closure that starts at line 96, the result is first obtained in line 97. Then a temporary map is created in lines 101-103 with the file path as the key and the extracted text as the value, using:

  temp := map[string]string{
      filePath: result,
  }

Then this temp map is written to the channel in line 104.
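The overall fan-out pattern can be sketched without UniPDF. In the sketch below, fanOut and extractFunc are hypothetical names: extractFunc stands in for extractSingleDoc, and the use of sync.WaitGroup followed by close is an assumption about how completion is signalled, not a description of the repository code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// extractFunc stands in for extractSingleDoc; the real function reads a PDF.
type extractFunc func(filePath string) (string, error)

// fanOut launches one goroutine per document, writes each result to res as a
// single-entry map, then closes res once every goroutine has finished.
// It assumes res has a buffer of at least len(documents).
func fanOut(documents []string, res chan map[string]string, extract extractFunc) {
	var wg sync.WaitGroup
	for _, doc := range documents {
		wg.Add(1)
		go func(filePath string) {
			defer wg.Done()
			result, err := extract(filePath)
			if err != nil {
				fmt.Printf("failed to extract %s: %v\n", filePath, err)
				return
			}
			temp := map[string]string{
				filePath: result,
			}
			res <- temp
		}(doc)
	}
	wg.Wait()
	close(res)
}

func main() {
	docs := []string{"a.pdf", "b.pdf", "c.pdf"}
	res := make(chan map[string]string, len(docs))

	// A fake extractor so the sketch runs without any PDF files.
	fanOut(docs, res, func(p string) (string, error) {
		return strings.ToUpper(p), nil
	})

	// Results arrive in whatever order the goroutines finished.
	for m := range res {
		for path, text := range m {
			fmt.Println(path, "->", text)
		}
	}
}
```

Because the channel buffer holds every result, fanOut can wait for all goroutines before anything is read. Closing the channel afterwards lets the reader use a plain range loop.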

Finally, the extractSingleDoc function that is called for each document is defined in lines 112-142. It uses the model.NewPdfReaderFromFile function to get a pointer to a new model.PdfReader object for the provided PDF document. Then the number of pages is obtained in line 117 using pdfReader.GetNumPages(). Lines 122-139 iterate through the page numbers and extract the contents of each page using:

page, err := pdfReader.GetPage(pageNum)
ex, err := extractor.New(page)
text, err := ex.ExtractText()
result += text

Error handling has been omitted from the above snippet for simplicity. Finally, the function returns the result and a nil error in line 141.

Run the code

To run the example, use the command shown below, substituting your own input files and output directory for the placeholders.

go run concurrent_extraction.go <input1.pdf> <input2.pdf>... <output_dir>

Got any questions?

We're here to help you.