Detect Scanned PDF Document

This guide explains how to determine whether a given PDF is likely a scanned document.

Sample Input

Sample PDF file

Before you begin

You should get your API key from your UniCloud account.

If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.

Clone the project repository

In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.

git clone https://github.com/unidoc/unipdf-examples.git

Navigate to the analysis folder in the unipdf-examples directory.

cd unipdf-examples/analysis

How it works

The import section in lines 9-14 imports the necessary UniPDF packages and other libraries. The init function in lines 16-23 loads the metered license key before the program runs.

The main function is defined in lines 25-37. It first checks the number of command line arguments in lines 26-29. The for loop in lines 31-36 then iterates through each inputPath provided on the command line and calls detectScanned(inputPath) to check whether that file is a scanned PDF document.

The detectScanned function in lines 39-64 takes the path to a file and determines whether the file is scanned. In line 46, the number of pages is obtained from the PdfReader using pdfReader.GetNumPages(). In line 51, the count of each object type is obtained using pdfReader.Inspect(). Finally, line 57 checks the number of font objects. If that number is 0 or 1, the document is assumed to contain no text objects and is reported as scanned. Otherwise, it is reported as not scanned.
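The font-count check described above can be isolated as a small pure function. The sketch below runs without UniPDF and rests on stated assumptions: the object counts are modeled as a map from type name to count, the "Font" key is assumed to hold the font count, and isLikelyScanned, report, and the "not scanned" message are hypothetical. Only the SCANNED line follows the sample output in this guide.

```go
package main

import "fmt"

// isLikelyScanned applies the font-count heuristic: fewer than two font
// objects suggests the pages hold images rather than text.
// The "Font" key is an assumption about how the counts are labeled.
func isLikelyScanned(objTypes map[string]int) bool {
	return objTypes["Font"] < 2 // a missing key reads as 0
}

// report formats one result line. The SCANNED text follows the guide's
// sample output; the other message is a hypothetical placeholder.
func report(path string, numPages int, objTypes map[string]int) string {
	if isLikelyScanned(objTypes) {
		return fmt.Sprintf("%s (%d pages) - SCANNED!", path, numPages)
	}
	return fmt.Sprintf("%s (%d pages) - not scanned", path, numPages)
}

func main() {
	// Hand-written counts standing in for real inspection results.
	fmt.Println(report("sample.pdf", 1, map[string]int{"XObject": 1}))
	fmt.Println(report("report.pdf", 3, map[string]int{"Font": 4, "XObject": 2}))
}
```

Because this is a heuristic, a text PDF that uses a single font would also be flagged, and a scanned PDF with an added text layer would not.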

Run the code

Run the code using the following command, replacing input.pdf with the path to your PDF file:

go run pdf_detect_scanned.go input.pdf

Sample output

sample.pdf (1 pages) - SCANNED!
