Summarize Images

This guide demonstrates how to summarize the image information in one or more PDF documents using UniPDF.

Before you begin

You should get your API key from your UniCloud account.

If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.

Clone the project repository

In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.

git clone https://github.com/unidoc/unipdf-examples.git

Navigate to the analysis folder in the unipdf-examples directory.

cd unipdf-examples/analysis

How it works

The import section imports the necessary UniPDF packages and other libraries. The init function in lines 29-36 loads the API key stored in the system environment and sets the license using license.SetMeteredKey(os.Getenv(`UNIDOC_LICENSE_API_KEY`)).
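
Since an empty key only surfaces later as a licensing error, it can help to check the environment variable before setting the license. The following is a minimal sketch using only the standard library; licenseKeyFromEnv is a hypothetical helper that is not part of the example program or of UniPDF.

```go
package main

import (
	"fmt"
	"os"
)

// licenseKeyFromEnv is a hypothetical helper (not part of the example program).
// It reads the named environment variable and reports an error when it is empty.
func licenseKeyFromEnv(name string) (string, error) {
	key := os.Getenv(name)
	if key == "" {
		return "", fmt.Errorf("environment variable %s is not set", name)
	}
	return key, nil
}

func main() {
	key, err := licenseKeyFromEnv("UNIDOC_LICENSE_API_KEY")
	if err != nil {
		fmt.Println("cannot set license:", err)
		return
	}
	// In the example program, the key is passed to license.SetMeteredKey at this point.
	fmt.Printf("found a license key of %d characters\n", len(key))
}
```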

The main function starts at line 40. Lines 41-50 parse the command line arguments. The code in lines 59-65 checks the value of the trace command line argument and sets the log level accordingly. Line 67 gets the list of input files, i.e. the corpus, and makes sure that the number of files is below 1000. Lines 71-78 sort the input files by size, breaking ties by file name, using:

sort.Slice(corpus, func(i, j int) bool {
  fi, fj := corpus[i], corpus[j]
  si, sj := fileSizeMB(fi), fileSizeMB(fj)
  if si != sj {
    return si < sj
  }
  return fi < fj
})
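
To try this comparison without any PDF files, the sketch below applies the same ordering to an in-memory list. The sizes map is a hypothetical stand-in for calling fileSizeMB on each path, and sortBySize is a name chosen for this illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// sortBySize orders paths by ascending size and breaks ties by path name.
// sizes is a hypothetical stand-in for calling fileSizeMB on each path.
func sortBySize(paths []string, sizes map[string]float64) {
	sort.Slice(paths, func(i, j int) bool {
		si, sj := sizes[paths[i]], sizes[paths[j]]
		if si != sj {
			return si < sj
		}
		return paths[i] < paths[j]
	})
}

func main() {
	sizes := map[string]float64{"b.pdf": 2.5, "a.pdf": 2.5, "c.pdf": 0.3}
	corpus := []string{"b.pdf", "a.pdf", "c.pdf"}
	sortBySize(corpus, sizes)
	fmt.Println(corpus) // prints [c.pdf a.pdf b.pdf]
}
```

The tie-break on the name keeps the order deterministic when two files have the same size.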

The code in lines 81-93 iterates through each file and collects the image info into a map. This is done by calling the fileImages(inputPath) function for each inputPath. Line 95 uses showSummary(corpus, corpusInfo) to display the image summary on the standard output. Line 96 saves the summary to a CSV file using saveAsCsv(csvPath, corpus, corpusInfo, doSort, byDoc, noDims).

The fileImages function is defined in lines 100-138. This function returns a list of imageInfo for a given input file. In lines 116-133 of this function, the for loop iterates through each page and builds the list of imageInfo by calling the pageImages function as follows:

for pageNum := 1; pageNum <= numPages; pageNum++ {
  page, err := pdfReader.GetPage(pageNum)
  showError(nil, err, "pdfReader.GetPage failed: page %d", pageNum)
  if err != nil {
    continue
  }

  // List images on the page.
  pageInfo, err := pageImages(page)
  if err != nil || len(pageInfo) == 0 {
    continue
  }
  for i := range pageInfo {
    pageInfo[i].path = inputPath
    pageInfo[i].page = pageNum
  }
  fileInfo = append(fileInfo, pageInfo...)
}

This list of image information is returned at line 137.

The pageImages function, defined in lines 141-147, returns a list of the images detected in the provided page. This is done by getting all the page content streams and then using contentStreamImages(contents, page.Resources) to get the list of images.

The contentStreamImages function, defined in lines 152-283, parses the provided content stream, iterates through each content stream operation, and collects all the images. Each operation is checked to see whether its operator is BI (an inline image) or Do (a reference to an XObject). If it is BI, the image is obtained from the first parameter as follows:

iimg, ok := op.Params[0].(*contentstream.ContentStreamInlineImage)
if !ok {
  continue
}

var width, height, cpts, bpc int
img, err := iimg.ToImage(resources)

If the operator is Do, the image is stored as an XObject in the page resources and is obtained as follows:

name := op.Params[0].(*core.PdfObjectName)

// Only process each one once.
if _, has := processedXObjects[string(*name)]; has {
  continue
}
processedXObjects[string(*name)] = true

_, xtype := resources.GetXObjectByName(*name)

After this, the XObject is processed according to its type, xtype. If the type is model.XObjectTypeImage, it is processed as follows:

ximg, err := resources.GetXObjectImageByName(*name)
showError(errors, err, "GetXObjectImageByName failed: %q ", *name)
if err != nil {
  continue
}

var width, height, cpts, bpc int
img, err := ximg.ToImage()

However, if its type is model.XObjectTypeForm, it is processed using the following code:

// Go through the XObject Form content stream.
xform, err := resources.GetXObjectFormByName(*name)
showError(errors, err, "GetXObjectFormByName failed: %q", *name)
if err != nil {
  continue
}
formContent, err := xform.GetContentStream()
showError(errors, err, "GetContentStream failed: %q", *name)
if err != nil {
  continue
}

// Process the content stream in the Form object too.
formResources := xform.Resources
if formResources == nil {
  formResources = resources
}
formDescs, err := contentStreamImages(string(formContent), formResources)
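
The overall control flow (inline images, image XObjects processed once each, and recursion into form XObjects) can be sketched with a toy model that does not use UniPDF at all. Everything in it is hypothetical: the op and xobject types, the single shared resources map, and the choice to share one processed set across the recursion, which is a simplification made here so that a form referencing itself cannot loop forever.

```go
package main

import "fmt"

// op is a hypothetical, simplified content stream operation.
type op struct {
	operator string // "BI", "Do", or any other operator
	name     string // XObject name, used only by "Do"
}

// xobject is a hypothetical resource entry: an image, or a form with its own operations.
type xobject struct {
	isForm bool
	ops    []op
}

// collectImages returns one label per image reachable from ops.
func collectImages(ops []op, resources map[string]xobject, processed map[string]bool) []string {
	var found []string
	for _, o := range ops {
		switch o.operator {
		case "BI":
			found = append(found, "inline")
		case "Do":
			if processed[o.name] {
				continue // Only process each XObject once.
			}
			processed[o.name] = true
			x, ok := resources[o.name]
			if !ok {
				continue
			}
			if x.isForm {
				// A form carries its own content stream, so recurse into it.
				found = append(found, collectImages(x.ops, resources, processed)...)
			} else {
				found = append(found, o.name)
			}
		}
	}
	return found
}

func main() {
	resources := map[string]xobject{
		"Im1": {},
		"Im2": {},
		"Fm1": {isForm: true, ops: []op{{"Do", "Im2"}, {"BI", ""}}},
	}
	page := []op{{"Do", "Im1"}, {"q", ""}, {"Do", "Fm1"}, {"Do", "Im1"}}
	fmt.Println(collectImages(page, resources, map[string]bool{})) // prints [Im1 Im2 inline]
}
```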

The imageInfo type, which holds all the details of a given image, is defined in lines 284-295. Its String() method, defined in lines 297-312, returns the string representation of an imageInfo. The asStrings() method, defined in lines 314-335, returns a slice of strings, one for each field of the imageInfo object.

The header variable in line 337 contains the list of strings used as column headers for the CSV file.

The asList function is defined in lines 350-378. It collects the imageInfo of each file in the corpus and returns everything as one big list. It optionally sorts the images based on their dimensions.

The coallesce function in lines 380-403 builds a list of unique imageInfo entries from the full list. The number of times each unique imageInfo occurs is recorded in its count field. This count is updated in lines 390-396 as follows:

k := info.String()
if _, ok := uniques[k]; !ok {
  uniques[k] = info
}
info := uniques[k]
info.count++
uniques[k] = info
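
The same idea can be tried in isolation with the sketch below. The imageInfo type here is a reduced, hypothetical version with only a few fields, and coalesce is a stand-alone rewrite for illustration. Returning the entries in first-seen order is a choice made for this sketch and not necessarily what the example program does.

```go
package main

import "fmt"

// imageInfo is a reduced, hypothetical version of the type used in the example.
type imageInfo struct {
	filter string
	cpts   int
	bpc    int
	count  int
}

// String returns a key that is identical for images with identical properties.
// The count field is left out so that it does not affect the key.
func (info imageInfo) String() string {
	return fmt.Sprintf("%s cpts=%d bpc=%d", info.filter, info.cpts, info.bpc)
}

// coalesce merges entries that share a key and records how often each one occurred.
func coalesce(infoList []imageInfo) []imageInfo {
	uniques := map[string]imageInfo{}
	var order []string // keys in first-seen order
	for _, info := range infoList {
		k := info.String()
		u, ok := uniques[k]
		if !ok {
			u = info
			u.count = 0
			order = append(order, k)
		}
		u.count++
		uniques[k] = u
	}
	out := make([]imageInfo, 0, len(order))
	for _, k := range order {
		out = append(out, uniques[k])
	}
	return out
}

func main() {
	images := []imageInfo{
		{filter: "DCTDecode", cpts: 3, bpc: 8},
		{filter: "FlateDecode", cpts: 1, bpc: 8},
		{filter: "DCTDecode", cpts: 3, bpc: 8},
	}
	for _, u := range coalesce(images) {
		fmt.Printf("%d x %s\n", u.count, u)
	}
}
```

Because map values of struct type cannot be modified in place in Go, the entry is copied out, incremented, and stored back, which is the same pattern as in the excerpt above.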

The saveAsCsv function is used to save the corpusInfo to a CSV file. In line 415, a new CSV writer is instantiated using csv.NewWriter(f). The corpusInfo object is turned into a list of imageInfo using asList(corpus, corpusInfo, doSort, byDoc, noRes) in line 424. The for loop in lines 426-432 iterates through each imageInfo and writes it to the CSV file.
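
The writing step can be sketched with the standard encoding/csv package as follows. This is a simplified illustration, not the code of the example program: the imageInfo type has only three hypothetical fields, and the rows are rendered into a string instead of a file so that the result is easy to inspect.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// imageInfo is a reduced, hypothetical version of the type used in the example.
type imageInfo struct {
	path   string
	page   int
	filter string
}

// asStrings returns one string per field, in the same order as the header.
func (info imageInfo) asStrings() []string {
	return []string{info.path, strconv.Itoa(info.page), info.filter}
}

// writeCsv writes the header and one row per imageInfo to w.
func writeCsv(w io.Writer, header []string, infoList []imageInfo) error {
	cw := csv.NewWriter(w)
	if err := cw.Write(header); err != nil {
		return err
	}
	for _, info := range infoList {
		if err := cw.Write(info.asStrings()); err != nil {
			return err
		}
	}
	// Flush pushes buffered rows to w; Error reports any write failure.
	cw.Flush()
	return cw.Error()
}

// csvString renders the rows in memory; the example program writes to a file instead.
func csvString(header []string, infoList []imageInfo) (string, error) {
	var sb strings.Builder
	err := writeCsv(&sb, header, infoList)
	return sb.String(), err
}

func main() {
	header := []string{"path", "page", "filter"}
	infoList := []imageInfo{{"a.pdf", 1, "DCTDecode"}, {"a.pdf", 4, "FlateDecode"}}
	out, err := csvString(header, infoList)
	if err != nil {
		fmt.Println("writing CSV failed:", err)
		return
	}
	fmt.Print(out)
}
```

Checking cw.Error() after Flush matters because csv.Writer buffers its output, so a failed write may only become visible at that point.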

The showSummary function, defined in lines 437-449, prints the summary of corpusInfo. The functions boolSummary, intSummary and stringSummary are used to print the summary information.

The functions boolKeys, intKeys and stringKeys are used to sort the keys of the counts passed as a parameter, based on their count values.
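
A sketch of such a key-sorting helper for string keys is shown below. It assumes the ordering seen in the sample output (highest count first, ties in alphabetical order). The actual functions in the example program may differ in detail.

```go
package main

import (
	"fmt"
	"sort"
)

// stringKeys returns the keys of counts ordered by descending count.
// Ties are broken alphabetically so that the result is deterministic.
func stringKeys(counts map[string]int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		ci, cj := counts[keys[i]], counts[keys[j]]
		if ci != cj {
			return ci > cj
		}
		return keys[i] < keys[j]
	})
	return keys
}

func main() {
	// Counts taken from the "filter" section of the sample output in this guide.
	byImage := map[string]int{"Raw": 1, "FlateDecode": 39, "CCITTFaxDecode": 4, "DCTDecode": 128}
	for _, k := range stringKeys(byImage) {
		fmt.Printf("%18s\t%6d\n", k, byImage[k])
	}
}
```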

The boolCounts function counts the values of a boolean field, chosen by the selector function, and returns the counts both by file and by image.

The intCounts function in lines 567-582 counts the frequency of an integer field across the corpusInfo and returns the counts by image and by file. The specific integer field is chosen by the selector function.

The stringCounts function in lines 584-599 counts the occurrences of a string field across the corpusInfo and returns the counts by image and by file.
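
The difference between counting by image and counting by file can be illustrated with the following sketch. It assumes that corpusInfo maps a file path to the list of its images and that the selector picks one field of an imageInfo. The return order and the reduced imageInfo type are assumptions made for this illustration.

```go
package main

import "fmt"

// imageInfo is a reduced, hypothetical version of the type used in the example.
type imageInfo struct {
	filter string
}

// stringCounts counts the values chosen by selector.
// byImage holds the number of images with each value.
// byFile holds the number of files with at least one image with each value.
func stringCounts(corpusInfo map[string][]imageInfo, selector func(imageInfo) string) (byImage, byFile map[string]int) {
	byImage = map[string]int{}
	byFile = map[string]int{}
	for _, fileInfo := range corpusInfo {
		seen := map[string]bool{} // values present in this file
		for _, info := range fileInfo {
			v := selector(info)
			byImage[v]++
			seen[v] = true
		}
		for v := range seen {
			byFile[v]++
		}
	}
	return byImage, byFile
}

func main() {
	corpusInfo := map[string][]imageInfo{
		"a.pdf": {{"DCTDecode"}, {"DCTDecode"}, {"FlateDecode"}},
		"b.pdf": {{"DCTDecode"}},
	}
	byImage, byFile := stringCounts(corpusInfo, func(info imageInfo) string { return info.filter })
	fmt.Println("by image:", byImage["DCTDecode"], byImage["FlateDecode"]) // prints by image: 3 1
	fmt.Println("by file: ", byFile["DCTDecode"], byFile["FlateDecode"])   // prints by file:  2 1
}
```

Passing the field as a selector function means one counting routine can serve every string field, which is presumably why the example program uses the same pattern for boolean and integer fields.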

The showError function in lines 603-616 prints a formatted error message for an error, unless that error has already been reported.
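
One way to report each error only once is to remember the messages already printed, as in the sketch below. The parameter order follows the calls shown earlier in this guide. The map type of the first parameter, the boolean return value and the errDemo variable are assumptions added here for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errDemo is a hypothetical error used only to exercise the sketch.
var errDemo = errors.New("demo failure")

// showError prints the formatted message and err to standard error,
// unless err is nil or the same message was already reported.
// A nil reported map disables the duplicate check.
// It returns true when something was printed (an addition for this sketch).
func showError(reported map[string]bool, err error, format string, args ...interface{}) bool {
	if err == nil {
		return false
	}
	msg := fmt.Sprintf(format, args...)
	if reported != nil {
		if reported[msg] {
			return false
		}
		reported[msg] = true
	}
	fmt.Fprintf(os.Stderr, "%s: %v\n", msg, err)
	return true
}

func main() {
	reported := map[string]bool{}
	showError(reported, errDemo, "GetXObjectImageByName failed: %q", "Im1")
	showError(reported, errDemo, "GetXObjectImageByName failed: %q", "Im1") // suppressed
	showError(nil, errDemo, "pdfReader.GetPage failed: page %d", 3)         // always printed
}
```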

The sumVals function counts the number of imageInfo across the corpusInfo provided in the function parameter.

The percentage function calculates and formats a percentage, given a total and a part of it. The fileSizeMB function in lines 635-641 returns the size of the given file in MB. The makeUsage function updates flag.Usage to include a new usage message.
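
The two small helpers could look roughly like the sketch below. The output format of percentage is modelled on the sample output in this guide. The parameter order, the handling of a zero total, the use of 1024*1024 bytes per MB and the -1 return value on error are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// percentage formats part as a share of total, e.g. "4 of 172 ( 2.3%)".
// A zero total is reported as 0% to avoid dividing by zero.
func percentage(part, total int) string {
	pct := 0.0
	if total != 0 {
		pct = 100.0 * float64(part) / float64(total)
	}
	return fmt.Sprintf("%d of %d (%4.1f%%)", part, total, pct)
}

// fileSizeMB returns the size of the file at path in MB (1024*1024 bytes),
// or -1 if the file cannot be inspected.
func fileSizeMB(path string) float64 {
	fi, err := os.Stat(path)
	if err != nil {
		return -1
	}
	return float64(fi.Size()) / 1024.0 / 1024.0
}

func main() {
	fmt.Println(percentage(128, 172)) // prints 128 of 172 (74.4%)
	fmt.Println(percentage(4, 172))   // prints 4 of 172 ( 2.3%)
	fmt.Printf("%.1f MB\n", fileSizeMB("PDF32000_2008.pdf"))
}
```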

Run the code

Run the code using the following command:

go run pdf_summarize_images.go ~/testdata/*.pdf

Sample Output

go run pdf_summarize_images.go PDF32000_2008.pdf 

Here is the output of the above command.

0 of 1 "PDF32000_2008.pdf" 21.4 MB, 756 pages, 172 images, 1.1 sec
=================================================
Totals: 1 of 1 files contain images.    172 images
-----------------------------------------
inline
By image: 1
             false	   172 of 172 (100.0%)
By file: 1
             false	     1 of 1 (100.0%)
-----------------------------------------
filter
By image: 4
         DCTDecode	   128 of 172 (74.4%)
       FlateDecode	    39 of 172 (22.7%)
    CCITTFaxDecode	     4 of 172 ( 2.3%)
               Raw	     1 of 172 ( 0.6%)
By file: 4
    CCITTFaxDecode	     1 of 1 (100.0%)
         DCTDecode	     1 of 1 (100.0%)
       FlateDecode	     1 of 1 (100.0%)
               Raw	     1 of 1 (100.0%)
-----------------------------------------
color
By image: 4
          ICCBased	   102 of 172 (59.3%)
           Indexed	    39 of 172 (22.7%)
        Separation	    25 of 172 (14.5%)
        DeviceGray	     6 of 172 ( 3.5%)
By file: 4
        DeviceGray	     1 of 1 (100.0%)
          ICCBased	     1 of 1 (100.0%)
           Indexed	     1 of 1 (100.0%)
        Separation	     1 of 1 (100.0%)
-----------------------------------------
cpts
By image: 2
                 3	   102 of 172 (59.3%)
                 1	    70 of 172 (40.7%)
By file: 2
                 1	     1 of 1 (100.0%)
                 3	     1 of 1 (100.0%)
-----------------------------------------
bpc
By image: 2
                 8	   168 of 172 (97.7%)
                 1	     4 of 172 ( 2.3%)
By file: 2
                 1	     1 of 1 (100.0%)
                 8	     1 of 1 (100.0%)
