Summarize Images
This guide demonstrates how to summarize image information of a PDF document using UniPDF.
Before you begin
You should get your API key from your UniCloud account.
If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.
git clone https://github.com/unidoc/unipdf-examples.git
Navigate to the analysis folder in the unipdf-examples directory.
cd unipdf-examples/analysis
How it works
The import section imports the necessary UniPDF packages and other libraries.
The init function in lines 29-36 loads the API key stored in the system environment and sets the license using license.SetMeteredKey(os.Getenv(`UNIDOC_LICENSE_API_KEY`)).
The main function starts at line 40. Lines 41-50 parse the command-line arguments. The code in lines 59-65 checks the value of trace from the command-line arguments and sets the log level. Line 67 gets the list of files (the corpus) and makes sure that the number of files is below 1000. Lines 71-78 sort the input files by size, breaking ties by file name, using:
sort.Slice(corpus, func(i, j int) bool {
	fi, fj := corpus[i], corpus[j]
	si, sj := fileSizeMB(fi), fileSizeMB(fj)
	if si != sj {
		return si < sj
	}
	return fi < fj
})
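The comparator above can be tried in isolation. The following is a minimal sketch rather than the example's own code: sortBySize and the sizes lookup table are hypothetical stand-ins for the corpus slice and the fileSizeMB helper, so that the tie-break on file name is easy to see.

```go
package main

import (
	"fmt"
	"sort"
)

// sortBySize orders paths by ascending size and breaks ties by path name.
// sizes is a hypothetical lookup table standing in for fileSizeMB.
func sortBySize(paths []string, sizes map[string]float64) {
	sort.Slice(paths, func(i, j int) bool {
		pi, pj := paths[i], paths[j]
		si, sj := sizes[pi], sizes[pj]
		if si != sj {
			return si < sj
		}
		return pi < pj
	})
}

func main() {
	sizes := map[string]float64{"b.pdf": 2.5, "a.pdf": 2.5, "c.pdf": 0.1}
	paths := []string{"b.pdf", "a.pdf", "c.pdf"}
	sortBySize(paths, sizes)
	fmt.Println(paths) // [c.pdf a.pdf b.pdf]
}
```

Sorting on size first and name second gives a deterministic processing order even when several files have the same size.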
The code in lines 81-93 iterates through each file and collects the image info into a map. This is done by calling the fileImages(inputPath) function for each inputPath. Line 95 uses showSummary(corpus, corpusInfo) to display the image summary on the standard output. Line 96 saves the summary to a CSV file using saveAsCsv(csvPath, corpus, corpusInfo, doSort, byDoc, noDims).
The fileImages function is defined in lines 100-138. This function returns a list of imageInfo for a given input file. In lines 116-133 of this function, the for loop iterates through each page and builds the list of imageInfo by calling the pageImages function as follows:
for pageNum := 1; pageNum <= numPages; pageNum++ {
	page, err := pdfReader.GetPage(pageNum)
	showError(nil, err, "pdfReader.GetPage failed: page %d", pageNum)
	if err != nil {
		continue
	}
	// List images on the page.
	pageInfo, err := pageImages(page)
	if err != nil || len(pageInfo) == 0 {
		continue
	}
	for i := range pageInfo {
		pageInfo[i].path = inputPath
		pageInfo[i].page = pageNum
	}
	fileInfo = append(fileInfo, pageInfo...)
}
This list of image information is returned at line 137.
The pageImages function, defined in lines 141-147, returns a list of images detected in the provided page. This is done by getting all the page content streams and then using contentStreamImages(contents, page.Resources) to get the list of images.
The contentStreamImages function, defined in lines 152-283, parses the provided content stream, iterates through each content stream operation, and collects all the images. The operand of each operation is checked to see whether it is BI (an inline image) or Do (a named XObject). If it is BI, the image is obtained from the first parameter as follows:
iimg, ok := op.Params[0].(*contentstream.ContentStreamInlineImage)
if !ok {
	continue
}
var width, height, cpts, bpc int
img, err := iimg.ToImage(resources)
If the operand is Do, the image is stored as an XObject in the page resources and is looked up by name as follows:
name := op.Params[0].(*core.PdfObjectName)
// Only process each one once.
if _, has := processedXObjects[string(*name)]; has {
	continue
}
processedXObjects[string(*name)] = true
_, xtype := resources.GetXObjectByName(*name)
After this, the XObject is processed according to its xtype. If its type is model.XObjectTypeImage, it is processed as follows:
ximg, err := resources.GetXObjectImageByName(*name)
showError(errors, err, "GetXObjectImageByName failed: %q ", *name)
if err != nil {
	continue
}
var width, height, cpts, bpc int
img, err := ximg.ToImage()
However, if its type is model.XObjectTypeForm, it is processed using the following code:
// Go through the XObject Form content stream.
xform, err := resources.GetXObjectFormByName(*name)
showError(errors, err, "GetXObjectFormByName failed: %q", *name)
if err != nil {
	continue
}
formContent, err := xform.GetContentStream()
showError(errors, err, "GetContentStream failed: %q", *name)
if err != nil {
	continue
}
// Process the content stream in the Form object too.
formResources := xform.Resources
if formResources == nil {
	formResources = resources
}
formDescs, err := contentStreamImages(string(formContent), formResources)
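The overall control flow of this walk (record inline images, resolve each named XObject once, and recurse into form XObjects) can be sketched without UniPDF. The op and xobject types, the collectImages and walk functions, and the string labels below are all hypothetical simplifications made for illustration. In particular, this toy version shares a single seen set across nested forms, which is an assumption and may differ from how the example scopes processedXObjects.

```go
package main

import "fmt"

// op is a toy stand-in for one parsed content stream operation.
type op struct {
	operand string // operator name, such as "BI" or "Do"
	name    string // XObject name for "Do", or a label for an inline image
}

// xobject is a toy XObject: an image, or a form that carries its own operations.
type xobject struct {
	isForm bool
	ops    []op
}

// collectImages returns a label for every image reachable from ops.
func collectImages(ops []op, resources map[string]xobject) []string {
	return walk(ops, resources, map[string]bool{})
}

// walk records inline images, resolves each named XObject once,
// and recurses into forms.
func walk(ops []op, resources map[string]xobject, seen map[string]bool) []string {
	var found []string
	for _, o := range ops {
		switch o.operand {
		case "BI":
			found = append(found, "inline:"+o.name)
		case "Do":
			if seen[o.name] {
				continue // Only process each XObject once.
			}
			seen[o.name] = true
			x, ok := resources[o.name]
			if !ok {
				continue
			}
			if x.isForm {
				found = append(found, walk(x.ops, resources, seen)...)
			} else {
				found = append(found, "xobject:"+o.name)
			}
		}
	}
	return found
}

func main() {
	resources := map[string]xobject{
		"Im1": {},
		"Im2": {},
		"Fm1": {isForm: true, ops: []op{{"Do", "Im2"}, {"BI", "a"}}},
	}
	page := []op{{"BI", "b"}, {"Do", "Im1"}, {"Do", "Fm1"}, {"Do", "Im1"}}
	fmt.Println(collectImages(page, resources))
	// Prints: [inline:b xobject:Im1 xobject:Im2 inline:a]
}
```

The second reference to Im1 is skipped, and the image nested inside the form Fm1 is still found.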
Lines 284-295 define the imageInfo type, which contains all the details for a given image. The String() method for this type is defined in lines 297-312 and returns the string representation of the imageInfo.
The asStrings() method, defined in lines 314-335, returns a slice of strings, one for each field of the imageInfo object.
The header variable in line 337 contains a list of strings which are used as column headers for the CSV file.
The asList function is defined in lines 350-378. This function collects the imageInfo of each file in the corpus and returns it as one big list. It optionally sorts the images based on their dimensions.
The coallesce function in lines 380-403 builds a list of unique imageInfo values from the full list. The count field records how many times each imageInfo occurs. This count is updated in lines 390-396 as follows:
k := info.String()
if _, ok := uniques[k]; !ok {
	uniques[k] = info
}
info := uniques[k]
info.count++
uniques[k] = info
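The same counting idea can be shown as a self-contained sketch. The reduced imageInfo struct, its key method, and the coalesce function here are hypothetical and only illustrate the pattern of keying a map on a string representation and incrementing a count. The insertion-order slice is an addition made for this sketch so that the output is deterministic.

```go
package main

import "fmt"

// imageInfo is a reduced, hypothetical version of the example's type.
type imageInfo struct {
	filter        string
	width, height int
	count         int
}

// key returns the string used to decide whether two images are the same.
func (info imageInfo) key() string {
	return fmt.Sprintf("%s %dx%d", info.filter, info.width, info.height)
}

// coalesce merges identical entries and records how often each one occurred.
func coalesce(infos []imageInfo) []imageInfo {
	uniques := map[string]imageInfo{}
	var order []string // first-seen order, so the result is deterministic
	for _, info := range infos {
		k := info.key()
		u, ok := uniques[k]
		if !ok {
			u = info
			u.count = 0
			order = append(order, k)
		}
		u.count++
		uniques[k] = u
	}
	out := make([]imageInfo, 0, len(order))
	for _, k := range order {
		out = append(out, uniques[k])
	}
	return out
}

func main() {
	infos := []imageInfo{
		{filter: "DCTDecode", width: 100, height: 50},
		{filter: "FlateDecode", width: 10, height: 10},
		{filter: "DCTDecode", width: 100, height: 50},
	}
	for _, u := range coalesce(infos) {
		fmt.Println(u.key(), u.count)
	}
	// Prints:
	// DCTDecode 100x50 2
	// FlateDecode 10x10 1
}
```

Because map values in Go are copies, the entry has to be written back to the map after count is incremented, which is what the last line of the original snippet does as well.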
The saveAsCsv function is used to save the corpusInfo to a CSV file. In line 415, a new CSV writer is instantiated using csv.NewWriter(f). The corpusInfo object is turned into a list of imageInfo using asList(corpus, corpusInfo, doSort, byDoc, noRes) in line 424. The for loop in lines 426-432 iterates through each imageInfo and writes it to the CSV file.
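The write loop follows the usual encoding/csv pattern: write a header row, write one record per item, then flush and check for errors. Below is a minimal sketch of that pattern under simplifying assumptions: the row type, its three columns, and the writeCSV function are hypothetical, and the output goes to an in-memory buffer instead of a file.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strconv"
)

// row is a hypothetical, reduced record with three columns.
type row struct {
	path   string
	page   int
	filter string
}

// asStrings returns one string per column, in header order.
func (r row) asStrings() []string {
	return []string{r.path, strconv.Itoa(r.page), r.filter}
}

// writeCSV renders a header line followed by one line per row.
func writeCSV(rows []row) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"path", "page", "filter"}); err != nil {
		return "", err
	}
	for _, r := range rows {
		if err := w.Write(r.asStrings()); err != nil {
			return "", err
		}
	}
	// Flush pushes buffered records out; Error reports any write failure.
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := writeCSV([]row{
		{"a.pdf", 1, "DCTDecode"},
		{"my,file.pdf", 2, "FlateDecode"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
	// Prints:
	// path,page,filter
	// a.pdf,1,DCTDecode
	// "my,file.pdf",2,FlateDecode
}
```

The writer quotes fields that contain commas, so file paths with unusual characters remain a single column.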
The showSummary function, defined in lines 437-449, prints the summary of corpusInfo. The functions boolSummary, intSummary, and stringSummary are used to print the summary information for boolean, integer, and string fields respectively.
The functions boolKeys, intKeys and stringKeys are used to sort counts provided in the parameter based on the value entries.
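As an illustration of sorting keys by their counts, here is a minimal sketch. The sortedKeys name and the exact ordering (highest count first, ties broken alphabetically) are assumptions inferred from the sample output at the end of this guide, not taken from the example's source.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of counts, highest count first.
// Keys with equal counts are ordered alphabetically (an assumed tie-break).
func sortedKeys(counts map[string]int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		ci, cj := counts[keys[i]], counts[keys[j]]
		if ci != cj {
			return ci > cj
		}
		return keys[i] < keys[j]
	})
	return keys
}

func main() {
	byImage := map[string]int{"Raw": 1, "FlateDecode": 39, "DCTDecode": 128, "CCITTFaxDecode": 4}
	fmt.Println(sortedKeys(byImage))
	// Prints: [DCTDecode FlateDecode CCITTFaxDecode Raw]

	byFile := map[string]int{"Raw": 1, "FlateDecode": 1, "DCTDecode": 1, "CCITTFaxDecode": 1}
	fmt.Println(sortedKeys(byFile))
	// Prints: [CCITTFaxDecode DCTDecode FlateDecode Raw]
}
```

Go map iteration order is random, so an explicit sort like this is what makes the printed summary stable from run to run.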
The boolCounts function computes the counts of a boolean field specified in the selector and returns the count both by file and by image.
The intCounts function in lines 567-582 counts the frequency of an integer field across the corpusInfo and returns the counts by image and by file. The integer field to count is chosen by the selector function.
The stringCounts function in lines 584-599 counts the occurrences of a string field across the corpusInfo and returns the counts by image and by file.
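The difference between counting by image and counting by file can be sketched as follows. The reduced imageInfo struct, the countStrings name, and the map-of-slices shape used for corpusInfo are assumptions made for this illustration. The selector argument plays the role described above: it picks which field to count.

```go
package main

import "fmt"

// imageInfo is a reduced, hypothetical version of the example's type.
type imageInfo struct {
	filter string
	bpc    int
}

// countStrings tallies selector values twice: once per image,
// and once per file that contains at least one image with that value.
func countStrings(corpusInfo map[string][]imageInfo,
	selector func(imageInfo) string) (byImage, byFile map[string]int) {
	byImage = map[string]int{}
	byFile = map[string]int{}
	for _, infos := range corpusInfo {
		inFile := map[string]bool{}
		for _, info := range infos {
			v := selector(info)
			byImage[v]++
			inFile[v] = true
		}
		for v := range inFile {
			byFile[v]++
		}
	}
	return byImage, byFile
}

func main() {
	corpusInfo := map[string][]imageInfo{
		"a.pdf": {{"DCTDecode", 8}, {"DCTDecode", 8}, {"FlateDecode", 8}},
		"b.pdf": {{"DCTDecode", 8}},
	}
	byImage, byFile := countStrings(corpusInfo, func(i imageInfo) string { return i.filter })
	fmt.Println(byImage["DCTDecode"], byFile["DCTDecode"])     // 3 2
	fmt.Println(byImage["FlateDecode"], byFile["FlateDecode"]) // 1 1
}
```

Passing a different selector, for example one that returns a color space name, reuses the same counting logic for another field.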
The showError function in lines 603-616 prints a formatted error message for an error if it has not been reported before.
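One way to report each error only once is to keep a set of messages that have already been printed. The sketch below is an assumption about the general shape of such a helper, not the example's implementation: reportOnce, simpleErr, the boolean return value, and the choice to treat a nil map as "always print" are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// simpleErr is a tiny error type so the sketch needs no extra imports.
type simpleErr string

func (e simpleErr) Error() string { return string(e) }

// reportOnce prints a formatted message for err unless the same message
// is already recorded in seen. It returns true when something was printed.
// A nil seen map disables de-duplication.
func reportOnce(seen map[string]bool, err error, format string, args ...interface{}) bool {
	if err == nil {
		return false
	}
	msg := fmt.Sprintf(format, args...) + ": " + err.Error()
	if seen != nil {
		if seen[msg] {
			return false
		}
		seen[msg] = true
	}
	fmt.Fprintln(os.Stderr, msg)
	return true
}

func main() {
	seen := map[string]bool{}
	err := simpleErr("unsupported colorspace")
	fmt.Println(reportOnce(seen, err, "ToImage failed: %q", "Im1")) // true
	fmt.Println(reportOnce(seen, err, "ToImage failed: %q", "Im1")) // false
	fmt.Println(reportOnce(seen, nil, "no error here"))             // false
}
```

De-duplicating like this keeps the log readable when the same failure repeats on every page of a large document.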
The sumVals function counts the total number of imageInfo entries across the corpusInfo passed as its parameter.
The percentage function calculates and formats a percentage, given a total and a part of that total.
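A formatting helper of this kind might look like the sketch below. The argument order, the guard for a zero total, and the `%4.1f` width are assumptions chosen so that the result resembles lines such as `4 of 172 ( 2.3%)` in the sample output. The example's actual format string may differ.

```go
package main

import "fmt"

// percentage formats part as a share of total, for example "4 of 172 ( 2.3%)".
// When total is zero, the percentage is omitted to avoid dividing by zero.
func percentage(part, total int) string {
	if total == 0 {
		return fmt.Sprintf("%d of %d", part, total)
	}
	pct := 100.0 * float64(part) / float64(total)
	return fmt.Sprintf("%d of %d (%4.1f%%)", part, total, pct)
}

func main() {
	fmt.Println(percentage(128, 172)) // 128 of 172 (74.4%)
	fmt.Println(percentage(4, 172))   // 4 of 172 ( 2.3%)
	fmt.Println(percentage(172, 172)) // 172 of 172 (100.0%)
}
```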
The fileSizeMB function in lines 635-641 returns the size of the given file in MB.
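A file-size helper can be written with os.Stat. The following is a sketch under stated assumptions: it treats 1 MB as 1024*1024 bytes and returns -1 when the file cannot be read, neither of which is confirmed by this guide. The bytesToMB helper is hypothetical and only separates the arithmetic from the file access.

```go
package main

import (
	"fmt"
	"os"
)

// bytesToMB converts a byte count to megabytes, taking 1 MB as 1024*1024 bytes.
func bytesToMB(n int64) float64 {
	return float64(n) / 1024.0 / 1024.0
}

// fileSizeMB returns the size of the file at path in MB,
// or -1 if the file cannot be inspected (an assumed convention).
func fileSizeMB(path string) float64 {
	fi, err := os.Stat(path)
	if err != nil {
		return -1
	}
	return bytesToMB(fi.Size())
}

func main() {
	fmt.Printf("%.1f MB\n", bytesToMB(3*1024*1024)) // 3.0 MB
	// Prints -1 when the file does not exist.
	fmt.Println(fileSizeMB("no-such-file.pdf"))
}
```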
The makeUsage function updates flag.Usage to include a new usage message.
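Customizing the usage text with the standard flag package can be sketched as follows. To keep the example testable, this version works on a dedicated flag.FlagSet that writes to a buffer rather than on the global flag.Usage. That choice, the renderUsage helper, and the demo flag are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"flag"
	"fmt"
)

// makeUsage replaces the flag set's usage function so that msg is printed
// first, followed by the default description of each flag.
func makeUsage(fs *flag.FlagSet, msg string) {
	fs.Usage = func() {
		fmt.Fprintln(fs.Output(), msg)
		fs.PrintDefaults()
	}
}

// renderUsage builds a small flag set, installs the custom usage message,
// and returns the usage text as a string.
func renderUsage(msg string) string {
	var buf bytes.Buffer
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.SetOutput(&buf)
	fs.Bool("d", false, "Print debugging information.")
	makeUsage(fs, msg)
	fs.Usage()
	return buf.String()
}

func main() {
	fmt.Print(renderUsage("Usage: go run pdf_summarize_images.go [options] <file1.pdf> ..."))
}
```

The same closure can be assigned to the package-level flag.Usage variable when the program uses the default command-line flag set.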
Run the code
Run the code using the following command:
go run pdf_summarize_images.go ~/testdata/*.pdf
Sample Output
go run pdf_summarize_images.go PDF32000_2008.pdf
Here is the output of the above command.
0 of 1 "PDF32000_2008.pdf" 21.4 MB, 756 pages, 172 images, 1.1 sec
=================================================
Totals: 1 of 1 files contain images. 172 images
-----------------------------------------
inline
By image: 1
false 172 of 172 (100.0%)
By file: 1
false 1 of 1 (100.0%)
-----------------------------------------
filter
By image: 4
DCTDecode 128 of 172 (74.4%)
FlateDecode 39 of 172 (22.7%)
CCITTFaxDecode 4 of 172 ( 2.3%)
Raw 1 of 172 ( 0.6%)
By file: 4
CCITTFaxDecode 1 of 1 (100.0%)
DCTDecode 1 of 1 (100.0%)
FlateDecode 1 of 1 (100.0%)
Raw 1 of 1 (100.0%)
-----------------------------------------
color
By image: 4
ICCBased 102 of 172 (59.3%)
Indexed 39 of 172 (22.7%)
Separation 25 of 172 (14.5%)
DeviceGray 6 of 172 ( 3.5%)
By file: 4
DeviceGray 1 of 1 (100.0%)
ICCBased 1 of 1 (100.0%)
Indexed 1 of 1 (100.0%)
Separation 1 of 1 (100.0%)
-----------------------------------------
cpts
By image: 2
3 102 of 172 (59.3%)
1 70 of 172 (40.7%)
By file: 2
1 1 of 1 (100.0%)
3 1 of 1 (100.0%)
-----------------------------------------
bpc
By image: 2
8 168 of 172 (97.7%)
1 4 of 172 ( 2.3%)
By file: 2
1 1 of 1 (100.0%)
8 1 of 1 (100.0%)