Text Extraction from Presentation

This guide demonstrates how to extract text from a presentation file using UniOffice.

Sample input

[Image: the sample input presentation]

Before you begin

You need a metered API key, which you can get from your UniCloud account.

If this is your first time using the UniOffice SDK, follow this guide to set up a local development environment.

Clone the project repository

In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.

git clone https://github.com/unidoc/unioffice-examples

To get the example, navigate to the presentation/text_extraction folder in the unioffice-examples directory.

cd unioffice-examples/presentation/text_extraction/

How it works

The import section in lines 4-10 imports the necessary UniOffice packages and other Go libraries. The init function sets the metered license key, which is required to use the UniOffice packages.
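
The key-loading step can be sketched with the standard library alone. This is a minimal illustration, not the example's actual init function: the environment variable name UNIDOC_LICENSE_API_KEY and the helper loadLicenseKey are assumptions made here, and the call into the UniOffice license package is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// licenseEnvVar is an assumed environment variable name; this guide does not state it.
const licenseEnvVar = "UNIDOC_LICENSE_API_KEY"

// loadLicenseKey (hypothetical helper) returns the trimmed key,
// or an error if the variable is unset or blank.
func loadLicenseKey(getenv func(string) string) (string, error) {
	key := strings.TrimSpace(getenv(licenseEnvVar))
	if key == "" {
		return "", errors.New(licenseEnvVar + " is not set")
	}
	return key, nil
}

func main() {
	key, err := loadLicenseKey(os.Getenv)
	if err != nil {
		// Report the problem early instead of failing later during extraction.
		fmt.Println("license error:", err)
		return
	}
	// In the real example, the key would be passed to the UniOffice license package in init.
	fmt.Printf("loaded a metered key of %d characters\n", len(key))
}
```

Passing the lookup function as a parameter keeps the check easy to test without touching the real environment.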

The main function, defined in lines 21-58, contains the code that extracts the text from the presentation file shown in the sample input section. Line 23 instantiates a presentation.Presentation object by opening the presentation file. Lines 30-31 extract the contents as plain text and print them using:

pe := ppt.ExtractText()
fmt.Println(pe.Text()) 

The two nested for loops in lines 32-57 iterate through each presentation.TextItem of each presentation.SlideText and extract the text content and other metadata of the presentation file, such as bold and italic flags, font size, fill, and table position.
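
The shape of that nested iteration can be sketched as follows. The textItem and slideText types here are simplified, hypothetical stand-ins for presentation.TextItem and presentation.SlideText, so the field names are assumptions and the real types carry more information.

```go
package main

import (
	"fmt"
	"strings"
)

// textItem is a hypothetical, simplified stand-in for presentation.TextItem.
type textItem struct {
	Text   string
	Bold   bool
	Italic bool
}

// slideText is a hypothetical, simplified stand-in for presentation.SlideText.
type slideText struct {
	Items []textItem
}

// describe renders every item of every slide; the item index restarts on each slide.
func describe(slides []slideText) string {
	var b strings.Builder
	for _, slide := range slides {
		for i, item := range slide.Items {
			fmt.Fprintf(&b, "%d\n%s\nBold: %t\nItalic: %t\n--------\n",
				i, item.Text, item.Bold, item.Italic)
		}
	}
	return b.String()
}

func main() {
	slides := []slideText{
		{Items: []textItem{{Text: "The title"}, {Text: "Some extra info", Italic: true}}},
		{Items: []textItem{{Text: "Header 1", Bold: true}}},
	}
	fmt.Print(describe(slides))
}
```

Because the inner loop index belongs to a single slide, the numbering starts again at 0 for each slide, which matches the pattern in the sample output.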

Run the code

Run this command to extract the text content of the presentation file.

go run main.go

Sample output

Some text 
containing several lines.
Lorem ipsum 
dolor
 sit 
amet
, 
consectetur
 
adipiscing
 
elit
.
The title
The subtitle
Some extra info
The table
Column 1
Column 2
Column 3
Column 4
Row 1
Cell 11
Cell 12
Cell 13
Cell 14
Row 2
Cell 21
Cell 22
Cell 23
Cell 24
Header 1
While text is sorted from the left to the right first and then from the top to the bottom, this text should go right after ‘Header 1’ because they share the same textbox
Header 2
While being located a bit above ‘Header 1’ textbox, this header and text should go after it as the difference between y-coordinates of textboxes is not 
significant compared to the difference between x coordinates
Header 3
Header 4
This text does not share the same textbox with ‘Header 3’ so it is treated as a separate element and will go only after ‘Header 4’
This text will go last anyway

0
Some text 
Bold: false
Italic: false
--------
1
containing several lines.
Bold: false
Italic: false
--------
2
Lorem ipsum 
Bold: false
Italic: false
--------
3
dolor
Bold: false
Italic: false
--------
4
 sit 
Bold: false
Italic: false
--------
5
amet
Bold: false
Italic: false
--------
6
, 
Bold: false
Italic: false
--------
7
consectetur
Bold: false
Italic: false
--------
8
 
Bold: false
Italic: false
--------
9
adipiscing
Bold: false
Italic: false
--------
10
 
Bold: false
Italic: false
--------
11
elit
Bold: false
Italic: false
--------
12
.
Bold: false
Italic: false
--------
13
The title
Bold: false
Italic: false
Font size: 36
SolidFill: bg1
--------
14
The subtitle
Bold: false
Italic: false
Font size: 12
SolidFill: bg1
--------
15
Some extra info
Bold: false
Italic: true
Font size: 14
SolidFill: bg1
--------
0
The table
Bold: false
Italic: false
--------
1
Column 1
Bold: false
Italic: false
Row: 0
Column: 1
height: 370840
width: 2048193
--------
2
Column 2
Bold: false
Italic: false
Row: 0
Column: 2
height: 370840
width: 2048193
--------
3
Column 3
Bold: false
Italic: false
Row: 0
Column: 3
height: 370840
width: 2048193
--------
4
Column 4
Bold: false
Italic: false
Row: 0
Column: 4
height: 370840
width: 2048193
--------
5
Row 1
Bold: false
Italic: false
Row: 1
Column: 0
height: 370840
width: 2048193
--------
6
Cell 11
Bold: false
Italic: false
Row: 1
Column: 1
height: 370840
width: 2048193
--------
7
Cell 12
Bold: false
Italic: false
Row: 1
Column: 2
height: 370840
width: 2048193
--------
8
Cell 13
Bold: false
Italic: false
Row: 1
Column: 3
height: 370840
width: 2048193
--------
9
Cell 14
Bold: false
Italic: false
Row: 1
Column: 4
height: 370840
width: 2048193
--------
10
Row 2
Bold: false
Italic: false
Row: 2
Column: 0
height: 370840
width: 2048193
--------
11
Cell 21
Bold: false
Italic: false
Row: 2
Column: 1
height: 370840
width: 2048193
--------
12
Cell 22
Bold: false
Italic: false
Row: 2
Column: 2
height: 370840
width: 2048193
--------
13
Cell 23
Bold: false
Italic: false
Row: 2
Column: 3
height: 370840
width: 2048193
--------
14
Cell 24
Bold: false
Italic: false
Row: 2
Column: 4
height: 370840
width: 2048193
--------
0
Header 1
Bold: true
Italic: false
Font size: 24
--------
1
While text is sorted from the left to the right first and then from the top to the bottom, this text should go right after ‘Header 1’ because they share the same textbox
Bold: true
Italic: false
--------
2
Header 2
Bold: true
Italic: false
Font size: 24
--------
3
While being located a bit above ‘Header 1’ textbox, this header and text should go after it as the difference between y-coordinates of textboxes is not 
Bold: true
Italic: false
--------
4
significant compared to the difference between x coordinates
Bold: true
Italic: false
--------
5
Header 3
Bold: true
Italic: false
Font size: 24
--------
6
Header 4
Bold: true
Italic: false
Font size: 24
--------
7
This text does not share the same textbox with ‘Header 3’ so it is treated as a separate element and will go only after ‘Header 4’
Bold: true
Italic: false
--------
8
This text will go last anyway
Bold: true
Italic: false
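
The height and width values printed for table cells are not labeled with a unit. Assuming they are English Metric Units (EMUs), the unit that Office Open XML drawings use for lengths, the sketch below converts them to points and inches. The helper names are made up for this illustration.

```go
package main

import "fmt"

// Standard EMU conversion factors: 914400 EMUs per inch, 12700 EMUs per point.
const (
	emuPerInch  = 914400
	emuPerPoint = 12700
)

// emuToInches (hypothetical helper) converts a length in EMUs to inches.
func emuToInches(emu int64) float64 { return float64(emu) / emuPerInch }

// emuToPoints (hypothetical helper) converts a length in EMUs to points.
func emuToPoints(emu int64) float64 { return float64(emu) / emuPerPoint }

func main() {
	// Values taken from the sample output, assumed to be EMUs.
	fmt.Printf("height: %.1f pt\n", emuToPoints(370840))
	fmt.Printf("width: %.2f in\n", emuToInches(2048193))
}
```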
