Text Extraction from Presentation
This guide demonstrates how to extract text from a presentation file using UniOffice.
Sample input
Before you begin
You should get your API key from your UniCloud account.
If this is your first time using the UniOffice SDK, follow this guide to set up a local development environment.
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.
git clone https://github.com/unidoc/unioffice-examples
To get the example, navigate to the presentation/text_extraction folder in the unioffice-examples directory.
cd unioffice-examples/presentation/text_extraction/
How it works
The import section in lines 4-10 imports the necessary packages and other Go libraries.
The init function sets the metered license key required to use the UniOffice packages.
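As a minimal sketch of the kind of check that can precede the license call, the snippet below reads an API key from the environment and fails early when it is missing. It uses only the Go standard library. The environment variable name UNIDOC_LICENSE_API_KEY and the helper loadAPIKey are assumptions made for illustration, and the actual UniOffice license call is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// licenseEnvVar is an assumed environment variable name for this sketch;
// use whatever name your project actually reads the key from.
const licenseEnvVar = "UNIDOC_LICENSE_API_KEY"

// loadAPIKey (hypothetical helper) looks up the key through the supplied
// lookup function and rejects empty or whitespace-only values.
func loadAPIKey(lookup func(string) string) (string, error) {
	key := strings.TrimSpace(lookup(licenseEnvVar))
	if key == "" {
		return "", errors.New(licenseEnvVar + " is not set")
	}
	return key, nil
}

func main() {
	key, err := loadAPIKey(os.Getenv)
	if err != nil {
		fmt.Println("license key missing:", err)
		return
	}
	// In the real example, init would hand this key to the UniOffice
	// license package at this point (that call is not shown here).
	fmt.Printf("loaded a key of %d characters\n", len(key))
}
```

Passing the lookup function as a parameter keeps the helper easy to exercise without touching the real environment.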
The main function, defined in lines 21-58, contains the code that extracts the text from the presentation file shown in the sample input section. Line 23 instantiates a presentation.Presentation object by opening the presentation file. Lines 30-31 extract the contents as plain text and print them using:
pe := ppt.ExtractText()
fmt.Println(pe.Text())
The two nested for loops in lines 32-57 iterate through each presentation.TextItem of each presentation.SlideText and extract the text content and other metadata of the presentation file.
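The shape of that nested iteration can be sketched with the standard library alone. The types textItem and slideText below are local stand-ins invented for this illustration; they are not the library's presentation.TextItem and presentation.SlideText, and they carry only a text value plus bold and italic flags. The output layout imitates the per-item blocks in the sample output, where the item index restarts at 0 on every slide.

```go
package main

import (
	"fmt"
	"strings"
)

// textItem and slideText are hypothetical stand-ins for this sketch only;
// the real library types expose more metadata than shown here.
type textItem struct {
	Text   string
	Bold   bool
	Italic bool
}

type slideText struct {
	Items []textItem
}

// describe walks every item of every slide and formats it as index,
// text, flags, and a separator line.
func describe(slides []slideText) string {
	var b strings.Builder
	for _, slide := range slides {
		// The index i is per slide, so it restarts at 0 for each one.
		for i, item := range slide.Items {
			fmt.Fprintln(&b, i)
			fmt.Fprintln(&b, item.Text)
			fmt.Fprintln(&b, "Bold:", item.Bold)
			fmt.Fprintln(&b, "Italic:", item.Italic)
			fmt.Fprintln(&b, "--------")
		}
	}
	return b.String()
}

func main() {
	slides := []slideText{
		{Items: []textItem{{Text: "The title"}, {Text: "Some extra info", Italic: true}}},
		{Items: []textItem{{Text: "Header 1", Bold: true}}},
	}
	fmt.Print(describe(slides))
}
```

Building the result in a strings.Builder rather than printing directly makes the loop easy to check against an expected string.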
Run the code
Run this command to extract the text content of the presentation file.
go run main.go
Sample output
Some text
containing several lines.
Lorem ipsum
dolor
sit
amet
,
consectetur
adipiscing
elit
.
The title
The subtitle
Some extra info
The table
Column 1
Column 2
Column 3
Column 4
Row 1
Cell 11
Cell 12
Cell 13
Cell 14
Row 2
Cell 21
Cell 22
Cell 23
Cell 24
Header 1
While text is sorted from the left to the right first and then from the top to the bottom, this text should go right after ‘Header 1’ because they share the same textbox
Header 2
While being located a bit above ‘Header 1’ textbox, this header and text should go after it as the difference between y-coordinates of textboxes is not
significant compared to the difference between x coordinates
Header 3
Header 4
This text does not share the same textbox with ‘Header 3’ so it is treated as a separate element and will go only after ‘Header 4’
This text will go last anyway
0
Some text
Bold: false
Italic: false
--------
1
containing several lines.
Bold: false
Italic: false
--------
2
Lorem ipsum
Bold: false
Italic: false
--------
3
dolor
Bold: false
Italic: false
--------
4
sit
Bold: false
Italic: false
--------
5
amet
Bold: false
Italic: false
--------
6
,
Bold: false
Italic: false
--------
7
consectetur
Bold: false
Italic: false
--------
8
Bold: false
Italic: false
--------
9
adipiscing
Bold: false
Italic: false
--------
10
Bold: false
Italic: false
--------
11
elit
Bold: false
Italic: false
--------
12
.
Bold: false
Italic: false
--------
13
The title
Bold: false
Italic: false
Font size: 36
SolidFill: bg1
--------
14
The subtitle
Bold: false
Italic: false
Font size: 12
SolidFill: bg1
--------
15
Some extra info
Bold: false
Italic: true
Font size: 14
SolidFill: bg1
--------
0
The table
Bold: false
Italic: false
--------
1
Column 1
Bold: false
Italic: false
Row: 0
Column: 1
height: 370840
width: 2048193
--------
2
Column 2
Bold: false
Italic: false
Row: 0
Column: 2
height: 370840
width: 2048193
--------
3
Column 3
Bold: false
Italic: false
Row: 0
Column: 3
height: 370840
width: 2048193
--------
4
Column 4
Bold: false
Italic: false
Row: 0
Column: 4
height: 370840
width: 2048193
--------
5
Row 1
Bold: false
Italic: false
Row: 1
Column: 0
height: 370840
width: 2048193
--------
6
Cell 11
Bold: false
Italic: false
Row: 1
Column: 1
height: 370840
width: 2048193
--------
7
Cell 12
Bold: false
Italic: false
Row: 1
Column: 2
height: 370840
width: 2048193
--------
8
Cell 13
Bold: false
Italic: false
Row: 1
Column: 3
height: 370840
width: 2048193
--------
9
Cell 14
Bold: false
Italic: false
Row: 1
Column: 4
height: 370840
width: 2048193
--------
10
Row 2
Bold: false
Italic: false
Row: 2
Column: 0
height: 370840
width: 2048193
--------
11
Cell 21
Bold: false
Italic: false
Row: 2
Column: 1
height: 370840
width: 2048193
--------
12
Cell 22
Bold: false
Italic: false
Row: 2
Column: 2
height: 370840
width: 2048193
--------
13
Cell 23
Bold: false
Italic: false
Row: 2
Column: 3
height: 370840
width: 2048193
--------
14
Cell 24
Bold: false
Italic: false
Row: 2
Column: 4
height: 370840
width: 2048193
--------
0
Header 1
Bold: true
Italic: false
Font size: 24
--------
1
While text is sorted from the left to the right first and then from the top to the bottom, this text should go right after ‘Header 1’ because they share the same textbox
Bold: true
Italic: false
--------
2
Header 2
Bold: true
Italic: false
Font size: 24
--------
3
While being located a bit above ‘Header 1’ textbox, this header and text should go after it as the difference between y-coordinates of textboxes is not
Bold: true
Italic: false
--------
4
significant compared to the difference between x coordinates
Bold: true
Italic: false
--------
5
Header 3
Bold: true
Italic: false
Font size: 24
--------
6
Header 4
Bold: true
Italic: false
Font size: 24
--------
7
This text does not share the same textbox with ‘Header 3’ so it is treated as a separate element and will go only after ‘Header 4’
Bold: true
Italic: false
--------
8
This text will go last anyway
Bold: true
Italic: false