Text Extraction

This guide shows how to extract text, together with its formatting, from a Word document.

Before you begin

You should get your API key from your UniCloud account.

If this is your first time using UniOffice SDK, follow this guide to set up a local development environment.

Clone the project repository

In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.

git clone https://github.com/unidoc/unioffice-examples

To get to the example, navigate to the document/text_extraction folder in the unioffice-examples directory.

cd unioffice-examples/document/text_extraction

The folder contains the input file document.docx and the example code main.go used in this guide.

How it works

Lines 4-11 import the UniOffice packages and other required dependencies.

The init function in lines 13-20 authenticates your request with your UNIDOC_LICENSE_API_KEY.
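Because the key is read from the environment, it can help to confirm that it is set before running the example. The following is a minimal sketch using only the Go standard library. The helper checkLicenseKey is hypothetical and not part of UniOffice; it only checks that the variable is non-empty and does not contact any licensing service.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// licenseKeyEnv is the environment variable named in this guide.
const licenseKeyEnv = "UNIDOC_LICENSE_API_KEY"

// checkLicenseKey is a hypothetical helper (not part of UniOffice).
// It rejects a missing or blank key so the problem is reported up front.
func checkLicenseKey(key string) error {
	if strings.TrimSpace(key) == "" {
		return errors.New(licenseKeyEnv + " is not set or is empty")
	}
	return nil
}

func main() {
	if err := checkLicenseKey(os.Getenv(licenseKeyEnv)); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("license key found; the example's init function can use it")
}
```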

The main function, in lines 22-63, opens document.docx on line 23 and calls the ExtractText method on line 28 to obtain the text together with its formatting.

A loop then iterates over every extracted element and prints its text. If the element has a Run, the code reads the run's properties and prints them to the console: whether the text is bold or italic, plus its color and highlight when present (as in the sample output).

The code also checks whether the element belongs to a table. If it does, it prints the row and column index of the cell, along with the cell's shade color, if present.
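The run and table checks follow a common pattern: each optional piece of information is a pointer that is nil when it does not apply. The sketch below shows that pattern with local stand-in types. The types item, runProps, and tableInfo and the function describe are hypothetical and are not the UniOffice types; the data is hand-written to mirror the Row, Column, and Shade color lines of the sample output.

```go
package main

import "fmt"

// Stand-in types for illustration only; they are NOT the UniOffice types.
type runProps struct {
	Bold, Italic bool
	Color        string // hex value without '#', empty when unset
}

type tableInfo struct {
	Row, Col int
	Shade    string // hex fill color without '#', empty when unset
}

type item struct {
	Text  string
	Run   *runProps  // nil when the element has no run
	Table *tableInfo // nil when the element is outside a table
}

// describe returns the lines printed for one element, following the
// layout of the sample output.
func describe(index int, it item) []string {
	lines := []string{fmt.Sprint(index), "Text: " + it.Text}
	if it.Run != nil {
		lines = append(lines,
			fmt.Sprintf("Bold: %t", it.Run.Bold),
			fmt.Sprintf("Italic: %t", it.Run.Italic))
		if it.Run.Color != "" {
			lines = append(lines, "Color: #"+it.Run.Color)
		}
	}
	if it.Table != nil {
		lines = append(lines,
			fmt.Sprintf("Row: %d", it.Table.Row),
			fmt.Sprintf("Column: %d", it.Table.Col))
		if it.Table.Shade != "" {
			lines = append(lines, "Shade color: #"+it.Table.Shade)
		}
	}
	return append(lines, "--------")
}

func main() {
	items := []item{
		{Text: "Paragraph 2", Run: &runProps{Italic: true}},
		{Text: "Column 1", Run: &runProps{}, Table: &tableInfo{Row: 0, Col: 1, Shade: "E7E6E6"}},
	}
	for i, it := range items {
		for _, line := range describe(i, it) {
			fmt.Println(line)
		}
	}
}
```

Checking each pointer before use is what lets a single loop handle paragraphs, table cells, and empty elements without special cases.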

Finally, the code checks whether the element is a drawing and, if so, prints its height and width in millimeters.
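Office Open XML usually stores drawing sizes in English Metric Units (EMU), with 914400 EMU per inch. Assuming the drawing's height and width are held in EMU, the conversion to millimeters can be sketched as follows. The function emuToMM is a hypothetical helper, not the SDK's own conversion, so the last digits may differ from the sample output.

```go
package main

import "fmt"

// emuPerMM follows from 914400 EMU per inch and 25.4 mm per inch:
// 914400 / 25.4 = 36000.
const emuPerMM = 36000.0

// emuToMM converts a length in EMU to millimeters.
// Hypothetical helper for illustration; not part of UniOffice.
func emuToMM(emu int64) float64 {
	return float64(emu) / emuPerMM
}

func main() {
	// 914400 EMU is exactly one inch.
	fmt.Printf("%.1f mm\n", emuToMM(914400))
}
```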

If you only need the plain text, call the Text method on the result of ExtractText.
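In the sample output, the elements with empty text do not appear in the FLATTENED section. Assuming that is how the flattening behaves, an equivalent manual version over a plain slice of strings might look like the sketch below. The function flatten is a hypothetical stand-in written for illustration and is not the SDK's Text method.

```go
package main

import (
	"fmt"
	"strings"
)

// flatten joins the non-empty texts, one per line.
// Hypothetical stand-in for illustration; not part of UniOffice.
func flatten(texts []string) string {
	var b strings.Builder
	for _, t := range texts {
		if t == "" {
			continue // empty elements contribute nothing
		}
		b.WriteString(t)
		b.WriteString("\n")
	}
	return b.String()
}

func main() {
	fmt.Print(flatten([]string{"Paragraph 4", "", "", "Hi, I am a Text Box"}))
}
```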

Run the code

Execute this command to extract both the text and its formatting from the document.

go run main.go

Sample output

0
Text: Paragraph 1
Bold: false
Italic: false
--------
1
Text: Paragraph 2
Bold: false
Italic: true
--------
2
Text: Table 1
Bold: false
Italic: false
Row: 0
Column: 0
--------
3
Text: Column 1
Bold: false
Italic: false
Row: 0
Column: 1
Shade color: #E7E6E6
--------
4
Text: Column 2
Bold: false
Italic: false
Row: 0
Column: 2
--------
5
Text: Row 1
Bold: false
Italic: false
Row: 1
Column: 0
--------
6
Text: Cell 1-1
Bold: false
Italic: false
Row: 1
Column: 1
Shade color: #E7E6E6
--------
7
Text: Cell 1-2
Bold: false
Italic: false
Row: 1
Column: 2
--------
8
Text: Paragraph 3
Bold: true
Italic: false
--------
9
Text: Paragraph 4
Bold: false
Italic: false
Color: #C00000
Highlight: lightGray
--------
10
Text:
--------
11
Text:
--------
12
Text:
Bold: false
Italic: false
Color: #C00000
--------
13
Text: Hi, I am a Text Box
Bold: false
Italic: false
Height in mm: 54.04247299066626
Width in mm: 93.76369008325126
--------

FLATTENED:
Paragraph 1
Paragraph 2
Table 1
Column 1
Column 2
Row 1
Cell 1-1
Cell 1-2
Paragraph 3
Paragraph 4
Hi, I am a Text Box
