Text Extraction
This guide shows how to extract text and its formatting from a Word document.
Before you begin
You should get your API key from your UniCloud account.
If this is your first time using UniOffice SDK, follow this guide to set up a local development environment.
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.
git clone https://github.com/unidoc/unioffice-examples
To get the example, navigate to the document/text_extraction folder in the unioffice-examples directory.
cd unioffice-examples/document/text_extraction
The repository contains the document.docx file and the example code main.go used in this example.
How it works
Lines 4-11 import the UniOffice packages and other required dependencies.
The init function in lines 13-20 authenticates your request with your UNIDOC_LICENSE_API_KEY.
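As a rough illustration of the environment lookup behind this step, the sketch below reads UNIDOC_LICENSE_API_KEY and fails early when it is missing. The helper licenseKeyFromEnv is hypothetical and not part of UniOffice. The call that hands the key to the library is deliberately left out, so treat this as a sketch of the check, not as the code in main.go.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// licenseKeyFromEnv is a hypothetical helper for this sketch (not a UniOffice
// function). It returns the value of the named environment variable and an
// error when the variable is unset or empty.
func licenseKeyFromEnv(name string) (string, error) {
	key, ok := os.LookupEnv(name)
	if !ok || key == "" {
		return "", errors.New(name + " is not set")
	}
	return key, nil
}

func main() {
	key, err := licenseKeyFromEnv("UNIDOC_LICENSE_API_KEY")
	if err != nil {
		fmt.Println("cannot authenticate:", err)
		return
	}
	// The real example passes the key on to the library at this point;
	// that call is omitted here on purpose.
	fmt.Println("license key loaded, length:", len(key))
}
```

Checking the variable explicitly gives a clearer message than an authentication failure later in the run.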
The main function, spanning lines 22 to 63, opens the document document.docx on line 23 and uses the ExtractText method on line 28 to obtain the formatted text.
Next, a loop iterates over all the extracted elements and prints the text of each one. It then checks whether the element contains a Run. If so, it retrieves the run properties and prints them to the console: whether the text is bold, italic, or underlined, and its color.
The code also checks whether the element belongs to a table. If it does, it prints the row and column index of the cell, along with the shade color, if present.
Finally, it checks whether the element is a drawing and, if so, prints its height and width in millimeters.
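The checks described above can be sketched with plain Go types. Everything in the block below (item, runInfo, tableInfo, drawingInfo, describe) is a stand-in invented for this illustration and not the UniOffice API. It only mirrors the order of the checks and the shape of the sample output further down, under the assumption that missing run, table, or drawing data is represented by a nil pointer.

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-in types for this sketch only; they are not the UniOffice types.
type runInfo struct {
	Bold, Italic bool
	Color        string // hex color, empty when unset
}

type tableInfo struct {
	Row, Col int
	Shade    string // shade color, empty when unset
}

type drawingInfo struct {
	HeightMM, WidthMM float64
}

type item struct {
	Text    string
	Run     *runInfo     // nil when the element has no run
	Table   *tableInfo   // nil when the element is not in a table
	Drawing *drawingInfo // nil when the element is not a drawing
}

// describe formats one element: index, text, then run, table, and drawing
// details in that order, skipping whatever is absent.
func describe(index int, it item) string {
	var b strings.Builder
	fmt.Fprintln(&b, index)
	fmt.Fprintln(&b, "Text:", it.Text)
	if it.Run != nil {
		fmt.Fprintln(&b, "Bold:", it.Run.Bold)
		fmt.Fprintln(&b, "Italic:", it.Run.Italic)
		if it.Run.Color != "" {
			fmt.Fprintln(&b, "Color:", it.Run.Color)
		}
	}
	if it.Table != nil {
		fmt.Fprintln(&b, "Row:", it.Table.Row)
		fmt.Fprintln(&b, "Column:", it.Table.Col)
		if it.Table.Shade != "" {
			fmt.Fprintln(&b, "Shade color:", it.Table.Shade)
		}
	}
	if it.Drawing != nil {
		fmt.Fprintln(&b, "Height in mm:", it.Drawing.HeightMM)
		fmt.Fprintln(&b, "Width in mm:", it.Drawing.WidthMM)
	}
	b.WriteString("--------\n")
	return b.String()
}

func main() {
	items := []item{
		{Text: "Paragraph 2", Run: &runInfo{Italic: true}},
		{Text: "Column 1", Run: &runInfo{}, Table: &tableInfo{Row: 0, Col: 1, Shade: "#E7E6E6"}},
	}
	for i, it := range items {
		fmt.Print(describe(i, it))
	}
}
```

Returning a string from describe instead of printing directly keeps the formatting easy to test in isolation.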
If you only need plain text, call the Text method on the result of ExtractText.
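To give a feel for what flattening amounts to, here is a minimal sketch that joins element texts one per line. The function flatten is hypothetical and is not the library's Text method. It assumes that elements with empty text are skipped, which is what the FLATTENED part of the sample output suggests but which the guide does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// flatten is a hypothetical stand-in for this sketch. It writes each
// non-empty text on its own line and skips empty entries (an assumption
// based on the sample output, not confirmed library behavior).
func flatten(texts []string) string {
	var b strings.Builder
	for _, t := range texts {
		if t == "" {
			continue
		}
		b.WriteString(t)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	fmt.Print(flatten([]string{"Paragraph 4", "", "", "Hi, I am a Text Box"}))
}
```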
Run the code
Execute this command to extract both the text and its formatting from the document.
go run main.go
Sample output
0
Text: Paragraph 1
Bold: false
Italic: false
--------
1
Text: Paragraph 2
Bold: false
Italic: true
--------
2
Text: Table 1
Bold: false
Italic: false
Row: 0
Column: 0
--------
3
Text: Column 1
Bold: false
Italic: false
Row: 0
Column: 1
Shade color: #E7E6E6
--------
4
Text: Column 2
Bold: false
Italic: false
Row: 0
Column: 2
--------
5
Text: Row 1
Bold: false
Italic: false
Row: 1
Column: 0
--------
6
Text: Cell 1-1
Bold: false
Italic: false
Row: 1
Column: 1
Shade color: #E7E6E6
--------
7
Text: Cell 1-2
Bold: false
Italic: false
Row: 1
Column: 2
--------
8
Text: Paragraph 3
Bold: true
Italic: false
--------
9
Text: Paragraph 4
Bold: false
Italic: false
Color: #C00000
Highlight: lightGray
--------
10
Text:
--------
11
Text:
--------
12
Text:
Bold: false
Italic: false
Color: #C00000
--------
13
Text: Hi, I am a Text Box
Bold: false
Italic: false
Height in mm: 54.04247299066626
Width in mm: 93.76369008325126
--------
FLATTENED:
Paragraph 1
Paragraph 2
Table 1
Column 1
Column 2
Row 1
Cell 1-1
Cell 1-2
Paragraph 3
Paragraph 4
Hi, I am a Text Box