Get hOCR output
This guide demonstrates how to get hOCR output for an image file using UniPDF’s OCR feature. hOCR is an open standard that represents OCR results as HTML, carrying the recognized text together with layout information such as bounding boxes and per-word confidence values.
Sample Input

Before you begin
Get your API key from your UniCloud account.
If this is your first time using UniPDF SDK, follow this guide to set up a local development environment.
You also need to have the UniDoc OCR server running.
First, clone the ocrserver repository:
git clone https://github.com/unidoc/ocrserver.git
Then, navigate to the ocrserver directory and run the server using Docker Compose:
cd ocrserver
docker-compose up
Clone the project repository
In your terminal, clone the examples repository. It contains the Go code we will be using for this guide.
git clone https://github.com/unidoc/unipdf-examples.git
Navigate to the ocr folder in the unipdf-examples directory.
cd unipdf-examples/ocr
How it works
The hocr_sample.go example shows how to get the hOCR output for a single image.
The code is similar to the ocr_sample.go example. The main difference is that in the OCROptions, the FormFields map has a format key with the value hocr. This tells the OCR server to return the output in hOCR format.
The main function calls the client.ProcessFile method, which returns the hOCR output as a string. The program then prints this string to the console.
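The source of hocr_sample.go is not reproduced in this guide, so the following is only a rough sketch of the kind of request described above, written with the Go standard library rather than the UniPDF client. Only the format field with the value hocr comes from this guide. The endpoint URL http://localhost:8080/file, the upload field name file, and the helper name buildHOCRRequest are assumptions made for illustration and may not match your server configuration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
)

// buildHOCRRequest builds a multipart POST that uploads an image and sets
// the form field format=hocr. The upload field name "file" is an assumption.
func buildHOCRRequest(url, filename string, image []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	// The guide states that a "format" key with value "hocr" selects hOCR output.
	if err := w.WriteField("format", "hocr"); err != nil {
		return nil, err
	}
	part, err := w.CreateFormFile("file", filepath.Base(filename))
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(image); err != nil {
		return nil, err
	}
	// Close writes the trailing multipart boundary.
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, url, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run sketch.go /path/to/your/image.png")
		return
	}
	image, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "read image:", err)
		os.Exit(1)
	}
	// Hypothetical endpoint; adjust to wherever your OCR server listens.
	req, err := buildHOCRRequest("http://localhost:8080/file", os.Args[1], image)
	if err != nil {
		fmt.Fprintln(os.Stderr, "build request:", err)
		os.Exit(1)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "send request:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read response:", err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
```

In the real example, the UniPDF client takes care of these steps. The sketch is only meant to show where the format field sits in the request.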
Run the code
Run the code with an image file as input:
go run hocr_sample.go /path/to/your/image.png
Sample Output
Extracted text: <div class='ocr_page' id='page_1' title='image ""; bbox 0 0 470 306; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 62 48 443 275">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 63 48 241 85">
<span class='ocr_line' id='line_1_1' title="bbox 63 48 241 85; baseline 0 -10; x_size 37; x_descenders 10; x_ascenders 7">
<span class='ocrx_word' id='word_1_1' title='bbox 63 49 187 75; x_wconf 95'>Secure</span>
<span class='ocrx_word' id='word_1_2' title='bbox 197 48 241 85; x_wconf 95'>by</span>
</span>
</p>
<p class='ocr_par' id='par_1_2' lang='eng' title="bbox 62 93 443 275">
<span class='ocr_line' id='line_1_2' title="bbox 129 93 136 100; baseline 0 0; x_size 21.671875; x_descenders 5.5572915; x_ascenders 5.5572915">
<span class='ocrx_word' id='word_1_3' title='bbox 129 93 136 100; x_wconf 0'>°</span>
</span>
<span class='ocr_line' id='line_1_3' title="bbox 62 95 184 132; baseline 0 -10; x_size 37; x_descenders 10; x_ascenders 7">
<span class='ocrx_word' id='word_1_4' title='bbox 62 95 184 132; x_wconf 96'>design</span>
</span>
<span class='ocr_line' id='line_1_4' title="bbox 62 153 443 172; baseline 0 -5; x_size 19; x_descenders 5; x_ascenders 4">
<span class='ocrx_word' id='word_1_5' title='bbox 62 154 107 172; x_wconf 97'>Every</span>
<span class='ocrx_word' id='word_1_6' title='bbox 113 154 175 167; x_wconf 96'>release</span>
<span class='ocrx_word' id='word_1_7' title='bbox 181 153 197 167; x_wconf 96'>of</span>
<span class='ocrx_word' id='word_1_8' title='bbox 203 157 231 167; x_wconf 93'>our</span>
<span class='ocrx_word' id='word_1_9' title='bbox 237 153 304 167; x_wconf 93'>libraries</span>
<span class='ocrx_word' id='word_1_10' title='bbox 310 153 322 167; x_wconf 93'>is</span>
<span class='ocrx_word' id='word_1_11' title='bbox 328 153 443 167; x_wconf 91'>automatical-</span>
</span>
<span class='ocr_line' id='line_1_5' title="bbox 62 180 431 199; baseline 0 -5; x_size 19; x_descenders 5; x_ascenders 4">
<span class='ocrx_word' id='word_1_12' title='bbox 62 181 75 199; x_wconf 96'>ly</span>
<span class='ocrx_word' id='word_1_13' title='bbox 80 181 134 194; x_wconf 96'>tested</span>
<span class='ocrx_word' id='word_1_14' title='bbox 140 180 206 199; x_wconf 97'>against</span>
<span class='ocrx_word' id='word_1_15' title='bbox 212 181 267 194; x_wconf 96'>known</span>
<span class='ocrx_word' id='word_1_16' title='bbox 273 180 392 194; x_wconf 96'>vulnerabilities</span>
<span class='ocrx_word' id='word_1_17' title='bbox 398 181 431 194; x_wconf 96'>and</span>
</span>
<span class='ocr_line' id='line_1_6' title="bbox 62 207 433 226; baseline 0 -5; x_size 19; x_descenders 5; x_ascenders 4">
<span class='ocrx_word' id='word_1_18' title='bbox 62 208 84 221; x_wconf 96'>do</span>
<span class='ocrx_word' id='word_1_19' title='bbox 90 209 117 221; x_wconf 96'>not</span>
<span class='ocrx_word' id='word_1_20' title='bbox 123 211 164 226; x_wconf 96'>pass</span>
<span class='ocrx_word' id='word_1_21' title='bbox 170 208 224 221; x_wconf 96'>unless</span>
<span class='ocrx_word' id='word_1_22' title='bbox 230 207 321 226; x_wconf 96'>everything</span>
<span class='ocrx_word' id='word_1_23' title='bbox 327 207 339 221; x_wconf 93'>is</span>
<span class='ocrx_word' id='word_1_24' title='bbox 345 207 433 221; x_wconf 92'>remediat-</span>
</span>
<span class='ocr_line' id='line_1_7' title="bbox 62 234 417 253; baseline 0 -5; x_size 19; x_descenders 5; x_ascenders 4">
<span class='ocrx_word' id='word_1_25' title='bbox 62 235 87 248; x_wconf 95'>ed.</span>
<span class='ocrx_word' id='word_1_26' title='bbox 92 235 111 248; x_wconf 95'>All</span>
<span class='ocrx_word' id='word_1_27' title='bbox 117 235 193 253; x_wconf 95'>changes</span>
<span class='ocrx_word' id='word_1_28' title='bbox 199 238 227 248; x_wconf 96'>are</span>
<span class='ocrx_word' id='word_1_29' title='bbox 232 234 307 253; x_wconf 96'>carefully</span>
<span class='ocrx_word' id='word_1_30' title='bbox 313 234 390 248; x_wconf 96'>reviewed</span>
<span class='ocrx_word' id='word_1_31' title='bbox 397 235 417 253; x_wconf 97'>by</span>
</span>
<span class='ocr_line' id='line_1_8' title="bbox 62 263 145 275; baseline 0 0; x_size 21.671875; x_descenders 5.5572915; x_ascenders 5.5572915">
<span class='ocrx_word' id='word_1_32' title='bbox 62 265 90 275; x_wconf 93'>our</span>
<span class='ocrx_word' id='word_1_33' title='bbox 95 263 145 275; x_wconf 93'>team.</span>
</span>
</p>
</div>
</div>
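Because hOCR is HTML, the word boxes and confidence values in the output above can be extracted with a small amount of code. The following is a minimal sketch, assuming the ocrx_word spans look exactly like those in the sample (a single-quoted title holding bbox followed by x_wconf). The names Word and parseWords are illustrative and not part of UniPDF, and a proper HTML parser would be more robust than this regular expression.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strconv"
)

// Word holds one recognized word with its bounding box and confidence.
type Word struct {
	Text           string
	X0, Y0, X1, Y1 int // bounding box in image pixels: left, top, right, bottom
	Conf           int // x_wconf value, 0 to 100
}

// wordRe matches ocrx_word spans in the attribute layout shown in the sample output.
var wordRe = regexp.MustCompile(
	`<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)'>([^<]*)</span>`)

// parseWords returns every word span found in the hOCR string, in document order.
func parseWords(hocr string) []Word {
	var words []Word
	for _, m := range wordRe.FindAllStringSubmatch(hocr, -1) {
		nums := make([]int, 5)
		for i := range nums {
			// The pattern only admits digits, so the error is ignored here.
			nums[i], _ = strconv.Atoi(m[i+1])
		}
		words = append(words, Word{
			Text: html.UnescapeString(m[6]), // decode entities such as &amp;
			X0:   nums[0], Y0: nums[1], X1: nums[2], Y1: nums[3],
			Conf: nums[4],
		})
	}
	return words
}

func main() {
	// Two word spans copied from the sample output above.
	sample := `<span class='ocrx_word' id='word_1_1' title='bbox 63 49 187 75; x_wconf 95'>Secure</span>
<span class='ocrx_word' id='word_1_3' title='bbox 129 93 136 100; x_wconf 0'>°</span>`

	for _, w := range parseWords(sample) {
		// Skip zero-confidence words, which are often recognition noise.
		if w.Conf == 0 {
			continue
		}
		fmt.Printf("%q conf=%d bbox=(%d,%d)-(%d,%d)\n", w.Text, w.Conf, w.X0, w.Y0, w.X1, w.Y1)
	}
}
```

Filtering on the confidence value is one simple way to discard stray marks such as the ° entry in the sample, which the OCR engine reported with x_wconf 0.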