Tesseract OCR — Open Source Text Recognition Engine

Tesseract OCR is an open source engine that extracts machine-readable text from images (PDF pages must be rendered to images first). This documentation covers installation on Windows, Linux, and macOS, configuring page segmentation and OCR engine modes, and integrating OCR into your software pipelines.

Note: Tesseract is a command-line program and a library (API). It does not include a graphical user interface. If you need one, use a third-party wrapper.

What is Tesseract?

Tesseract is an engine that takes raw image pixels and converts them into searchable text. Originating at Hewlett-Packard in 1985, it is now maintained by the open source community and supports more than 100 languages using Long Short-Term Memory (LSTM) neural networks.


Installation

Tesseract is a C++ program with native dependencies, so the easiest way to install it is through your system's package manager.

macOS

Homebrew is a common way to install Tesseract on macOS (Apple Silicon and Intel). The `tesseract-lang` formula adds the additional language data.

Terminal
brew install tesseract
brew install tesseract-lang

Ubuntu / Debian

The standard `apt` repositories carry stable versions of Tesseract.

Terminal
sudo apt install tesseract-ocr
sudo apt install tesseract-ocr-all

Windows

Pre-compiled Windows installers are provided by UB Mannheim. Tesseract can also be installed with package managers such as `scoop` or `winget`.

PowerShell
scoop install tesseract
scoop install tesseract-languages

Quickstart & Basic CLI

Tesseract is fundamentally a command-line tool. Reading an image and writing the recognized text to a file takes a single command.

Terminal
tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfile...]

Example: Extract English Text

To extract text from `invoice.png` and save it to `invoice_result.txt`:

Terminal
tesseract invoice.png invoice_result -l eng
Note: Omit the `.txt` extension from the output base. Tesseract appends `.txt` automatically when it writes plain-text output.
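
When the same command is issued from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not an official interface: `build_tesseract_command` and `run_tesseract` are hypothetical helper names, and the sketch assumes the `tesseract` executable is on the PATH.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang=None, psm=None, oem=None):
    """Return the argument list for one Tesseract CLI call (hypothetical helper)."""
    cmd = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_tesseract(image, output_base, **options):
    """Run the command; raises CalledProcessError if Tesseract exits non-zero."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, **options), check=True)


# Only builds and prints the command; nothing is executed here.
print(build_tesseract_command("invoice.png", "invoice_result", lang="eng"))
# prints ['tesseract', 'invoice.png', 'invoice_result', '-l', 'eng']
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.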

Page Segmentation Modes (PSM)

Page Segmentation Modes (PSM) determine how Tesseract analyzes the layout of an image to find text blocks. By default, Tesseract expects a full page of text (PSM 3). If your input is a single word, a single line, a single column, or sparse text, select the matching mode with the `--psm` flag.

Key Takeaway: Use **PSM 3 (Auto)** for general documents, **PSM 6 (Single Block)** for uniform text chunks, and **PSM 7 (Single Line)** for single-line crops such as labels or serial numbers. Tesseract recognizes text only; it does not decode barcodes.
  • 0: Orientation and script detection (OSD) only.
  • 1: Automatic page segmentation with OSD.
  • 3: Fully automatic page segmentation, but no OSD. *(Default)*
  • 4: Assume a single column of text of variable sizes.
  • 6: Assume a single uniform block of text.
  • 7: Treat the image as a single text line.
  • 11: Sparse text. Find as much text as possible in no particular order.
  • 13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

To treat the image as a single line of text and print the result to standard output:

Terminal
tesseract label.png stdout --psm 7
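
In scripts, bare mode numbers are easy to mistype. The sketch below names the modes listed in this section and turns one into the flag string. `PageSegMode` and `psm_config` are hypothetical names chosen for illustration, and the enum deliberately covers only the modes shown above.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Page segmentation modes from the list above (hypothetical names)."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SPARSE_TEXT = 11
    RAW_LINE = 13


def psm_config(mode):
    """Return the flag string for a mode; raises ValueError for unlisted numbers."""
    return f"--psm {int(PageSegMode(mode))}"


print(psm_config(PageSegMode.SINGLE_LINE))
# prints --psm 7
```

The resulting string can be appended to a CLI call or passed as the `config` argument of `pytesseract.image_to_string`.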

OCR Engine Modes (OEM)

OCR Engine Modes (OEM) switch between Tesseract's legacy pattern-matching engine and the modern LSTM neural network. Select a mode with the `--oem` flag to balance speed and recognition accuracy.

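Key Takeaway: **OEM 3 (Default)** is used when no `--oem` flag is given and picks the engine based on what the installed `.traineddata` files support. **OEM 1 (LSTM)** generally provides the highest accuracy. **OEM 0 (Legacy)** needs models that include legacy data and is mainly useful for inputs on which the LSTM models perform poorly.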
  • 0: Legacy engine only. (Character-pattern recognition, the engine from Tesseract 3.)
  • 1: Neural nets LSTM engine only. (Line-based recognition with an LSTM network; generally the most accurate.)
  • 2: Legacy + LSTM engines combined.
  • 3: Default, based on what is available in your `.traineddata` models.
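Compatibility: Not all language models support the legacy engine (OEM 0). The models in the `tessdata_fast` and `tessdata_best` repositories contain only LSTM data (OEM 1), while the models in the `tessdata` repository include both legacy and LSTM data.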
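
The compatibility rule can be checked before launching a job. This sketch assumes the usual split between the model repositories: `tessdata_fast` and `tessdata_best` ship LSTM-only models, while `tessdata` ships models with both legacy and LSTM data. `supports_oem` is a hypothetical helper, and individual model files may deviate from this rule.

```python
# Assumption: these repositories ship LSTM-only models.
LSTM_ONLY_REPOS = {"tessdata_fast", "tessdata_best"}
# Assumption: this repository ships models with legacy and LSTM data.
COMBINED_REPOS = {"tessdata"}


def supports_oem(repo, oem):
    """Return True if models from `repo` can be used with the given --oem value."""
    if oem not in (0, 1, 2, 3):
        raise ValueError(f"unknown OEM: {oem}")
    if repo in COMBINED_REPOS:
        return True
    if repo in LSTM_ONLY_REPOS:
        # OEM 0 and OEM 2 both need legacy data, which these models lack.
        return oem in (1, 3)
    raise ValueError(f"unknown model repository: {repo}")


print(supports_oem("tessdata_fast", 0))
# prints False
```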

Output Formats

While extracting to raw `stdout` or `.txt` is common, Tesseract can also emit searchable PDFs and layout data such as word bounding boxes.

Searchable PDFs

To convert an image to a searchable PDF in which the recognized text is laid invisibly over the original image:

Terminal
tesseract document.tif output_name pdf

Invisible Text Only PDF

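Useful if you are overlaying text on existing PDFs inside an orchestration pipeline. `textonly_pdf` is a configuration variable rather than an output format, so set it with `-c` alongside the `pdf` output:

Terminal
tesseract scan.png output -c textonly_pdf=1 pdf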

hOCR / TSV / ALTO

If you need the pixel bounding box and confidence score of every recognized word, use the hOCR, TSV, or ALTO output formats.

Terminal
tesseract input.png out hocr
tesseract input.png out tsv
tesseract input.png out alto
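
The TSV output is tab-separated with one row per layout element, and word rows carry a bounding box plus a confidence value. The following sketch reads such a file and keeps only words above a confidence threshold. `parse_tsv_words` is a hypothetical helper and `SAMPLE_TSV` is made-up data in the TSV column layout, not real Tesseract output.

```python
import csv
import io

# Made-up sample in Tesseract's TSV column layout (not real OCR output).
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tInvoice",
    "5\t1\t1\t1\t1\t2\t104\t92\t44\t24\t41.0\t#1O23",
])

WORD_LEVEL = 5  # level 5 rows describe single words


def parse_tsv_words(tsv_text, min_conf=0.0):
    """Return word boxes whose confidence is at least `min_conf`."""
    # QUOTE_NONE keeps quote characters in recognized text from confusing the parser.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        if int(row["level"]) != WORD_LEVEL or not text or conf < min_conf:
            continue
        words.append({
            "text": text,
            "conf": conf,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
        })
    return words


print([w["text"] for w in parse_tsv_words(SAMPLE_TSV, min_conf=50)])
# prints ['Invoice']
```

A threshold like 50 is only an illustration; a suitable cutoff depends on the documents being processed.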

Programming Wrappers

Do you want to integrate Tesseract inside a web application or microservice? The open source community has built wrappers for nearly every language.

Python (pytesseract)

Requires the Tesseract CLI tool to be installed on the machine.

Python
import pytesseract
from PIL import Image

img = Image.open('image.png')
text = pytesseract.image_to_string(img)
print(text)

JavaScript / Node.js (tesseract.js)

tesseract.js is a WebAssembly port of the Tesseract engine. It runs in the browser or in Node.js without a native Tesseract installation.

JavaScript
const Tesseract = require('tesseract.js');

Tesseract.recognize(
  'https://tesseract.project/image.png',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  console.log(text);
});

Training Custom OCR Models

Tesseract 5 uses the `tesstrain` project to train and fine-tune LSTM models. Training requires `make` and carefully prepared ground-truth data, which takes considerable effort to produce.

The tesstrain Repository

Unlike version 3, which relied heavily on manual box-file editing, version 5 training is driven by Makefiles that automate the training pipeline.

Terminal
git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
make tesseract-langdata

For details on curating Ground Truth (GT) and fine-tuning, refer directly to the `tesstrain` GitHub repository.
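
A frequent cause of failed training runs is a line image without a matching transcription. The sketch below checks a ground-truth folder for that, assuming the layout the `tesstrain` README describes: each line image (`.tif`, `.png`, `.bin.png`, or `.nrm.png`) sits next to a transcription with the same base name and the extension `.gt.txt`. `transcription_path` and `missing_transcriptions` are hypothetical helpers and not part of `tesstrain`.

```python
from pathlib import Path

# Longer suffixes first so "line.bin.png" maps to "line.gt.txt".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def transcription_path(image_path):
    """Return the expected .gt.txt path for a line image, or None for other files."""
    image_path = Path(image_path)
    name = image_path.name
    for suffix in IMAGE_SUFFIXES:
        if name.endswith(suffix):
            return image_path.with_name(name[: -len(suffix)] + ".gt.txt")
    return None


def missing_transcriptions(gt_dir):
    """Return the names of line images in `gt_dir` that lack a transcription."""
    missing = []
    for path in sorted(Path(gt_dir).iterdir()):
        expected = transcription_path(path)
        if expected is not None and not expected.is_file():
            missing.append(path.name)
    return missing


print(transcription_path("line_0001.bin.png").name)
# prints line_0001.gt.txt
```

Calling `missing_transcriptions` on the ground-truth folder before starting `make` gives a quick list of files to fix.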

Disclaimer: TesseractOCR.org is an independent, community-driven documentation project and is not affiliated with, endorsed by, or connected to the official Tesseract OCR project or its maintainers.