Tesseract OCR text order for documents with tables or rows

- June 15, 2012

I am using Tesseract OCR to convert scanned PDFs into plain text. Overall it has been highly effective, but I am having issues with the order in which the text is scanned. Documents with tabular data seem to be scanned down column by column, when the more natural way would be to scan row by row. A small-scale example would be:

this column a, row 1   column b, row 1    column c, row 1
column a, row 2   column b, row 2    column c, row 2

This is yielding the following text:

this column a, row 1 column a, row 2 column b, row 1 column b, row 2 column c, row 1 column c, row 2

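I am starting to read the documentation and, I guess, to test the parameters documented here with a brute-force approach, but if anyone has tackled a similar issue, I would appreciate any insight on how to fix it. I am not sure whether training data would help, since I do not know how that works.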

Try running Tesseract in one of the single-column page segmentation modes:

tesseract input.tif output-filename --psm 6

By default Tesseract expects a page of text when it segments an image. If you are only seeking to OCR a small region, try a different segmentation mode using the --psm argument (spelled -psm in Tesseract 3.x). Note that adding a white border to text that is too tightly cropped may also help; see issue 398.
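For scripted runs, the command above can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names `build_tesseract_command` and `run_ocr` are made up for illustration, and it assumes the `tesseract` executable is on the PATH and writes its result to the output base name plus `.txt`.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, psm=6, legacy_flag=False):
    """Return the argument list for one Tesseract run with a page segmentation mode.

    legacy_flag=True emits the single-dash "-psm" spelling used by Tesseract 3.x;
    otherwise the double-dash "--psm" spelling is used.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", str(image_path), str(output_base), flag, str(psm)]


def run_ocr(image_path, output_base, psm=6, legacy_flag=False):
    """Run Tesseract (hypothetical wrapper) and return the expected text file path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, psm, legacy_flag)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    return f"{output_base}.txt"
```

Calling `run_ocr("input.tif", "output-filename")` would then be equivalent to the command line shown above, with mode 6 as the default.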

To see the complete list of supported page segmentation modes, use tesseract -h. Here's the [ed: excerpt only] list as of 3.21:

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

See examples here: #using-different-page-segmentation-modes
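Since the question mentions a brute-force test of parameters, one way to compare modes is to generate one command per candidate mode and inspect each output by hand. The following is a rough sketch, not a recommended workflow: `PSM_MODES`, `psm_trials`, and the `out-psm<N>` output names are illustrative choices, and the mode numbers are the ones `tesseract -h` prints for the four entries listed above.

```python
# The four page segmentation modes from the excerpt, keyed by mode number.
PSM_MODES = {
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Assume a single column of text of variable sizes",
    5: "Assume a single uniform block of vertically aligned text",
    6: "Assume a single uniform block of text",
}


def psm_trials(image_path, modes=(4, 6)):
    """Yield (mode, command) pairs, one Tesseract command per candidate mode.

    Each command writes to its own output base name (hypothetical naming scheme)
    so the resulting text files can be compared side by side.
    """
    for mode in modes:
        if mode not in PSM_MODES:
            raise ValueError(f"mode {mode} is not in the excerpted list")
        yield mode, ["tesseract", image_path, f"out-psm{mode}", "--psm", str(mode)]


def print_trials(image_path, modes=(4, 6)):
    """Print each candidate mode with its description and command line."""
    for mode, command in psm_trials(image_path, modes):
        print(f"{mode}: {PSM_MODES[mode]}")
        print("   " + " ".join(command))
```

The commands are only built here, not executed; they can be passed to `subprocess.run` once Tesseract is installed.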
