Tesseract OCR text order for documents with tables or rows
I am using Tesseract OCR to convert scanned PDFs to plain text. Overall it is highly effective, but I am having issues with the order in which the text is scanned. Documents with tabular data seem to be scanned down column by column, when the more natural way would be to scan row by row. A small-scale example would be:
Column A, Row 1    Column B, Row 1    Column C, Row 1
Column A, Row 2    Column B, Row 2    Column C, Row 2
is yielding the following text:
Column A, Row 1
Column A, Row 2
Column B, Row 1
Column B, Row 2
Column C, Row 1
Column C, Row 2
I am starting to read the documentation and to take a guess, test, and brute force approach to the parameters documented here, but if anyone has tackled a similar issue, I would appreciate any insight on how to fix it. As for training data, I do not know how that works.
Try running Tesseract in one of the single column page segmentation modes:
tesseract input.tif output-filename --psm 6
By default, Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode, using the -psm argument (spelled --psm in Tesseract 4 and later, as in the command above). Note that adding a white border to text that is too tightly cropped may also help; see issue 398. To see a complete list of supported page segmentation modes, use tesseract -h. Here's the [ed: excerpt only] list as of 3.21:
- 3: Fully automatic page segmentation, but no OSD. (Default)
- 4: Assume a single column of text of variable sizes.
- 5: Assume a single uniform block of vertically aligned text.
- 6: Assume a single uniform block of text.
See examples here: #using-different-page-segmentation-modes
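Since the question mentions a guess-and-test approach, here is a minimal sketch of how the modes could be compared side by side. It assumes the same input.tif as the command above; the output-psmN base names and the psm_cmd helper are made up for this illustration. It uses the --psm spelling; some older 3.x releases expect -psm instead, as in the quoted documentation.

```shell
#!/bin/sh
# Sketch only: output names and the psm_cmd helper are placeholders,
# not something from the original post or from Tesseract itself.

# Build the command line for one page segmentation mode.
# $1 = input image, $2 = output base name, $3 = PSM number
psm_cmd() {
    printf 'tesseract %s %s --psm %s' "$1" "$2" "$3"
}

# 3 = default (baseline), 4 = single column, 6 = single uniform block.
for psm in 3 4 6; do
    cmd=$(psm_cmd input.tif "output-psm$psm" "$psm")
    echo "$cmd"
    # Only run when tesseract and the assumed input file are actually present.
    if command -v tesseract >/dev/null 2>&1 && [ -f input.tif ]; then
        # Unquoted on purpose so the string splits into arguments;
        # this is fine here because the file names contain no spaces.
        $cmd || echo "psm $psm failed" >&2
    fi
done
```

Each run writes its text to output-psmN.txt, so the files can be compared to see which mode, if any, reads the table row by row for a given scan.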