
OCROpus - next gen. OCR for Linux

I was recently asked to scan a small book with relatively small text, so that it could be added to a database and looked up by typing keywords. The first step is to scan and OCR the entire book. Fortunately, I already had all the necessary scanning software (SANE and XSane), and my scanner was automatically recognised by Ubuntu.

The hard part was finding suitable OCR software for Linux. In the past, OCR was in a sad state of affairs in the Linux world indeed. With revived interest, though, last year saw a few Google Summer of Code projects released, including Tesseract and OCROpus. Since they are very recent additions, the software is not exactly mature, but it's the best there is, boasting 95% accuracy.

I installed OCROpus and Tesseract from Subversion, since it makes sense to check out the latest code. Though there are some recent releases in the Ubuntu repositories, I couldn't get OCROpus to recognise the needed extras with them. So the instructions I used are as follows:

svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib-read-only
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus-read-only

The Tesseract source has a bug that prevents it from compiling with gcc 4.3 (the default in Intrepid Ibex), so you need to apply this patch. Download it to the directory where tesseract-ocr-read-only is located and run:

patch -p1 < tesseract+gcc-4.3.diff
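Before applying a patch for real, it can help to check that it applies cleanly. The sketch below wraps the patch command in a dry run first. The function name apply_patch_safely is made up for illustration, --dry-run assumes GNU patch, and -p1 simply follows the command in this post; adjust it if the paths inside the diff differ.

```shell
# Apply a patch only if a dry run succeeds (assumes GNU patch for --dry-run).
# Usage: apply_patch_safely path/to/file.diff   (run from the directory to patch)
apply_patch_safely() {
    patch_file=$1
    if patch -p1 --dry-run < "$patch_file" > /dev/null 2>&1; then
        # The dry run changed nothing; now apply the patch for real.
        patch -p1 < "$patch_file"
    else
        echo "patch does not apply cleanly: $patch_file" >&2
        return 1
    fi
}
```

For the case above this would be called as apply_patch_safely tesseract+gcc-4.3.diff. If the dry run fails, nothing in the source tree is touched.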

Then you can run ./configure, followed by make and finally make install.

To install iulib, go into its directory and do:

./configure && make && sudo make install

The documentation also uses OpenFST, which you can download here. There is a problem, however, with building it on the latest Ubuntu, so I chose not to use it.

To build ocropus, go into the ocropus-read-only folder and do:

./configure --without-fst --without-leptonica

make

sudo make install
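Once everything is installed, a quick check that the binaries ended up on your PATH can save some head-scratching. This is only a sketch: check_tools is a hypothetical helper name, and it assumes the installed commands are called tesseract and ocroscript.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 if all of them are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (commented out so the snippet itself has no side effects):
# check_tools tesseract ocroscript || echo "check your PATH or rerun make install"
```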

Right now, OCROpus works only as a command-line tool, though one hopes some frontends are coming soon. In any case, using it from the command line is not too hard if all you want is to convert a page to HTML:

ocroscript recognize

You can also recognize a sequence of pages by listing them one by one in a file, say file-list, and running:

ocroscript recognize @file-list > text.html

WHAT DO THE ROLLS ROYCE PHANTOM AND THE HYUNDAI GENESIS HAVE IN COMMON? HINT: IT’S NOT THE PRICE. For starters——they both share a 17-speaker Lexic0n® 7.1 surround sound system} Now, we don’t suppose you’ll confuse a Genesis with a Rolls—Royce anytime soon, but these two luxury cars do share more vital appointments than you might expect. For instance—a quiet cabin—assm·ed (like the Phantoms) by whisper valves that select an alternate exhaust at low speeds to reduce noise, and by acoustic laminated windows. The car’s trailblazing ergonomics are exceptional too. A widely acclaimed DIS knob gives you intuitive access to GPS, the sound system, and any_ Bluetooth" phone. Outside, the Genesis glows with a finish so nearly perfect that (just like Rol1s—Royce) we have to use robots to achieve it. No wonder it looks so good. In fact, if you’d rather have money than a hood ornament, it may tend to look even better than a Rolls-Royce.
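For a whole book, typing the page list by hand gets tedious, so the file-list used above can be generated from the scan directory. This is a rough sketch with a few assumptions: make_file_list is a made-up helper name, the scans are PNG files named so that they sort in page order (page-001.png, page-002.png, ...), and the list file takes one image path per line.

```shell
# Write one image path per line to a list file, in glob (sorted) order.
# Usage: make_file_list SCAN_DIR LIST_FILE
make_file_list() {
    scan_dir=$1
    list_file=$2
    : > "$list_file"                  # start with an empty list
    for page in "$scan_dir"/*.png; do
        if [ -f "$page" ]; then       # skips the literal pattern when nothing matches
            printf '%s\n' "$page" >> "$list_file"
        fi
    done
}
```

With the scans in a directory called scans, make_file_list scans file-list produces the list, and ocroscript recognize @file-list > text.html then runs as shown earlier.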
