OpenOffice > Export to PDF, then Kooka (or another scanner program using SANE), then GOCR, then OpenOffice to tidy up the text file. I'm not sure how you will do the last bit; it depends on what is required.
>
> Yes, the words are text!
> But some of the words in the files are inside a picture box!
> That is where I am having trouble!
Ah! I was wondering why you didn't just save the PDF as text directly, but
text that is part of an image makes for a whole new ball game, because it
has to be extracted from the image itself. The only solutions I can think
of all involve running OCR on the source as an image, not as a text document,
so this will be an interesting answer!
Re 2: AFAIK you can only scan single pages in Linux, not in bulk. So you either need to print out the PDF and scan each page separately, or convert the PDF to single-page images.
In practice, to extract the text from a PDF I would never go this route; I would simply extract the text directly as a text file and any images as separate images and then reconstitute them.
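For what it's worth, here is a rough sketch of that direct route in Python. It assumes the poppler-utils command-line tools `pdftotext` and `pdfimages` are installed, which nobody in this thread has confirmed for your system. The function names and the `-img` file prefix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def extraction_commands(pdf, out_dir):
    """Build the pdftotext and pdfimages command lines for one PDF."""
    pdf = Path(pdf)
    out_dir = Path(out_dir)
    text_file = out_dir / (pdf.stem + ".txt")
    image_prefix = out_dir / (pdf.stem + "-img")
    return [
        # -layout keeps the text roughly where it sat on the page
        ["pdftotext", "-layout", str(pdf), str(text_file)],
        # -png writes the embedded images as PNG instead of the default PPM/PBM
        ["pdfimages", "-png", str(pdf), str(image_prefix)],
    ]


def extract(pdf, out_dir):
    """Run both commands; raises if a tool is missing or exits with an error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in extraction_commands(pdf, out_dir):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(cmd[0] + " not found (assumed to come from poppler-utils)")
        subprocess.run(cmd, check=True)
```

Calling `extract("report.pdf", "out")` would leave the text in one file and the embedded pictures as separate image files next to it. Only the pictures that contain words would then need OCR.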
There is now the option in OpenOffice of adding the Sun PDF extension which allows you to load a PDF in Draw and create an ODT file directly from it.
So one reason why you may be having difficulties is that there is no longer any reason for most people to take the route you are taking.
How do I turn a PDF into a picture (e.g. JPG, PNG, etc.)?
I'm copying the text that is in the documents to a txt file, and just taking the pictures in the file and moving them to a PDF.
Now PDF to txt!
So: PDF -> (pic) -> OCR or IRS -> txt
The simplest way of creating an image from a PDF is to open it in a viewer and take a screenshot. How well the result works with OCR will depend on your screen resolution. Alternatively, print it out and then scan it as an image, not as text.
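If you would rather not depend on screen resolution, the PDF -> picture -> OCR -> txt chain can also be scripted. The sketch below is only one possible way to do it: it assumes `pdftoppm` (from poppler-utils) and `gocr` are both installed, and the 300 dpi default is a guess at a reasonable OCR resolution rather than a tested value. The function names and the `page` file prefix are made up for illustration.

```python
import subprocess
from pathlib import Path


def render_command(pdf, prefix, dpi=300):
    """Command that renders every page of the PDF to its own image file."""
    # -gray makes pdftoppm write PGM files, a format GOCR reads directly
    return ["pdftoppm", "-r", str(dpi), "-gray", str(pdf), str(prefix)]


def ocr_command(image, text_file):
    """Command that runs GOCR on one page image and writes plain text."""
    return ["gocr", "-i", str(image), "-o", str(text_file)]


def pdf_to_text(pdf, work_dir, dpi=300):
    """Render the PDF page by page, OCR each page, return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"
    subprocess.run(render_command(pdf, prefix, dpi), check=True)
    pages = []
    # pdftoppm names its output page-<number>.pgm; the numbers are expected to
    # be zero-padded, so a plain sort should keep the pages in order
    for image in sorted(work_dir.glob("page-*.pgm")):
        text_file = image.with_suffix(".txt")
        subprocess.run(ocr_command(image, text_file), check=True)
        pages.append(text_file.read_text(errors="replace"))
    return "\n".join(pages)
```

Something like `print(pdf_to_text("scan.pdf", "work"))` would then print the recognised text. The output will still need tidying by hand, since OCR quality depends heavily on the source.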