Need tool to convert PDF to text

I am using SuSE-11.1 (32bit). One of the tasks I need to do is to convert articles from the gazette of commerce from PDF to plain text. I tried to do this with pdftotext from the xpdf package. However, the PDF input is difficult to convert: pdftotext is losing spaces between words and adding spurious spaces in other places.

Question: do you know of other free open source tools which I could try?

Not sure - Something I have thought about myself. But you can use Okular and select text in there and paste to a text file.

Thanks for the pointer. I forgot to say: I am one of those old fashioned command line guys. My app will run daily as a cron job.
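
For a cron-driven setup like the one described, the conversion step could be wrapped roughly as follows. This is only a sketch: the function names and directory arguments are made up for illustration, and it assumes the installed pdftotext accepts the -layout option (which tries to preserve the physical layout and sometimes helps with spacing, though it may not cure this particular PDF).

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for one pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # Assumption: -layout is supported by the installed pdftotext.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_directory(in_dir, out_dir):
    """Convert every *.pdf in in_dir to a .txt file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext fails,
        # so a broken run shows up in the cron mail.
        subprocess.run(pdftotext_command(pdf, target), check=True)
        converted.append(target)
    return converted
```

A daily cron entry would then only need to call convert_directory() on the download directory; swapping in a different converter later means changing pdftotext_command() alone.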

vodoo wrote:
> Thanks for the pointer. I forgot to say: I am one of those old fashioned
> command line guys. My app will run daily as a cron job.

after the cron couldn’t you run a sed/script to strip out unneeded
spaces and a spell checker to add spaces into run together words…
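
A minimal sketch of the space-stripping half of that idea, written in Python rather than sed. The function name tidy_spaces is made up for illustration, and the rules are an assumption about what "unneeded spaces" means here. It does not attempt the harder part, splitting run-together words, which would need a dictionary or spell checker.

```python
import re


def tidy_spaces(text):
    """Collapse runs of blanks and drop blanks before punctuation."""
    cleaned_lines = []
    for line in text.splitlines():
        # Runs of spaces or tabs become a single space.
        line = re.sub(r"[ \t]+", " ", line)
        # Remove a blank that ended up in front of punctuation.
        line = re.sub(r" ([,.;:!?])", r"\1", line)
        # Drop leading and trailing blanks on each line.
        cleaned_lines.append(line.strip())
    return "\n".join(cleaned_lines)
```

It could be applied to each converted text file right after the conversion step in the same cron job.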

i have Adobe Reader 8 for Linux installed (9 is available)…it has a
button to “Save as Text”…i’ve looked at “acroread -man” in a
terminal but do not see a command line switch to do the same, but it
MUST be available from somewhere, somehow…if so you could pipe
through to happiness…

the stock reader has a -toPostScript switch…do you have something
that converts PS direct to text?

and, there is a save “as rich text format” plug-in…

i suspect a visit to the Adobe site and/or community would be worth
your time…

i bet this problem has been solved before (you might try a google)…


goldie
Give a hacker a fish and you feed him for a day.
Teach man and you feed him for a lifetime.

goldie wrote:

> vodoo wrote:
>> Thanks for the pointer. I forgot to say: I am one of those old fashioned
>> command line guys. My app will run daily as a cron job.
>
>
> after the cron couldn’t you run a sed/script to strip out unneeded
> spaces and a spell checker to add spaces into run together words…
>
> i have Adobe Reader 8 for Linux installed (9 is available)…it has a
> button to “Save as Text”…i’ve looked at “acroread -man” in a
> terminal but do not see a command line switch to do the same, but it
> MUST be available from somewhere, somehow…if so you could pipe
> through to happiness…
>
> the stock reader has a -toPostScript switch…do you have something
> that converts PS direct to text?
>
> and, there is a save “as rich text format” plug-in…
>
> i suspect a visit to the Adobe site and/or community would be worth
> your time…
>
> i bet this problem has been solved before (you might try a google)…

There is a slight difference between the Acrobat plugin for Firefox and the
standalone reader in that the plugin restricts you to saving as pdf while
the standalone reader has the “save as text” option. PITA on downloads!


Will Honea

Try ‘pdfedit’: it has an option to save the file as text.
Webpin

Thanks to everyone who helped.

@zmdmw52: pdfedit is a very interesting app. It does a much better job than pdftotext. I still have to figure out how to run the conversion from the command line. This seems possible but I’m struggling with the syntax.

You might also want to try pdfsam - that has a lot of cmd-line options, IIRC.

Here are 2 screenshots of the PDF-related packages on the Debian-based Linux Mint 7 (on my laptop); many of them should be available for openSUSE as well. Webpin or the software search on openSUSE should give an indication.

[1] http://www.imagebam.com/image/149a6c47567379

[2] http://www.imagebam.com/image/fc2ab947567382

Have a look at “Convert PDF to Word (DOC) — 100% Free!”, a free online service.

Sorry, second pict is incorrect, but can’t edit that post. Will update the correct pict later.

These are the complete package lists; … throwing in the kitchen sink here, but I think pdfsam & pstotext should support command-line options.

(cont’d from earlier reply above)

[2] http://www.imagebam.com/image/ca007947709459

[3] http://www.imagebam.com/image/a622aa47708994

Hi

Just thought I would post my solution and say thanks for your valuable input. The pointer given by zmdmw52 for pdfedit put me on the right track. The very friendly developers of pdfedit have created a standalone version of pdf_to_text linked to their (much better) libraries. It compiles well on SuSE-11.1 and will make it into the CVS repo of pdfedit soon (well, that’s what I hope).

To sum it up: it works. Problem solved.

Hi buddy, that is not a tough job. You just need to download an image converter; they are ubiquitous on the internet. Follow the steps on the page and the conversion should be finished.