Okular and diacritic marks

I have a problem with diacritic marks and Okular on openSUSE 13.1 (64-bit). I am trying to copy text written in Czech. All the diacritic marks are displayed correctly, but when I paste the copied text the “ž” comes out as a “t”. The other diacritic marks seem to be recognized (ě,š,č,ř,ý,á,í,é). Does anyone have an idea what causes this problem or, better, how to solve it?

I forgot to mention a few details that may be important. The file I tried to copy from is a PDF document. Okular is version 0.17.5, KDE 4.11.5.

The obvious explanation is that the PDF was not encoded in utf8. The point about PDFs is that they display whatever was originally written, independently of what you have on your computer, so you can read a document containing fonts which you do not have. But if the fonts used are not utf8 fonts, anything you copy will still be interpreted as utf8, and some characters can come out wrong.

One possibility is that the original encoding was ISO-8859-2, which has ž at a position that Latin1 (ISO-8859-1) uses for a different character; Latin1, in turn, lines up with the first 256 code points of Unicode.
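To make the positions concrete, here is a small Python sketch (not from the original posters) that prints the bytes each encoding uses for ž and ţ. The helper name `byte_positions` is made up for illustration; the codec names are Python's standard ones. It only shows where the letters sit and does not prove what went wrong in this particular PDF.

```python
def byte_positions(ch, encodings=("iso8859_2", "latin_1", "utf_8")):
    """Return {encoding: hex bytes, or None if the encoding lacks the character}."""
    positions = {}
    for enc in encodings:
        try:
            positions[enc] = ch.encode(enc).hex()
        except UnicodeEncodeError:
            # The encoding has no slot for this character at all.
            positions[enc] = None
    return positions

for ch in "žţ":
    print(ch, byte_positions(ch))
```

ISO-8859-2 uses a single byte for each letter (be for ž, fe for ţ), Latin1 has neither letter, and UTF-8 needs two bytes for each.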

Thanks for your answer. That sounds logical. I didn’t know that it works this way. Is there a way to copy the text to LibreOffice or a text editor with the right interpretation? It is an ebook, so correcting letters by hand is out of the question.

AFAIK you cannot do it using the clipboard, but try setting Tools>Encoding>Central European>ISO 8859-2 in Kate and then getting the text into Kate. Save the file, then open LibreOffice and select Text Encoded to open it; this will open a dialogue asking you to specify the encoding.

Of course, if you are able to print to file from the PDF you may be able to skip Kate and go straight to LibreOffice.

I am guessing ISO 8859-2; so you may have to try the other Central European encodings in Kate.
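Trying encodings one by one can also be scripted. This is a rough Python sketch, assuming the damage is a simple mix-up between two single-byte Central European encodings (which may well not be the case for this PDF). `reinterpret` and the `CANDIDATES` shortlist are illustrative names, not part of Kate or LibreOffice.

```python
CANDIDATES = ["iso8859_2", "cp1250", "cp852"]  # assumed shortlist, extend as needed

def reinterpret(text, encodings=CANDIDATES):
    """Encode text with one candidate and decode it with another, for every pair."""
    results = {}
    for src in encodings:
        for dst in encodings:
            if src == dst:
                continue
            try:
                results[(src, dst)] = text.encode(src).decode(dst)
            except (UnicodeEncodeError, UnicodeDecodeError):
                # This pair cannot represent the text, so record no result.
                results[(src, dst)] = None
    return results

for pair, out in reinterpret("ţe").items():
    print(pair, repr(out))
```

If none of the printed results reads “že”, the problem is not a plain mix-up between these candidates, and re-saving in another encoding will not help.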

I tried it with Kate but I don’t get the result I had hoped for. Maybe I am doing something wrong.
If you have the time you could take a look for yourself:
http://www.gchd.cz/download/knihy/povidky_malostranske.pdf

On page 5 of the document you could check the 2nd word of line 1 (and, of course, every other word with a “ž”).

I did miss a thing, though. The “ž” is not interpreted as a “t” but a “ţ”. I didn’t notice because the spellchecker underlined the word. Maybe I can find another source for the text (it is old) but I’d still be interested in a solution on how to handle this problem.

Thanks again for your suggestions!

I suspect there is no solution other than to export the whole PDF as text and do a search and replace ţ with ž. I have tried all the available encodings in Kate, and 852 in LibreOffice; utf8 gives the best results apart from this one character. I did find this page http://luki.sdf-eu.org/txt/cs-encodings-faq.html which explained the origin of the problem you are facing, but I could not find a solution within the explanations given (that I hadn’t already tried).
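The search and replace can be done in any editor, but for a whole book a few lines of Python may be quicker. This is a minimal sketch, assuming the exported text is UTF-8. The lowercase ţ to ž swap is the one described here; the uppercase pair is my own assumption, and `fix_text` and `fix_file` are illustrative names.

```python
from pathlib import Path

# Lowercase ţ -> ž comes from the thread; the uppercase pair is an assumption.
FIXES = str.maketrans({"ţ": "ž", "Ţ": "Ž"})

def fix_text(text):
    """Replace the wrongly extracted t-with-cedilla by z-with-caron."""
    return text.translate(FIXES)

def fix_file(src, dst, encoding="utf-8"):
    """Read src, swap the characters, and write the result to dst."""
    fixed = fix_text(Path(src).read_text(encoding=encoding))
    Path(dst).write_text(fixed, encoding=encoding)

print(fix_text("Cítíme, ţe jsme v místnosti zcela uzavřené."))
```

A call such as `fix_file("book.txt", "book_fixed.txt")` (hypothetical file names) would convert a whole export. The blanket replacement is only safe because ţ is not a Czech letter, so every occurrence can be assumed to be a damaged ž.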

On 2014-07-06 11:46, Montymo wrote:
>
> I tried it with Kate but I don’t get the result I had hoped for. Maybe I
> am doing something wrong.
> If you have the time you could take a look for yourself:
> http://www.gchd.cz/download/knihy/povidky_malostranske.pdf
>
> On page 5 of the document you could check the 2nd word of line 1 (of
> course every other word with a “ž”).

I opened the file with “pdfedit”, then saved to text. The result was,
besides the formatting, correct UTF-8, as far as I can see.

Or, you can directly open the PDF in LibreOffice. I just did, and the
result looks very good.

I also opened it using “Calibre” (I have version 1.18). This one will
convert to epub, which is a very good format for viewing on readers.

I don’t understand the language in the book, but I would probably import
it into LibreOffice, reformat it to get automatic page reflowing, then
save as …doc, and import that file into Calibre. This way you get an epub
file which will reflow automatically on small-screen devices such as
ebook readers.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

That’s what I did! Thanks for trying and for the link. Looks complicated. In the end it wasn’t a big deal because the “ţ” is not part of the Czech alphabet. It would’ve been impossible if it had been a “t” like I assumed at first.

@robin_listas
The devil is in the details. It looks okay but it isn’t. We were talking about one letter of the Czech alphabet but that is enough to not be recognized by online dictionaries. I am not a native speaker either but a learner. The purpose has been to copy parts of the book and combine them with the corresponding audio.

On 2014-07-07 11:26, Montymo wrote:
>
> john_hudson;2652672 Wrote:
>> I suspect there is no solution other than to export the whole PDF as
>> text and do a search and replace ţ with ž.
>
> That’s what I did! Thanks for trying and for the link. Looks
> complicated. In the end it wasn’t a big deal because the “ţ” is not
> part of the Czech alphabet. It would’ve been impossible if it had been a
> “t” like I assumed at first.
>
> @robin_listas
> The devil is in the details. It looks okay but it isn’t. We were talking
> about one letter of the Czech alphabet but that is enough to not be
> recognized by online dictionaries. I am not a native speaker either but
> a learner. The purpose has been to copy parts of the book and combine
> them with the corresponding audio.
>
>

This phrase?

“Cítíme, ţe jsme v místnosti zcela uzavřené. Čirá, hluboká kolem”

It does not read “ž” but “ţ”. Is that the error you mean?

Even acrobat gets it wrong, either by pasting or conversion to txt.

I have done a little experiment.

First, convert the pdf to postscript, in acrobat, via printing to file.
This file, before further use, has to be cleaned, using ps2ps (like
ps2ps p1.ps p2.ps). This second file you can work with.

However, I don’t see a tool to convert postscript to UTF text, only to
ascii text. That paragraph results thus:

“Ci’ti’me, z^e jsme v mi’stnosti zcela uzavr^ene’. C^ira’,
hluboka’ kolem na’s tma, ani nejmens^i’ skulinou nevnika’ odnikud
s^ero ; vs^ude jen”

Which I think unreadable (and not only because I can’t read the language).
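That ASCII notation could in principle be turned back into proper letters. This is a rough Python sketch, assuming (from this one sample only) that an apostrophe after a letter stands for an acute accent and a caret for a caron. `restore_accents` is an illustrative name, and other marks, such as the ring in ů, are not handled.

```python
import re
import unicodedata

# Assumed notation, inferred from the sample only: an apostrophe after a
# letter means acute, a caret means caron.
MARKS = {"'": "\u0301", "\u2019": "\u0301", "^": "\u030c"}
PATTERN = re.compile("([A-Za-z])(['\u2019^])")

def restore_accents(text):
    """Merge each letter+mark pair into one precomposed character (NFC)."""
    def combine(match):
        letter, mark = match.groups()
        return unicodedata.normalize("NFC", letter + MARKS[mark])
    return PATTERN.sub(combine, text)

print(restore_accents("Ci\u2019ti\u2019me, z^e jsme v mi\u2019stnosti zcela uzavr^ene\u2019."))
```

For the sample line this gives “Cítíme, že jsme v místnosti zcela uzavřené.” Real apostrophes or closing quotes that directly follow a letter would be converted too, so the result still needs proofreading.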

What other tool can be used to convert postscript to text?

Notice that evince and okular are unable to display this postscript file;
only “gv” can do it. This is a problem with acrobat output, though. And
I have not found a way to copy and paste from ‘gv’.

Maybe defining a plain text printer in CUPS, and printing to it - but
based on the results of “ps2ascii” I’m afraid it will not work, either.
But in your case, I would experiment.

I tried conversion of that pdf to epub, with calibre. It does convert,
but makes the same error. You could ask on the calibre help forum, they
are nice chaps and might know if the problem is correctable.

And anyway, the automated conversion to epub doesn’t produce a nice
result: lines do not reflow. I don’t like it.

Ah, wait! Activating heuristics produces a very much nicer result,
regarding reflow of lines. The “ž” vs “ţ” error is still there,
though… so you could ask them. I would :-)

Mmmm… The man page for ps2ascii mentions the font problem. And it
suggests using pstotext instead, which I do not have; but I see some
repos having it, so you could try and report back ;-)

(I did try, briefly, and the text output was unreadable (got “Non-ISO
extended-ASCII text”). Possibly I did not find the correct command line)


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)