
Thread: Okular and diacritic marks

  1. #1

    Okular and diacritic marks

    I have a problem with diacritic marks in Okular on openSUSE 13.1 (64-bit). I am trying to copy text written in Czech. All the diacritic marks are displayed correctly, but when I copy the text, the "ž" comes out as a "t". The other diacritic marks seem to be recognized (ě, š, č, ř, ý, á, í, é). Does anyone have an idea what causes this or, better, how to solve it?

  2. #2

    Re: Okular and diacritic marks

    I forgot to mention a few details that may be important. The file I tried to copy from is a PDF document. Okular is version 0.17.5, KDE 4.11.5.

  3. #3
    Join Date
    Jun 2008
    Location
    West Yorkshire, UK
    Posts
    3,430

    Re: Okular and diacritic marks

    The obvious explanation is that the text in the PDF was not encoded in utf8. The point about PDFs is that they display whatever was originally written, independently of what you have on your computer, so you can read a document containing fonts which you do not have. If the fonts used do not carry a utf8 (Unicode) encoding, anything you copy will still be interpreted as utf8, and some characters can come out wrong.

    One possibility is that the original encoding was ISO-8859-2, which has ž at a position that means something else in Latin1 (ISO-8859-1), whose code points match the first 256 of Unicode.
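To make the "different position" point concrete, here is a minimal Python sketch (not part of the original post) that prints how ž is stored in a few encodings. It relies only on Python's built-in codecs; the helper name `byte_for` is made up for illustration, and nothing here says which encoding the PDF in question really uses.

```python
def byte_for(ch, encoding):
    """Return the bytes that represent ch in the given encoding, or None if it has no slot."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        # The character does not exist in this encoding at all.
        return None


# ISO-8859-2 and Windows-1250 both contain ž, but at different byte values;
# Latin-1 has no ž; UTF-8 needs two bytes for it.
for enc in ("iso8859_2", "cp1250", "latin_1", "utf_8"):
    print(enc, byte_for("\u017e", enc))

# The byte ISO-8859-2 uses for ž is a different character in Latin-1.
print(repr("\u017e".encode("iso8859_2").decode("latin_1")))
```

The same byte read under the wrong table yields a different letter, which is the kind of mismatch suspected above.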

  4. #4

    Re: Okular and diacritic marks

    Thanks for your answer. That sounds logical. I didn't know that it works this way. Is there a way to copy the text to LibreOffice or a text editor with the right interpretation? It is an ebook, so correcting the letters by hand is out of the question.

  5. #5
    Join Date
    Jun 2008
    Location
    West Yorkshire, UK
    Posts
    3,430

    Re: Okular and diacritic marks

    AFAIK you cannot do it using the clipboard alone, but try setting Tools>Encoding>Central European>ISO 8859-2 in Kate and then pasting the text into Kate. Save the file, then open LibreOffice and select Text Encoded to open it; this will open a dialogue asking you to specify the encoding.

    Of course, if you are able to print to file from the PDF you may be able to ignore Kate and go straight to LibreOffice.

    I am guessing ISO 8859-2; so you may have to try the other Central European encodings in Kate.
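Trying the encodings one by one can also be done outside Kate. The following is only a rough sketch, assuming the text has been saved as raw bytes; the candidate list and the function name `decode_all` are illustrative choices, not something Kate or LibreOffice provides.

```python
# Central European candidates plus UTF-8; the list is an assumption, extend it as needed.
CANDIDATES = ("iso8859_2", "cp1250", "cp852", "utf_8")


def decode_all(raw):
    """Decode raw bytes under each candidate encoding; None marks a failed decode."""
    results = {}
    for enc in CANDIDATES:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Demo input: the word "že" stored as ISO-8859-2 bytes.
sample = "\u017ee".encode("iso8859_2")
for enc, text in decode_all(sample).items():
    print(enc, repr(text))
```

Only the reading that produces sensible Czech words is the right one; a decode that "succeeds" can still be the wrong table.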

  6. #6

    Re: Okular and diacritic marks

    I tried it with Kate but I don't get the result I had hoped for. Maybe I am doing something wrong.
    If you have the time you could take a look for yourself:
    http://www.gchd.cz/download/knihy/po...lostranske.pdf

    On page 5 of the document you could check the 2nd word of line 1 (of course every other word with a "ž").

    I did miss a thing, though. The "ž" is not interpreted as a "t" but a "ţ". I didn't notice because the spellchecker underlined the word. Maybe I can find another source for the text (it is old) but I'd still be interested in a solution on how to handle this problem.

    Thanks again for your suggestions!

  7. #7
    Join Date
    Jun 2008
    Location
    West Yorkshire, UK
    Posts
    3,430

    Re: Okular and diacritic marks

    I suspect there is no solution other than to export the whole PDF as text and do a search and replace ţ with ž. I have tried all the available encodings in Kate, and 852 in LibreOffice; utf8 gives the best results apart from this one character. I did find this page, http://luki.sdf-eu.org/txt/cs-encodings-faq.html, which explains the origin of the problem you are facing, but I could not find a solution in it that I hadn't already tried.
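The search and replace can be scripted once the text has been exported. This is a minimal Python sketch, assuming the wrong character is U+0163 (t with cedilla) or its capital U+0162, and that the text contains no genuine ţ that should be kept; the function name `fix_zcaron` is made up for illustration.

```python
# U+0163 / U+0162 are t / T with cedilla; U+017E / U+017D are z / Z with caron.
FIXES = str.maketrans({"\u0163": "\u017e", "\u0162": "\u017d"})


def fix_zcaron(text):
    """Replace every t-with-cedilla by the z-with-caron it was meant to be."""
    return text.translate(FIXES)


print(fix_zcaron("C\u00edt\u00edme, \u0163e jsme v m\u00edstnosti zcela uzav\u0159en\u00e9."))
```

To process a whole file, read it with `open(path, encoding="utf-8")`, pass the contents through `fix_zcaron`, and write the result to a new file so the original export stays untouched.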

  8. #8
    Join Date
    Feb 2009
    Location
    Spain
    Posts
    25,547

    Re: Okular and diacritic marks

    On 2014-07-06 11:46, Montymo wrote:
    >
    > I tried it with Kate but I don't get the result I had hoped for. Maybe I
    > am doing something wrong.
    > If you have the time you could take a look for yourself:
    > http://www.gchd.cz/download/knihy/po...lostranske.pdf
    >
    > On page 5 of the document you could check the 2nd word of line 1 (of
    > course every other word with a "ž").


    I opened the file with "pdfedit", then saved to text. The result was,
    besides the formatting, correct UTF-8, as far as I can see.

    Or, you can directly open the PDF in LibreOffice. I just did, and the
    result looks very good.

    I also opened it using "Calibre" (I have version 1.18). This one will
    convert to epub, which is a very good format for viewing on readers.

    I don't understand the language in the book, but I would probably import
    to libreoffice, reformat, to get automatic page reflowing, then save as
    .doc, and import that into Calibre. This way you get an epub file
    which will reflow automatically on display on small screen devices such
    as ebook readers.

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 13.1 x86_64 "Bottle" at Telcontar)

  9. #9

    Re: Okular and diacritic marks

    Originally posted by john_hudson:
    > I suspect there is no solution other than to export the whole PDF as text and do a search and replace ţ with ž.
    That's what I did! Thanks for trying and for the link. Looks complicated. In the end it wasn't a big deal because the "ţ" is not part of the Czech alphabet. It would've been impossible if it had been a "t" like I assumed at first.

    @robin_listas
    The devil is in the details. It looks okay but it isn't. We were talking about one letter of the Czech alphabet, but that is enough for a word not to be recognized by online dictionaries. I am not a native speaker either, but a learner. The purpose has been to copy parts of the book and combine them with the corresponding audio.

  10. #10
    Join Date
    Feb 2009
    Location
    Spain
    Posts
    25,547

    Re: Okular and diacritic marks

    On 2014-07-07 11:26, Montymo wrote:
    >
    > john_hudson;2652672 Wrote:
    >> I suspect there is no solution other than to export the whole PDF as
    >> text and do a search and replace ţ with ž.
    >
    > That's what I did! Thanks for trying and for the link. Looks
    > complicated. In the end it wasn't a big deal because the "ţ" is not
    > part of the Czech alphabet. It would've been impossible if it had been a
    > "t" like I assumed at first.
    >
    > @robin_listas
    > The devil is in the details. It looks okay but it isn't. We were talking
    > about one letter of the Czech alphabet but that is enough to not be
    > recognized by online dictionaries. I am not a native speaker either but
    > a learner. The purpose has been to copy parts of the book and combine
    > them with the corresponding audio.
    >
    >


    This phrase?

    "Cítíme, ţe jsme v místnosti zcela uzavřené. Čirá, hluboká kolem"

    It does not read "ž" but "ţ". Is that the error you mean?

    Even acrobat gets it wrong, either by pasting or conversion to txt.


    I have done a little experiment.

    First, convert the pdf to postscript, in acrobat, via printing to file.
    This file, before further use, has to be cleaned, using ps2ps (like
    ps2ps p1.ps p2.ps). This second file you can work with.

    However, I don't see a tool to convert postscript to UTF text, only to
    ascii text. That paragraph results thus:


    "Ci'ti'me, z^e jsme v mi'stnosti zcela uzavr^ene'. C^ira',
    hluboka' kolem na's tma, ani nejmens^i' skulinou nevnika' odnikud
    s^ero ; vs^ude jen"


    Which I think unreadable (and not only because I can't read the language).
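That ASCII output can probably be turned back into accented letters, at least roughly. The sketch below is an assumption based only on the sample above: it treats an apostrophe after a vowel as an acute accent and a caret after a letter as a caron, then lets Unicode normalization compose the characters. The function name `restore_diacritics` is made up, genuine apostrophes after vowels would be converted by mistake, and other marks (such as the ring in ů) are not handled because the sample does not show how ps2ascii writes them.

```python
import re
import unicodedata

ACUTE = "\u0301"  # combining acute accent
CARON = "\u030c"  # combining caron


def restore_diacritics(text):
    """Turn ASCII fallbacks like z^ and i' into composed letters like ž and í."""
    # A vowel followed by an apostrophe becomes vowel + combining acute.
    text = re.sub(r"([AEIOUYaeiouy])'", lambda m: m.group(1) + ACUTE, text)
    # Any letter followed by a caret becomes letter + combining caron.
    text = re.sub(r"([A-Za-z])\^", lambda m: m.group(1) + CARON, text)
    # NFC merges each base letter with its combining mark where a composed form exists.
    return unicodedata.normalize("NFC", text)


print(restore_diacritics("Ci'ti'me, z^e jsme v mi'stnosti zcela uzavr^ene'. C^ira',"))
```

The result still needs proofreading against the PDF, since the mapping is guessed from one paragraph.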


    What other tool can be used to convert postscript to text?


    Notice that evince and okular are unable to display this postscript file;
    only "gv" can do it. This is a problem with acrobat output, though. And
    I have not seen paste possible from 'gv'.


    Maybe defining a plain text printer in CUPS, and printing to it - but
    based on the results of "ps2ascii" I'm afraid it will not work, either.
    But in your case, I would experiment.


    I tried conversion of that pdf to epub, with calibre. It does convert,
    but makes the same error. You could ask on the calibre help forum, they
    are nice chaps and might know if the problem is correctable.

    And anyway, the automated conversion to epub doesn't produce a nice
    result: lines do not reflow. I don't like it.

    Ah, wait! Activating heuristics produces a very much nicer result,
    regarding reflow of lines. The "ž" vs "ţ" error is still there,
    though... so you could ask them. I would :-)



    Mmmm... The man page for ps2ascii mentions the font problem. And it
    suggests using pstotext instead, which I do not have; but I see some
    repos having it, so you could try and report back ;-)


    (I did try, briefly, and the text output was unreadable (got "Non-ISO
    extended-ASCII text"). Possibly I did not find the correct command line)

    --
    Cheers / Saludos,

    Carlos E. R.
    (from 13.1 x86_64 "Bottle" at Telcontar)
