
Thread: Python 3.4 and unicode surrogates

  1. #21
    Join Date
    Jun 2008
    Location
    Netherlands
    Posts
    25,245

    Default Re: Python 3.4 and unicode surrogates

    Quote Originally Posted by sparkz_alot View Post
    Thanks wolfi, just checked and yes, it is set for utf-8. As far as I can tell, everything is utf-8 (Linux/openSUSE default?). It's a puzzler. I just can't seem to display anything beyond U+FFFF; all I get is an empty square box. I even tried from the terminal:
    Code:
    Yes Master> echo -e '\U1f315'
    Yes Master> ๏ฟฝ
    Yes Master> echo -e '\U0001f315'
    Yes Master> ๏ฟฝ
    I'm beginning to think it's a conspiracy of international proportions.
    I have no problems:
    Code:
    henk@boven:~> echo -e '\u0905'
    अ
    henk@boven:~>
    which is correct. I assume that your case is also correct, because the output is interpreted as one character only. The only thing is that you do not seem to have a font in that application (terminal emulator) that contains the character. I have.
    BTW, it could be that you will NOT see the character I see on my screen and in this post, because you may not have the font that contains this character in your browser.
    Henk van Velden
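
One way to test the missing-font theory is to ask fontconfig which installed fonts claim to cover a given code point. The sketch below assumes the `fc-list` tool is on the PATH and that a `:charset=` pattern is accepted by the installed fontconfig version; the helper names `charset_pattern` and `fonts_covering` are made up for this illustration and do not come from the thread.

```python
import shutil
import subprocess


def charset_pattern(code_point):
    """Build a fontconfig pattern matching fonts that cover code_point."""
    return ":charset={:x}".format(code_point)


def fonts_covering(code_point):
    """Return family names reported by fc-list, or None if fc-list is missing."""
    fc_list = shutil.which("fc-list")
    if fc_list is None:
        return None  # fontconfig tools are not installed (assumption: Linux desktop)
    proc = subprocess.Popen(
        [fc_list, charset_pattern(code_point), "family"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    output, _ = proc.communicate()
    # fc-list prints one family per line; drop blanks and duplicates.
    return sorted({line.strip() for line in output.splitlines() if line.strip()})


# U+0905 is the character from the post above, U+1F315 the one the OP wants.
for cp in (0x0905, 0x1F315):
    print("U+{:04X}".format(cp), fonts_covering(cp))
```

An empty list for a code point would suggest that no installed font has the glyph, which would match the "empty square box" symptom. It would not prove that a particular terminal or IDE actually picks up a font that does have it.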

  2. #22
    Join Date
    Sep 2012
    Posts
    5,185

    Default Re: Python 3.4 and unicode surrogates

    Quote Originally Posted by sparkz_alot View Post
    I even tried from the terminal
    You mean the Linux text mode console? This is not going to work: a text mode font has space for only 512 characters, so anything it does not know about is printed as this funny question mark.

  3. #23
    Join Date
    Sep 2012
    Posts
    5,185

    Default Re: Python 3.4 and unicode surrogates

    Quote Originally Posted by hcvv View Post
    I have no problems:
    But according to the OP, the problems are with characters above U+FFFF, and your example shows a character below it. Did you try the same echo -e '\U1f315'?

  4. #24
    Join Date
    Jun 2008
    Location
    Netherlands
    Posts
    25,245

    Default Re: Python 3.4 and unicode surrogates

    Code:
    henk@boven:~> echo -e '\U1f315'
    🌕
    henk@boven:~>
    which, IMHO shows that the output is correct, but that no font is available to show the glyph FULL MOON SYMBOL.
    The Application interpreted the bytes it got (F09F8C95) and translated that back into U+1F315. Not finding a glyph for that in any of the installed fonts, it created a box with 01F 315 in it. A replacement glyph, but nevertheless correct.

    I originally did not post to this thread because I have no Python knowledge, nor did I understand the word "surrogates" in this context. I only posted later because I saw the OP trying to prove something with echo, where I thought his interpretation was not correct.
    In the meantime I have read more in this thread and I understand that surrogates have something to do with UTF-16. I do know something about Unicode and UTF-8, but, UTF-16 being something of a niche in Linux and certainly not the default encoding used, it may be that my post is not very interesting for the OP's problem.
    Henk van Velden
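
The byte-level claim in this post can be checked from Python. This is only a small sketch: it decodes the four bytes quoted above (F09F8C95) as UTF-8 and looks up the resulting character in the standard `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

raw = bytes.fromhex("F09F8C95")   # the four bytes the application received
text = raw.decode("utf-8")        # decodes to a single character, not four

code_point = ord(text)
print("U+{:04X}".format(code_point), unicodedata.name(text))
# prints: U+1F315 FULL MOON SYMBOL

# Encoding the code point again gives back the same four bytes.
assert chr(code_point).encode("utf-8") == raw
```

So the data itself is intact at this stage; what remains is whether the displaying program has a glyph for it.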

  5. #25
    Join Date
    Mar 2014
    Location
    US
    Posts
    115

    Default Re: Python 3.4 and unicode surrogates

    Quote Originally Posted by hcvv View Post
    which, IMHO shows that the output is correct, but that no font is available to show the glyph FULL MOON SYMBOL.
    The Application interpreted the bytes it got (F09F8C95) and translated that back into U+1F315. Not finding a glyph for that in any of the installed fonts, it created a box with 01F 315 in it. A replacement glyph, but nevertheless correct.

    I originally did not post to this thread because I have no Python knowledge, nor did I understand the word "surrogates" in this context. I only posted later because I saw the OP trying to prove something with echo, where I thought his interpretation was not correct.
    In the meantime I have read more in this thread and I understand that surrogates have something to do with UTF-16. I do know something about Unicode and UTF-8, but, UTF-16 being something of a niche in Linux and certainly not the default encoding used, it may be that my post is not very interesting for the OP's problem.
    Henk van Velden
    Your replies and the others' are always not only interesting but very helpful, and are always appreciated. From what I've read, UTF-16 is the default Unicode encoding for Windows (leave it to them to be contrary). I installed Freefont on the Windows comp and its equivalent (I think), TEXfreefont, on the openSUSE comp. It supposedly supports oodles of Unicode glyphs. No joy though, except... (please see my next post)
    If it ain't broke, I just haven't gotten to it yet.
    openSUSE 42.1 (Leap) (x86_64); Intel Dual Core Proc @ 2.66 GHz; KDE Desktop
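
Since the thread title mentions surrogates, here is a rough sketch of how UTF-16 represents a code point above U+FFFF as a surrogate pair. The function name `to_surrogate_pair` is invented for this example; the arithmetic is the standard UTF-16 rule, cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F315)
print(hex(high), hex(low))
# prints: 0xd83c 0xdf15

# Cross-check against Python's UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F315".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The values 0xD800 to 0xDFFF are reserved for exactly this purpose, which is why they never stand for characters on their own and cannot be written out as UTF-8.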

  6. #26
    Join Date
    Mar 2014
    Location
    US
    Posts
    115

    Default Re: Python 3.4 and unicode surrogates

    So I wanted to check what my new fonts gave me and wrote this small script (sorry if it's not pretty):
    Code:
    from itertools import chain
    #
    # Code points 55296-57343 (0xD800-0xDFFF) are reserved for UTF-16
    # surrogate pairs and cannot be encoded as UTF-8, so they are skipped.
    # Chaining the two ranges covers 0-55295 and 57344-1114111 in one run.
    #
    code_points = chain(range(0, 55296), range(57344, 1114112))
    #
    # open() with an explicit encoding replaces codecs.open in Python 3;
    # the with block closes the file even if a write fails.
    #
    with open("unicode_symbols", "w", encoding="utf-8") as out:
        for a in code_points:
            out.write('Decimal: ' + str(a))
            out.write('  Hex: ' + hex(a))
            out.write('  Binary: ' + bin(a))
            out.write('  Character: ' + chr(a))
            out.write("\n")
    Which gave me (in part):
    Code:
    ...
    Decimal: 127760  Hex: 0x1f310  Binary: 0b11111001100010000  Character: 🌐
    Decimal: 127761  Hex: 0x1f311  Binary: 0b11111001100010001  Character: 🌑
    Decimal: 127762  Hex: 0x1f312  Binary: 0b11111001100010010  Character: 🌒
    Decimal: 127763  Hex: 0x1f313  Binary: 0b11111001100010011  Character: 🌓
    Decimal: 127764  Hex: 0x1f314  Binary: 0b11111001100010100  Character: 🌔
    Decimal: 127765  Hex: 0x1f315  Binary: 0b11111001100010101  Character: 🌕
    Decimal: 127766  Hex: 0x1f316  Binary: 0b11111001100010110  Character: 🌖
    Decimal: 127767  Hex: 0x1f317  Binary: 0b11111001100010111  Character: 🌗
    Decimal: 127768  Hex: 0x1f318  Binary: 0b11111001100011000  Character: 🌘
    Decimal: 127769  Hex: 0x1f319  Binary: 0b11111001100011001  Character: 🌙
    Decimal: 127770  Hex: 0x1f31a  Binary: 0b11111001100011010  Character: 🌚
    Decimal: 127771  Hex: 0x1f31b  Binary: 0b11111001100011011  Character: 🌛
    Decimal: 127772  Hex: 0x1f31c  Binary: 0b11111001100011100  Character: 🌜
    ...
    Lo and behold, there was the elusive full moon. So it's there, I just can't display it on the screen. I am now beginning to suspect it might be the PyCharm IDE I'm using, since it's the only common thread between the Windows comp and the openSUSE comp. My next step will be to write a small script in Notepad (Windows) and KWrite (openSUSE) and see if that makes a difference.

  7. #27
    Join Date
    Sep 2012
    Posts
    5,185

    Default Re: Python 3.4 and unicode surrogates

    Lo and behold, there was the elusive full moon

    I'm confused. This sounds like echoing the UTF-8 sequence from the shell does not work, but outputting the same sequence on the same terminal from Python does. I have a hard time believing it ... but whatever works for you.

  8. #28
    Join Date
    Mar 2014
    Location
    US
    Posts
    115

    Default Re: Python 3.4 and unicode surrogates

    Quote Originally Posted by arvidjaar View Post
    I'm confused. This sounds like echoing the UTF-8 sequence from the shell does not work, but outputting the same sequence on the same terminal from Python does. I have a hard time believing it ... but whatever works for you.
    Actually the little script outputs to a text file, not the terminal. Neither the shell nor Python will output the characters to the screen. So, I'm confused too. Well, openSUSE 13.2 is coming out soon, I'll do a clean install and see if maybe something I did jazzed things up.

  9. #29
    Join Date
    Sep 2012
    Posts
    5,185

    Default Re: Python 3.4 and unicode surrogates

    Quote Originally Posted by sparkz_alot View Post
    Actually the little script outputs to a text file, not the terminal.
    Which of course changes everything, because Python can easily use a different encoding when printing to stdout than when writing to a file. You omit critical details, which makes it pointless to continue playing guessing games. Good luck.
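
The point about stdout and files using different encodings can be inspected directly. The following is a minimal sketch, not a diagnosis of the OP's setup: it prints the encoding attached to `sys.stdout` and the locale-derived default that `open()` falls back to when no `encoding=` is given, then shows that passing the encoding explicitly takes the guesswork out of file output.

```python
import locale
import sys
import tempfile

# Encoding Python uses when printing to the terminal. It can be missing or
# None if stdout has been replaced by an object without that attribute.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Default encoding for open() in text mode when encoding= is not passed.
file_default = locale.getpreferredencoding(False)

print("stdout:", stdout_encoding)
print("open() default:", file_default)

# With an explicit encoding, the file round trip does not depend on either.
with tempfile.TemporaryFile("w+", encoding="utf-8") as handle:
    handle.write("\U0001F315")
    handle.seek(0)
    round_trip = handle.read()

assert round_trip == "\U0001F315"
```

If the two printed values differ, or if the stdout one is not a UTF-8 variant (as can happen inside an IDE console), that alone could explain a script that writes a correct file but cannot print the same character. Setting the `PYTHONIOENCODING` environment variable is one documented way to override the stdout encoding.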

  10. #30
    Join Date
    Mar 2014
    Location
    US
    Posts
    115

    Default Re: Python 3.4 and unicode surrogates

    Quote Originally Posted by arvidjaar View Post
    Which of course changes everything, because Python can easily use a different encoding when printing to stdout than when writing to a file. You omit critical details, which makes it pointless to continue playing guessing games. Good luck.
    Thank you for at least taking the time. Much appreciated.
