Python 3.4 and unicode surrogates

hcvv · October 19, 2014, 12:46pm

sparkz_alot:

Thanks wolfi, just checked and yes, it is set for utf-8. As far as I can tell, everything is utf-8 (Linux/openSuse default?). It’s a puzzler. Just can’t seem to display anything beyond #ffff, just get an empty square box. I even tried from the terminal:
Yes Master> echo -e '\U1f315'
Yes Master> �
Yes Master> echo -e '\U0001f315'
Yes Master> �
I’m beginning to think it’s a conspiracy of international proportions.

I have no problems:

henk@boven:~> echo -e '\u0905'
अ
henk@boven:~>

which is correct. I assume that your case is also correct, because the output is interpreted as one character only. The only thing is that you seem not to have a font in that application (terminal emulator) that contains the character. I have.
BTW, it could be that you will NOT see the character I see on my screen and in this post, because your browser may not have a font that contains this character.

arvidjaar · October 19, 2014, 1:01pm

You mean the Linux text-mode console? This is not going to work: a text-mode font has space for only 512 characters, so anything it does not know about is printed as this funny question mark.

arvidjaar · October 19, 2014, 1:03pm

But according to the OP, the problems are with characters above U+FFFF, and your example shows a character below it. Did you try the same echo -e '\U1f315'?

hcvv · October 19, 2014, 2:23pm

henk@boven:~> echo -e '\U1f315' 
🌕
henk@boven:~>

which, IMHO, shows that the output is correct, but that no font is available to show the glyph FULL MOON SYMBOL.
The application interpreted the bytes it got (F09F8C95) and translated them back into U+1F315. Not finding a glyph for that in any of the installed fonts, it drew a box with 01F 315 in it. A replacement glyph, but nevertheless correct.
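
For anyone who wants to check that byte arithmetic, here is a minimal Python 3 sketch. The variable names are my own and purely illustrative; it only assumes the standard library. It encodes U+1F315 to UTF-8 and then decodes the four bytes back into one code point:

```python
import binascii
import unicodedata

# U+1F315 is the code point discussed above.
moon = chr(0x1F315)

# UTF-8 needs four bytes for any code point above U+FFFF.
utf8_bytes = moon.encode("utf-8")
utf8_hex = binascii.hexlify(utf8_bytes).decode("ascii").upper()
print(utf8_hex)

# Decoding the same four bytes gives back a single code point.
decoded = binascii.unhexlify("F09F8C95").decode("utf-8")
print("U+{:04X}".format(ord(decoded)), unicodedata.name(decoded))
```

If the hex string printed is F09F8C95 and the decoded value is U+1F315, the data is fine and only the font is missing.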

I originally did not post to this thread because I have no Python knowledge, nor did I understand the word “surrogates” in this context. I only posted later because I saw the OP trying to prove something with echo, where I thought his interpretation was not correct.
In the meantime I have read more in this thread and I understand that surrogates have something to do with UTF-16. I do know something about Unicode and UTF-8, but, UTF-16 being something of a niche in Linux and certainly not the default encoding used, it may be that my post is not very interesting for the OP’s problem.
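
As a rough illustration of how surrogates relate to UTF-16, the sketch below (Python 3, standard library only, names made up for the example) shows that U+1F315 becomes two 16-bit units in UTF-16, and that one of those units on its own cannot be written as UTF-8:

```python
import binascii

moon = chr(0x1F315)

# UTF-16 (big-endian, no BOM) stores code points above U+FFFF
# as two 16-bit units, a high and a low surrogate.
utf16_hex = binascii.hexlify(moon.encode("utf-16-be")).decode("ascii").upper()
print(utf16_hex)

# The same pair computed by hand from the code point.
offset = 0x1F315 - 0x10000
high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
print("{:04X} {:04X}".format(high, low))

# A lone surrogate is not a character; strict UTF-8 refuses to encode it.
try:
    chr(high).encode("utf-8")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False
print("lone surrogate encodable as UTF-8:", lone_surrogate_ok)
```

UTF-8 never needs this mechanism, which is probably why it rarely comes up on a Linux system.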

sparkz_alot · October 19, 2014, 6:24pm

hcvv:

which, IMHO, shows that the output is correct, but that no font is available to show the glyph FULL MOON SYMBOL.
The application interpreted the bytes it got (F09F8C95) and translated them back into U+1F315. Not finding a glyph for that in any of the installed fonts, it drew a box with 01F 315 in it. A replacement glyph, but nevertheless correct.

I originally did not post to this thread because I have no Python knowledge, nor did I understand the word “surrogates” in this context. I only posted later because I saw the OP trying to prove something with echo, where I thought his interpretation was not correct.
In the meantime I have read more in this thread and I understand that surrogates have something to do with UTF-16. I do know something about Unicode and UTF-8, but, UTF-16 being something of a niche in Linux and certainly not the default encoding used, it may be that my post is not very interesting for the OP’s problem. Henk van Velden

Your replies, like everyone else’s, are not only interesting but very helpful, and always appreciated. From what I’ve read, UTF-16 is the default Unicode encoding for Windows (leave it to them to be contrary). I installed Freefont on the Windows comp and its equivalent (I think), TEXfreefont, on the openSuse comp. It supposedly supports oodles of Unicode glyphs. No joy though, except… (please see my next post)

sparkz_alot · October 19, 2014, 6:46pm

So I wanted to check what my new fonts gave me and wrote this small script (sorry if it’s not pretty):


import codecs
#
# First run (0-55295) use "w" option, second run (57344-1114111)
# use "a" option
#
file = codecs.open("unicode_symbols", "w", "utf-8")
#
# For the first run change the range values to "0, 55296". Note: the two
# runs skip 55296-57343 (U+D800-U+DFFF), which are reserved as surrogates
# for UTF-16. range() stops one short of its second value.
#
for a in range(57344, 1114112):
    file.write('Decimal: ')
    file.write(str(a))
    file.write('  Hex: ')
    file.write(hex(a))
    file.write('  Binary: ')
    file.write(bin(a))
    file.write('  Character: ')
    file.write(chr(a))
    file.write("\n")
file.close()

Which gave me (in part):


...
Decimal: 127760  Hex: 0x1f310  Binary: 0b11111001100010000  Character: 🌐
Decimal: 127761  Hex: 0x1f311  Binary: 0b11111001100010001  Character: 🌑
Decimal: 127762  Hex: 0x1f312  Binary: 0b11111001100010010  Character: 🌒
Decimal: 127763  Hex: 0x1f313  Binary: 0b11111001100010011  Character: 🌓
Decimal: 127764  Hex: 0x1f314  Binary: 0b11111001100010100  Character: 🌔
Decimal: 127765  Hex: 0x1f315  Binary: 0b11111001100010101  Character: 🌕
Decimal: 127766  Hex: 0x1f316  Binary: 0b11111001100010110  Character: 🌖
Decimal: 127767  Hex: 0x1f317  Binary: 0b11111001100010111  Character: 🌗
Decimal: 127768  Hex: 0x1f318  Binary: 0b11111001100011000  Character: 🌘
Decimal: 127769  Hex: 0x1f319  Binary: 0b11111001100011001  Character: 🌙
Decimal: 127770  Hex: 0x1f31a  Binary: 0b11111001100011010  Character: 🌚
Decimal: 127771  Hex: 0x1f31b  Binary: 0b11111001100011011  Character: 🌛
Decimal: 127772  Hex: 0x1f31c  Binary: 0b11111001100011100  Character: 🌜
...

Lo and behold, there was the elusive full moon. So it’s there, I just can’t display it on the screen. I am now beginning to suspect it might be the PyCharm IDE I’m using, since it’s the only common thread between the Windows comp and the openSuse comp. My next step will be to write a small script in Notepad (Windows) and KWrite (openSuse) and see if that makes a difference.

arvidjaar · October 20, 2014, 8:27am

sparkz_alot:

Lo and behold, there was the elusive full moon

I’m confused. This sounds like echoing the UTF-8 sequence from the shell does not work, but outputting the same sequence on the same terminal from Python does. I have a hard time believing it … but whatever works for you.

sparkz_alot · October 21, 2014, 7:21pm

Actually, the little script outputs to a text file, not the terminal. Neither the shell nor Python will output the characters to the screen. So, I’m confused too. Well, openSUSE 13.2 is coming out soon; I’ll do a clean install and see if maybe something I did jazzed things up.

arvidjaar · October 21, 2014, 8:12pm

Which of course changes everything, because Python can easily use a different encoding when printing to stdout than when writing to a file. You omitted critical details, which makes it pointless to keep playing guessing games. Good luck.
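
To see the difference on a given machine, something like the following Python 3 sketch may help. It is only an illustration: the file name moon_test.txt and the variable names are made up, and the values it prints depend on the locale and on how the script is started (terminal, IDE, pipe).

```python
import locale
import os
import sys
import tempfile

# Encoding print() uses for the terminal; taken from the environment.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding open() falls back to for text files when none is passed.
file_default = locale.getpreferredencoding(False)

print("stdout encoding:", stdout_encoding)
print("open() default: ", file_default)

# Passing the encoding explicitly removes the guesswork for files,
# whatever stdout happens to use.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "moon_test.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as out:
        out.write(chr(0x1F315))
    with open(path, "rb") as raw:
        raw_bytes = raw.read()

print("bytes written:", raw_bytes)
```

A script that opens its file with an explicit "utf-8" always writes UTF-8 bytes, while what reaches the screen still depends on the stdout encoding and on the fonts of whatever displays it.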

sparkz_alot · October 21, 2014, 10:58pm

Thank you for at least taking the time. Much appreciated.