Hi all,
I’m in a bit of a quandary here regarding Unicode surrogates, specifically the Plane 1 characters, though I imagine it’s the same for all planes above 0. Say I want to print the Unicode character for a full moon, which is U+1F315 in Plane 1, and which I can’t use since my program is using UTF-8, not UTF-16. My thought was to use the surrogate pair D83C and DF15. If I run it in IDLE, I get:
>>> print("\ud83C\uDF15")
🌕
>>>
However when I try it in a script, I get:
print("\uD83C\uDD93")
...
Traceback (most recent call last):
  File "C:/Python/Astronomy/Planet_Icons.py", line 46, in <module>
    print("\uD83C\uDD93")
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 0: surrogates not allowed
I’ve tried it as a variable as well:
spam = u"\uD83C\uDD93"
print(spam)
...
Traceback (most recent call last):
  File "C:/Python/Astronomy/Planet_Icons.py", line 44, in <module>
    print (spam)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 0: surrogates not allowed
Am I missing something? Some ‘import’? I’ve spent several hours trying to find the answer on the internets to no avail, so I thought I would turn to the experts here :).
Again, I’m using Python 3.4 and if it matters, I’m using PyCharm’s IDE.
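For reference, the error message points at the surrogates themselves rather than at a missing import. Surrogate code points exist only as a UTF-16 mechanism, and UTF-8 encodes U+1F315 directly as four bytes. A minimal sketch, assuming Python 3.4 or later, of turning a surrogate pair into the real code point (the helper name `join_surrogates` is made up for illustration):

```python
def join_surrogates(text):
    """Merge UTF-16 surrogate pairs in a str into real code points.

    Hypothetical helper, not part of the standard library.
    """
    # 'surrogatepass' lets the lone surrogates through the UTF-16 encoder;
    # decoding the resulting bytes then pairs them up into one code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


pair = "\ud83c\udf15"          # two separate surrogate code points
moon = join_surrogates(pair)   # one code point, U+1F315

print(len(pair), len(moon))    # 2 1
print(moon == "\U0001F315")    # True
print(moon.encode("utf-8"))    # b'\xf0\x9f\x8c\x95'
```

Writing the character as `"\U0001F315"` in the first place avoids the conversion entirely.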
Thanks for the reply arvidjaar. First let me preface this by saying I am new to Python (and by extension, Unicode). Therefore, I’ll refer to two web pages, the first being the Unicode ‘layout’, the second being a brief explanation of how it works (at least in a manner that didn’t give me a migraine):
So I am guessing that, no, I don’t have 32 bit support (by the way, these results are the same on both my windows 8.1 and opensuse computers, two separate machines). Anyway, because they decided they needed more ‘space’ for stuff they came up with UTF-16 and created ‘surrogate pairs’. As I said, the pairs work fine in IDLE, so I have access to the extended sets, but when I try to use it in a script, I get the error. The fact it works in IDLE and not the script leads me to think it’s not a problem with Python, but rather my ignorance.
Well, I have an update, sort of. Apparently both machines, windoze and Linux, and both versions of Python, 3.4 and 2.7, only allow me to decode the range #0000-#ffff. Using the \U, I have to pad with 4 0’s, i.e. \U00002468, but it still won’t let me go beyond 4 characters, so \U0001F315 won’t work. Surrogate pairs work in IDLE but not in a script.
As usual, it’s probably something simple and basic I’m missing, but right now I feel like (Python source code u"\U0001F4A9") so I’m going to call it quits for tonight and try again tomorrow
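On the escape syntax mentioned above: `\u` takes exactly four hex digits and `\U` takes exactly eight, so a Plane 1 character needs three leading zeros. A short sketch, assuming Python 3.3 or later (on older "narrow" builds of Python 2 the length would come out as 2):

```python
import sys

moon = "\U0001F315"            # \U takes exactly eight hex digits
same = chr(0x1F315)            # chr() accepts the code point directly
bmp = "\u2468"                 # \u takes exactly four hex digits

print(moon == same)            # True
print(len(moon))               # 1
print(hex(ord(moon)))          # 0x1f315
print(hex(sys.maxunicode))     # 0x10ffff
print(bmp == "\U00002468")     # True
```

If these lines run without an error, the string itself is fine and any remaining problem is on the output side.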
Python 2.7.6 (default, Nov 21 2013, 15:55:38) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unichr(0x1F315)
u'\U0001f315'
>>> print u'\U0001f315'
🌕
>>>
Notice, I get a blank line after the print statement. May I ask which font you are using in IDLE? Could that be the problem (please don’t be something that simple)?
IDLE = Python’s basic IDE (Integrated DeveLopment Environment), possibly named after Monty Python’s Eric Idle? I think we are talking about the same thing though.
Author Guido van Rossum says IDLE stands for “Integrated DeveLopment Environment”, and since van Rossum named the language Python partly to honor British comedy group Monty Python, the name IDLE was probably also chosen partly to honor Eric Idle, one of Monty Python’s founding members.
Btw, there’s also a (Qt-based) Python IDE named “Eric”:
In this case the question about font support is meaningless. Python outputs a sequence of characters; it is your terminal program that actually renders this sequence of characters on screen. Your first example is a strong indication that whatever you use as your terminal program (you never even mentioned your environment so far) is set to expect UTF-16 as its character encoding, so an attempt to output UTF-8 is bound to fail.
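One way to separate what Python emits from what the terminal renders is to inspect the string using ASCII-only output, which does not depend on the terminal encoding or the font. A minimal sketch, assuming Python 3:

```python
import unicodedata

moon = "\U0001F315"

# ascii() and unicodedata.name() produce only ASCII text, so they print
# correctly whatever the terminal or font is able to display.
print(ascii(moon))                 # '\U0001f315'
print(unicodedata.name(moon))      # FULL MOON SYMBOL
print(moon.encode("utf-8"))        # b'\xf0\x9f\x8c\x95'
print(moon.encode("utf-16-le"))    # b'<\xd8\x15\xdf'
```

The last two lines show the bytes a UTF-8 or UTF-16 terminal would receive for the same character; the surrogate pair D83C DF15 appears only in the UTF-16 form.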
Not sure what you mean by ‘environment’. Running OS 13.1, KDE4 desktop, Python 2.7.6, PyCharm (for writing my Python scripts) and IDLE, console is konsole. Or, there’s everything :
Unfortunately I have not used KDE for quite some time, so hopefully someone else can help here. But if I remember correctly, KDE had its own language settings; the value of LANG (or the output of ‘locale’) shows only which variables are set in the shell started by konsole; they do not necessarily imply anything about konsole’s settings or what input it expects. E.g. GNOME terminal allows you to explicitly set the encoding.
Konsole does as well.
It should be UTF-8 by default, but check the setting anyway, it’s in the menu: View->Set Encoding, this should probably be set to Unicode->UTF-8 (haven’t completely followed the thread).
You can set the default encoding in the profile settings (Settings->Edit Current Profile…, or Settings->Manage Profiles…)
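The terminal setting is one half of the picture; the other half is the encoding Python itself picks for its output. A small sketch, assuming Python 3, that reports both values so they can be compared with the Konsole setting (the helper name `output_encodings` is made up for illustration):

```python
import locale
import sys


def output_encodings():
    """Return the encodings Python will use for text output.

    Hypothetical helper, not part of the standard library.
    """
    return {
        # Encoding of the stream that print() writes to.
        "stdout": getattr(sys.stdout, "encoding", None) or "unknown",
        # Encoding derived from the locale (LANG, LC_ALL and friends).
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(output_encodings().items()):
    print(name, "=", value)
```

If the stdout value is not UTF-8 while the terminal expects UTF-8, the PYTHONIOENCODING environment variable is one way to override it when starting Python.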
Thanks wolfi, just checked and yes, it is set for utf-8. As far as I can tell, everything is utf-8 (Linux/openSuse default?). It’s a puzzler. Just can’t seem to display anything beyond #ffff, just get an empty square box. I even tried from the terminal: