Python 3.4 and unicode surrogates

Hi all,
I’m in a bit of a quandary here regarding Unicode surrogates, specifically the Plane 1 characters, though I imagine it’s the same for all planes above 0. Say I want to print the Unicode character for a full moon, which in Plane 1 is 1F315, which I can’t use since my program is using UTF-8 not -16. My thought was to use the surrogate pair D83C and DF15. If I run it in IDLE, I get:


>>> print("\ud83C\uDF15")
🌕
>>>

However when I try it in a script, I get:


print("\uD83C\uDD93")
...
Traceback (most recent call last):
  File "C:/Python/Astronomy/Planet_Icons.py", line 46, in <module>
    print("\uD83C\uDD93")
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 0: surrogates not allowed

I’ve tried it as a variable as well:


spam = u"\uD83C\uDD93"
print(spam)
...
Traceback (most recent call last):
  File "C:/Python/Astronomy/Planet_Icons.py", line 44, in <module>
    print (spam)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 0: surrogates not allowed
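
For what it is worth, the traceback is consistent with how Python 3.3 and later treat strings: "\uD83C\uDD93" is stored as two separate surrogate code points, and the UTF-8 codec refuses to encode surrogates. A minimal sketch of one way to fold such a pair back into a single character, assuming Python 3.4 or later (where the "surrogatepass" error handler works with the UTF-16 codecs); `merge_surrogates` is a made-up helper name, not something from the standard library:

```python
def merge_surrogates(text):
    """Combine UTF-16 surrogate pairs held in a str into real code points."""
    # "surrogatepass" lets the surrogates through the UTF-16 encoder;
    # decoding those bytes again joins each pair into one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


pair = "\ud83c\udf15"           # two surrogate code points, as typed in IDLE
moon = merge_surrogates(pair)   # one code point

print(len(pair), len(moon))     # lengths before and after merging
print(hex(ord(moon)))           # code point of the merged character
print(moon.encode("utf-8"))     # now encodes without UnicodeEncodeError
```

Writing the character directly as "\U0001F315" avoids the round trip altogether; the helper is only useful when the surrogates arrive from somewhere else.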

Am I missing something? Some ‘import’? I’ve spent several hours trying to find the answer on the internets to no avail, so I thought I would turn to the experts here :).

Again, I’m using Python 3.4 and if it matters, I’m using PyCharm’s IDE.

Thanks, as always, for taking the time.

Plane 1 is 1F315, which I can’t use since my program is using UTF-8 not -16

Not sure I understand this. UTF-8 supports the full 32-bit Unicode range. Is it a specific Python limitation?

Thanks for the reply arvidjaar. First let me preface this by saying I am new to Python (and by extension, Unicode). Therefore, I’ll refer to two web pages, the first being the Unicode ‘layout’, the second being a brief explanation of how it works (at least in a manner that didn’t give me a migraine):

https://blog.jcoglan.com/2014/06/17/utf-8-its-what-strings-are-made-of/

If I run sys.maxunicode and sys.maxunicode.bit_length() I get:


>>> import sys
>>> sys.maxunicode
1114111
>>>
>>> sys.maxunicode.bit_length()
21
>>>

So I am guessing that, no, I don’t have 32 bit support (by the way, these results are the same on both my Windows 8.1 and openSUSE computers, two separate machines). Anyway, because they decided they needed more ‘space’ for stuff they came up with UTF-16 and created ‘surrogate pairs’. As I said, the pairs work fine in IDLE, so I have access to the extended sets, but when I try to use it in a script, I get the error. The fact it works in IDLE and not the script leads me to think it’s not a problem with Python, but rather my ignorance.
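
A side note on the numbers above: 1114111 is 0x10FFFF, the highest Unicode code point, so a 21-bit sys.maxunicode already covers Plane 1. The surrogate pair is just a UTF-16 spelling of the same code point. A rough sketch of the arithmetic, with hypothetical helper names chosen for illustration:

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are written as surrogate pairs")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F315)
print("%04X %04X" % (high, low))                  # D83C DF15
print(hex(from_surrogate_pair(0xD83C, 0xDD93)))   # 0x1f193
```

Note that the pair used in the script, D83C DD93, maps to U+1F193 rather than the full moon at U+1F315.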

How is UTF-16 relevant here? UTF-8 for U+1F315 is F09F8C95.
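
One way to reproduce that conversion in Python 3, shown as a small sketch: first with the built-in codec, then by hand for the four-byte case. `utf8_four_bytes` is an illustrative helper, not a library function, and it only handles code points from U+10000 to U+10FFFF.

```python
import binascii


def utf8_four_bytes(code_point):
    """Hand-rolled UTF-8 encoding for code points U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only covers the four-byte range")
    return bytes([
        0xF0 | (code_point >> 18),            # 11110xxx: top 3 bits
        0x80 | ((code_point >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((code_point >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (code_point & 0x3F),           # 10xxxxxx: last 6 bits
    ])


moon = "\U0001F315"
codec_hex = binascii.hexlify(moon.encode("utf-8")).decode("ascii").upper()
manual_hex = binascii.hexlify(utf8_four_bytes(0x1F315)).decode("ascii").upper()

print(codec_hex)    # F09F8C95
print(manual_hex)   # F09F8C95
```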

You’re right. After a little more reading, it would seem I shouldn’t be using surrogate pairs with UTF-8. Your answer leaves me with two questions:

  1. what would the proper print statement be and
  2. how did you get the ‘F09F8C95’. Could you point me in the direction on how you made that conversion? Ok, skip this one, I found where to find this.

Still can’t seem to get it to work in a ‘print’ statement though. :shame:

Well, I have an update, sort of. Apparently both machines, windoze and Linux, and both versions of Python, 3.4 and 2.7, only allow me to decode the range #0000-#ffff. Using the \U, I have to pad with 4 0’s, i.e. \U00002468, but it still won’t let me go beyond 4 characters, so \U0001F315 won’t work. Surrogate pairs work in IDLE but not in a script.
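
For comparison, a short sketch of how the two escapes behave in Python 3.3 or later (an assumption; narrow Python 2 builds differ): \u takes exactly four hex digits and \U exactly eight, and the eight-digit form can name code points above #ffff. The try/except is only there so the snippet does not crash when the output stream cannot encode the character.

```python
moon_escape = "\U0001F315"   # \U needs exactly eight hex digits
moon_chr = chr(0x1F315)      # same character built from its code point
four_digit = "\u2468"        # \u needs exactly four hex digits

print(moon_escape == moon_chr)    # True
print(len(moon_escape))           # 1
print("U+%04X" % ord(four_digit), "U+%04X" % ord(moon_escape))

try:
    print(moon_escape)            # needs an output encoding that covers it
except UnicodeEncodeError:
    print(ascii(moon_escape))     # fall back to the escaped form
```

If the string itself checks out like this, whatever goes wrong afterwards happens at the output stage rather than in the literal.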

As usual, it’s probably something simple and basic I’m missing, but right now I feel like :poop: (Python source code u"\U0001F4A9") so I’m going to call it quits for tonight and try again tomorrow

How is your locale set?

Python 2.7.6 (default, Nov 21 2013, 15:55:38) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unichr(0x1F315)
u'\U0001f315'
>>> print u'\U0001f315'
🌕
>>> 

Sorry, got tied up today. I’ll pick this up tomorrow and post the info.

All right then, here is the ‘locale’ info


Yes Master? locale -a |grep 'en_US'
en_US
en_US.iso885915
en_US.utf8


Yes Master? locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Yes Master?

My results from your example:


>>> unichr(0x1f315)
u'\U0001f315'
>>> print u'\U0001f315'


>>> 

Notice, I get a blank line after the print statement. May I ask, which font you are using in IDLE? Could that be the problem (please don’t be something that simple :stuck_out_tongue: ) ?

I have no idea what IDLE is. The example was made using interactive python in gnome-terminal on openSUSE 13.1 (GNOME 3.10).

IDLE = Python’s basic IDE (Integrated DeveLopment Environment), possibly named after Monty Python’s Eric Idle? I think we are talking about the same thing though.

Probably.
From http://en.wikipedia.org/wiki/IDLE_(Python):

Author Guido van Rossum says IDLE stands for “Integrated DeveLopment Environment”, and since van Rossum named the language Python partly to honor British comedy group Monty Python, the name IDLE was probably also chosen partly to honor Eric Idle, one of Monty Python’s founding members.

Btw, there’s also a (Qt-based) Python IDE named “Eric”. :wink:

In this case the question about font support is meaningless. Python outputs a sequence of characters; it is your terminal program that actually renders this sequence of characters on screen. Your first example is a strong indication that whatever you use as your terminal program (you have not even mentioned your environment so far) is set to expect UTF-16 as its character encoding, so an attempt to output UTF-8 is bound to fail.


bor@opensuse:~> PYTHONIOENCODING=utf_8 python -c "print u'\U0001f315'" | xxd
0000000: f09f 8c95 0a                             .....
bor@opensuse:~/src/grub> PYTHONIOENCODING=utf_16 python -c "print u'\U0001f315'" | xxd
0000000: fffe 3cd8 15df 0a                        ..<....
bor@opensuse:~> 
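
The same comparison can be made from inside Python without xxd. This is a rough sketch assuming Python 3; it prints the encoding Python has picked for standard output and the bytes each codec produces for the character. The UTF-16 variant is spelled out as a little-endian BOM plus "utf-16-le" so the result does not depend on the machine's byte order.

```python
import binascii
import codecs
import sys

moon = "\U0001F315"

# What Python will use when print() writes to this stream.
print("stdout encoding:", sys.stdout.encoding)

utf8_bytes = moon.encode("utf-8")
utf16_bytes = codecs.BOM_UTF16_LE + moon.encode("utf-16-le")

print(binascii.hexlify(utf8_bytes).decode("ascii"))    # f09f8c95
print(binascii.hexlify(utf16_bytes).decode("ascii"))   # fffe3cd815df
```

The surrogate values D83C and DF15 only show up in the UTF-16 byte stream (byte-swapped as 3cd8 15df); the UTF-8 stream carries the code point directly.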

Wow, looks very all inclusive? I think it would take me longer to learn how to set it up than it’s taking me to learn Python :stuck_out_tongue:

Not sure what you mean by ‘environment’. Running OS 13.1, KDE4 desktop, Python 2.7.6, PyCharm (for writing my Python scripts) and IDLE, console is konsole. Or, there’s everything :slight_smile: :


/usr/bin/python2.7 /home/sparkz/PycharmProjects/untitled/environment.py
Content-Type: text/plain


                          LESS -M -I -R 
                           CPU x86_64 
              KDE_FULL_SESSION true 
                         SHELL /bin/bash 
                 XDG_DATA_DIRS /usr/share:/usr/share:/etc/opt/kde3/share:/opt/kde3/share:/opt/kf5/share
                        GS_LIB /home/sparkz/.fonts 
                      HISTSIZE 1000 
    LESS_ADVANCED_PREPROCESSOR no 
                       MANPATH /usr/local/man:/usr/share/man:/opt/kde3/share/man 
                    XMODIFIERS @im=local 
                     JAVA_HOME /usr/lib64/jvm/java 
                   PROFILEREAD true 
               XDG_RUNTIME_DIR /run/user/1000 
                    PYTHONPATH /home/sparkz/PycharmProjects/untitled 
                XDG_SESSION_ID 1 
      DBUS_SESSION_BUS_ADDRESS unix:abstract=/tmp/dbus-4vuAhgJB3g,guid=18563e063a00a9c4af2f3dbd543fbb43 
                      SDK_HOME /usr/lib64/jvm/java 
               DESKTOP_SESSION default 
                   CONFIG_SITE /usr/share/site/x86_64-unknown-linux-gnu 
                   GTK_MODULES canberra-gtk-module 
                      HOSTNAME lenovo 
                PYCHARM_HOSTED 1 
                          MAIL /var/spool/mail/sparkz 
                      MACHTYPE x86_64-suse-linux 
                     JAVA_ROOT /usr/lib64/jvm/java 
                       MINICOM -c on 
                       CSHEDIT emacs 
                      LESSOPEN lessopen.sh %s 
                       CVS_RSH ssh 
                          USER sparkz
                       INPUTRC /home/sparkz/.inputrc 
                      XDG_VTNR 7 
                    DM_CONTROL /var/run/xdmctl 
              PYTHONUNBUFFERED 1 
                      JDK_HOME /usr/lib64/jvm/java 
               SESSION_MANAGER local/lenovo:@/tmp/.ICE-unix/1783,unix/lenovo:/tmp/.ICE-unix/1783 
                         SHLVL 2 
                 XCURSOR_THEME Oxygen_White 
                GPG_AGENT_INFO /tmp/gpg-fURCFs/S.gpg-agent:1718:1 
                          LANG en_US.UTF-8 
                   JAVA_BINDIR /usr/lib64/jvm/java/bin 
                QT_PLUGIN_PATH /home/sparkz/.kde4/lib64/kde4/plugins/:/usr/lib64/kde4/plugins/ 
                     CLASSPATH /home/sparkz/pycharm/pycharm-community-3.4.1/bin/../lib/bootstrap.jar:/home/sparkz/pycharm/pycharm-community-3.4.1/bin/../lib/extensions.jar:
				  /home/sparkz/pycharm/pycharm-community-3.4.1/bin/../lib/util.jar:/home/sparkz/pycharm/pycharm-community-3.4.1/bin/../lib/jdom.jar:
				  /home/sparkz/pycharm/pycharm-community-3.4.1/bin/../lib/log4j.jar:/home/sparkz/pycharm/pycharm-community-3.4.1/bin/../lib/trove4j.jar:
				  /home/sparkz/pycharm/pycharm-community-3.4.1/bin/../lib/jna.jar 
                XSESSION_IS_UP yes 
                 GTK2_RC_FILES /etc/gtk-2.0/gtkrc:/home/sparkz/.gtkrc-2.0:/home/sparkz/.kde4/share/config/gtkrc-2.0 
                       GPG_TTY not a tty 
                      XNLSPATH /usr/share/X11/nls 
                             _ /usr/lib64/jvm/java/bin/java 
                 GTK_IM_MODULE cedilla 
               XDG_CONFIG_DIRS /etc/xdg 
                 WINDOWMANAGER /usr/bin/startkde 
                  GTK_RC_FILES /etc/gtk/gtkrc:/home/sparkz/.gtkrc:/home/sparkz/.kde4/share/config/gtkrc 
                         PAGER less 
                 KDE_MULTIHEAD false 
                 QT_SYSTEM_DIR /usr/share/desktop-data 
               LD_LIBRARY_PATH /home/sparkz/pycharm/pycharm-community-3.4.1/bin: 


                          HOME /home/sparkz
                       DISPLAY :0 
                QT_IM_SWITCHER imsw-multi 
                       NLSPATH /usr/dt/lib/nls/msg/%L/%N.cat 
            G_BROKEN_FILENAMES 1 
                        TMPDIR /tmp 
                        OSTYPE linux 
                    NNTPSERVER news 
           G_FILENAME_ENCODING @locale,UTF-8,ISO-8859-15,CP1252 
                          HOST lenovo 
           XDG_CURRENT_DESKTOP KDE 
                   FROM_HEADER  
                     LESSCLOSE lessclose.sh %s %s 
                       USE_FAM  
                      JRE_HOME /usr/lib64/jvm/java/jre 
                     XKEYSYMDB /usr/X11R6/lib/X11/XKeysymDB 
                          MORE -sl 
              PYTHONIOENCODING UTF-8 
                      HOSTTYPE x86_64 
                  QT_IM_MODULE xim 
                       LOGNAME sparkz 
                      XDG_SEAT seat0 
                          PATH /home/sparkz/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/sbin 
           KDE_SESSION_VERSION 4 
                          TERM xterm 
                    WINDOWPATH 7 
                     COLORTERM 1 
               KDE_SESSION_UID 1000 
                   XDM_MANAGED method=classic 
                       LESSKEY /etc/lesskey.bin 
                 PYTHONSTARTUP /etc/pythonstart 
                           PWD /home/frank 
            DESKTOP_STARTUP_ID lenovo;1413462967;36370;1846_TIME175510 
               XFILESEARCHPATH /usr/dt/app-defaults/%L/Dt 
            XAUTHLOCALHOSTNAME lenovo 


Process finished with exit code 0

Unfortunately I have not used KDE for quite some time, so hopefully someone else can help here. But if I remember correctly, KDE had its own language settings; the value of LANG (or the output of ‘locale’) shows only which variables are set in the shell started by konsole; they do not necessarily imply anything about konsole’s settings or what input it expects. E.g. GNOME terminal allows you to explicitly set the encoding.

Konsole does as well.
It should be UTF-8 by default, but check the setting anyway, it’s in the menu: View->Set Encoding, this should probably be set to Unicode->UTF-8 (haven’t completely followed the thread).
You can set the default encoding in the profile settings (Settings->Edit Current Profile…, or Settings->Manage Profiles…)

Thanks wolfi, just checked and yes, it is set for utf-8. As far as I can tell, everything is utf-8 (Linux/openSUSE default?). It’s a puzzler. Just can’t seem to display anything beyond #ffff, just get an empty square box. I even tried from the terminal:


Yes Master> echo -e '\U1f315'
Yes Master> �
Yes Master> echo -e '\U0001f315'
Yes Master> �

I’m beginning to think it’s a conspiracy of international proportions.

I’d say that the font you are using just doesn’t contain any characters beyond 0xffff, in particular not the one you are trying to print.
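
A quick way to separate the two possibilities, sketched under the assumption of Python 3: ask Python what it thinks the character is and which bytes it would send. If those look right and the terminal still draws a box, the string and the encoding are fine and only the glyph is missing. `describe` is a made-up helper name.

```python
import unicodedata


def describe(ch):
    """Summarise one character: code point, plane, Unicode name, UTF-8 bytes."""
    code_point = ord(ch)
    return {
        "code_point": "U+%04X" % code_point,
        "plane": code_point >> 16,                  # 0 is the BMP, 1 is Plane 1
        "name": unicodedata.name(ch, "<unnamed>"),  # from Python's Unicode database
        "utf8": ch.encode("utf-8"),
    }


info = describe("\U0001F315")
for key in ("code_point", "plane", "name", "utf8"):
    print(key, info[key])
```

This says nothing about which fonts are installed; it only confirms that the data leaving Python is what was intended.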

I’m beginning to think it’s a conspiracy of international proportions.

I doubt that. :wink:

But your very first post clearly shows that you could display U+1f315 when using UTF-16 encoding.

just get an empty square box.

Is it empty, or is it a square box with the hex representation of the Unicode value?