Sunday, 30 March 2008

Python Unicode lessons from the school of hard knocks


Unicode is becoming increasingly popular, but it is often misunderstood. That is a pity, as it is really not that difficult. Unicode is also a really powerful tool to know, and it becomes the standard string type in Python 3K.

There are many unicode articles around, but they are often very "theoretical" and computer science like. So I have collected a few practically oriented examples that I have learned in the school of hard knocks.


I will use "unicode" as shorthand for a unicode string, and "string" as shorthand for a plain Python string. I live in Europe and am used to using latin-1 as the default encoding. Whenever I write latin-1, just substitute your own language's default encoding. It should work just fine.

I remember that when I tried to understand unicode, I could not get my head around when to use encode and when to use decode. In these examples I practically only go from unicode to string, so I use the encode() method for most examples. I believe this makes it easier to understand and remember.

If there is popular demand I might write an article using only the decode() method.

Working with unicode in code



First examples from a python console:


>>> 'this is a python string'
'this is a python string'

>>> u'this is a python unicode string'
u'this is a python unicode string'


Not much difference there. That is because they both contain only ascii characters. When I try to insert a Danish character it changes:


>>> 'this is a pythøn string'
'this is a pyth\xc3\xb8n string'

>>> u'this is a pythøn unicode string'
u'this is a pyth\xf8n unicode string'


The string example shows something interesting, as it shows the ø as '\xc3\xb8'. Whenever you see international characters showing up as two encoded characters/bytes like this, it is usually a sign that you are seeing a utf-8 encoded string.

It is by no means the law, but it is a good rule of thumb.

In this example the string is encoded to utf-8 because that is the default I use in the console window that runs the examples.
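The two-byte symptom is easy to reproduce. The sketch below assumes Python 3, where the article's "unicode" is `str` and its "string" is `bytes`; the variable names are only illustrative.

```python
# Python 3 sketch: str plays the role of "unicode", bytes the role of "string".
text = "ø"

utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xB8
latin1_bytes = text.encode("latin-1")  # one byte: 0xF8

# Reading utf-8 bytes as if they were latin-1 produces the classic
# "two strange characters" symptom instead of a single ø.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes, latin1_bytes, ascii(misread))
```

So when one international character shows up as two, the bytes are most likely utf-8 that something decoded with a single-byte encoding.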

Thinking about unicode



A good way to think about the difference between unicode and string is as a text and a binary file.

Unicode is the text and the string is the binary file format. So when you want to save your text somewhere you save it in a file format. Different file formats you can use are latin-1 (iso-8859-1), ascii, utf-8 etc.

You convert to the correct file format by using the 'encode()' method. Like this:


>>> u'this is a pythøn unicode string'.encode('latin-1')
'this is a pyth\xf8n unicode string'

>>> u'this is a pythøn unicode string'.encode('utf-8')
'this is a pyth\xc3\xb8n unicode string'

>>> u'this is a pythøn unicode string'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 14: ordinal not in range(128)


Whoops. ascii is an illegal file format for text with international characters. That is an important lesson.

You can get from unicode to any other encoding without any extra information about the unicode string, as long as the encoding supports the characters your unicode uses. So unicode is the single type from and to which all the other text formats can be converted.
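As a rough illustration of that round trip, here is a sketch in Python 3 syntax (an assumption; the console examples in this article are Python 2). It also shows the optional second argument to encode(), which selects a lossy fallback instead of raising an error.

```python
text = "this is a pythøn unicode string"

# Encoding and then decoding with the same codec returns the original text.
for codec in ("latin-1", "utf-8"):
    assert text.encode(codec).decode(codec) == text

# ascii cannot represent ø, so a strict encode raises UnicodeEncodeError.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The error handler argument trades correctness for "never fails".
replaced = text.encode("ascii", "replace")  # ø becomes "?"
ignored = text.encode("ascii", "ignore")    # ø is silently dropped
```

The lossy handlers are handy for log output, but the information is gone for good, so they are no substitute for picking an encoding that supports your characters.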

unicode to anything



String encodings are archeological. Well sometimes you even get the feeling that they are geological. Especially when writing something to support an RFC. Each geological layer is there for backwards compatibility.

First there was ascii, then iso-8859-1 (latin-1), iso-8859-15 (adds the € sign and more) and finally utf-8. They are mostly supported in that order too. The older the encoding, the better support it has in software out there in the wild.

Theoretically you can just use utf-8 for it all and be done with it. But I have had to rewrite almost every email/maillist application I have written to support latin-1 or other encodings. Many mail clients support utf-8 correctly these days. But many web based clients do not. Neither hotmail nor gmail does, actually.

Most likely you will be writing software that works perfectly, and passes every test and mail client you and your customer have in your organisations. But when the brand new maillist goes live, customers complain about unreadable characters.

So it can make good sense to encode unicode into gradually more "modern" encodings: try the older encodings first, and if that fails, try the newer ones. This function does that::


# -*- coding: utf-8 -*-
>>> def optimal_encode(st):
...     # Try the oldest (best supported) encodings first.
...     for encoding in ['ascii', 'iso-8859-1', 'iso-8859-15', 'utf-8']:
...         try:
...             return (encoding, st.encode(encoding))
...         except UnicodeEncodeError:
...             pass
...     raise UnicodeError('Could not find encoding')
...

>>> st = u'this is a pythøn unicod€ string'
>>> print optimal_encode(st)

('iso-8859-15', 'this is a pyth\xf8n unicod\xa4 string')



The € sign makes it return 'iso-8859-15'. If that encoding was not in the list, it would return::


('utf-8', 'this is a pyth\xc3\xb8n unicod\xe2\x82\xac string')
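For readers on Python 3, the same idea might look like the sketch below. The `encodings` parameter is an addition for illustration and is not part of the function above.

```python
def optimal_encode(text, encodings=("ascii", "iso-8859-1", "iso-8859-15", "utf-8")):
    """Return (encoding, bytes) for the first encoding that can hold text."""
    for encoding in encodings:
        try:
            return encoding, text.encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding lacks a character, try the next one
    raise UnicodeError("Could not find encoding")


result = optimal_encode("this is a pythøn unicod€ string")
fallback = optimal_encode("€", ("ascii", "iso-8859-1", "utf-8"))
```

Since utf-8 can encode every unicode character, keeping it last in the list means the final raise is only reached if you pass in a list without it.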


Working with unicode in your editor



Any modern text editor can handle utf-8 encoding. So preferably you should use that in your Python files. You tell Python that your files are utf-8 encoded by adding an encoding declaration to the top of your file.

# -*- coding: utf-8 -*-

When this is done, you can write international characters directly in your source code and every string in your file is utf-8 encoded. This makes it true that:


u'pythøn'.encode('utf-8') == 'pythøn'


Saving unicode in files



Unicode only exists in memory. You cannot write it to a text file unless you write it as pickled data. But nobody else would then be able to read it, and you cannot look at it in an ordinary text editor.

To put your data into a file, you must encode it first.


>>> st
u'this is a pyth\xf8n unicode string'
>>> f = open('unicodetest.txt', 'w')
>>> f.write(st)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 14: ordinal not in range(128)
>>> st_utf8 = st.encode('utf-8')
>>> f.write(st_utf8)
>>> f.close()


When you try to write the unicode string to a file, Python tries to convert it to a string using the default ascii codec and fails. But when you encode it first there is no problem.
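In Python 3 the file object can do the encoding for you if you tell it which encoding to use (Python 2's codecs.open offers something similar). A minimal sketch, assuming Python 3 and a throwaway temporary directory:

```python
import os
import tempfile

text = "this is a pythøn unicode string"
path = os.path.join(tempfile.mkdtemp(), "unicodetest.txt")

# Option 1: encode by hand and write the bytes in binary mode.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: let the file object encode on write and decode on read.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```

Both options put the same bytes on disk. The second just moves the encode() call into the file object.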


There is a special byte sequence that can be inserted at the beginning of a text file to mark it as some kind of unicode encoded text. It is called a Byte Order Mark (BOM). For utf-8 it is three bytes long.

There are a few of those for the different encodings, but I have only ever had to use the utf-8 one:


>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>> f = open('unicodetest.txt', 'w')
>>> f.write(codecs.BOM_UTF8 + st_utf8)
>>> f.close()


The Byte Order Mark (BOM) is used especially on Windows. Frankly it is a bit of a mess in Python until 2.5. But many text editors recognize it and then know automatically that the file is utf-8 encoded.
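From Python 2.5 on there is also a 'utf-8-sig' codec that writes the BOM when encoding and strips it when decoding. The sketch below uses Python 3 syntax (an assumption) and a temporary file:

```python
import codecs
import os
import tempfile

text = "this is a pythøn unicode string"
path = os.path.join(tempfile.mkdtemp(), "bomtest.txt")

# "utf-8-sig" writes the BOM before the encoded text.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" strips the BOM; plain "utf-8" keeps it
# as an invisible U+FEFF character at the start of the text.
without_bom = raw.decode("utf-8-sig")
with_bom = raw.decode("utf-8")
```

That stray U+FEFF is worth remembering: if the first line of a file mysteriously fails a string comparison, a BOM decoded as plain utf-8 is a likely culprit.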

As far as I can figure out, the Page Template skin system in Zope does recognize the BOM. So you can use international characters in Page Templates.

At least I have tried editing files that were utf-8 encoded, but the international characters displayed wrong. They looked like: 'this is a pyth\xc3\xb8n string' So Zope's Page Template system got them wrong. Other times I have written them and they looked like 'this is a pythøn string'.

I assume that is due to the difference between having a correct BOM that Zope can recognize and not having one.

Generally, though, I work on many different Zope/OS system combos. So when I see these kinds of problems in html, I generally take the lazy way out and just use html entities, so it looks like: 'this is a pyth&oslash;n string' :-s

Unicode in html



In html you can represent international characters as both html entities (eg &oslash;) and as encoded strings. Normally you will use utf-8 for encoded strings.

If you choose the encoded strings you must tell what the encoding is. You do this by setting it as meta data in the head:


<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />


You can then use normal procedures for converting unicode html to utf-8 strings.
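Both routes can be sketched with encode(). The 'xmlcharrefreplace' error handler produces numeric character references (&#248; rather than the named &oslash;), which browsers treat the same way. Python 3 syntax is assumed here:

```python
text = "this is a pythøn string"

# Route 1: pure ascii output, with non-ascii characters replaced by
# numeric character references. No charset declaration is needed.
html_safe = text.encode("ascii", "xmlcharrefreplace")

# Route 2: keep the characters and declare charset=UTF-8 in the page.
html_utf8 = text.encode("utf-8")
```

The first route survives almost any misconfigured server or template system, at the price of less readable html source.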

Encoding strings



Unicode is not the only python class to have an encode() method. The string object also has one. That can be handy in many cases.

If you need to use a url, or any other string with special characters, as a filename it will cause you problems. Any string can be encoded in a simpler format that can be used as a filename or even as part of a path.

The hex encoding is the simplest, but base64 is the most space efficient.


>>> 'http://mxm-mad-science.blogspot.com/'.encode('hex')
'687474703a2f2f6d786d2d6d61642d736369656e63652e626c6f6773706f742e636f6d2f'

>>> 'http://mxm-mad-science.blogspot.com/'.encode('base64')
'aHR0cDovL214bS1tYWQtc2NpZW5jZS5ibG9nc3BvdC5jb20v\n'


A caveat here is that base64 always has a newline character at the end, and it might have '=' characters as padding. So you need to modify it a bit to remove those:


>>> 'http://mxm-mad-science.blogspot.com/'.encode('base64')[:-1].strip('=')
'aHR0cDovL214bS1tYWQtc2NpZW5jZS5ibG9nc3BvdC5jb20v'


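And when you want to decode it again you must put the padding back, otherwise decoding fails whenever the original length was not a multiple of three. Appending a single '=' as below works for this string; in general, pad with '=' until the length is a multiple of four: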


>>> ('aHR0cDovL214bS1tYWQtc2NpZW5jZS5ibG9nc3BvdC5jb20v' + '=').decode('base64')
'http://mxm-mad-science.blogspot.com/'
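A slightly more defensive variant is sketched below, assuming Python 3 and the standard base64 module. It swaps in the url-safe alphabet, because plain base64 output can contain '/' and '+', which are awkward in filenames and paths. The helper names are made up for this example.

```python
import base64


def encode_name(raw):
    """Encode bytes as unpadded, filename-friendly base64 text."""
    # urlsafe_b64encode uses "-" and "_" instead of "+" and "/".
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_name(name):
    """Restore the stripped padding, then decode back to bytes."""
    padding = "=" * (-len(name) % 4)  # zero, one or two "=" characters
    return base64.urlsafe_b64decode(name + padding)


name = encode_name(b"http://mxm-mad-science.blogspot.com/")
original = decode_name(name)
```

Computing the padding from the length removes the guesswork: the decoder always gets a string whose length is a multiple of four.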


I once made some mailinglist software in Zope, where the best solution was to save the subscribers with their emails as the id of the subscriber object. But Zope does not accept @ in a path. All I then had to do to make it work was:


>>> id = 'maxm@mxm.dk'.encode('hex')
>>> id
'6d61786d406d786d2e646b'


What has this got to do with unicode, you might ask? Well, you can use this method to use unicode strings as filenames without needing to remove special characters.

First make the unicode string, then convert it to utf-8 and then convert that to hex or base64.


>>> st = u'this is a pythøn unicod€ string/text'
>>> st_utf8 = st.encode('utf-8')
>>> st_utf8
'this is a pyth\xc3\xb8n unicod\xe2\x82\xac string/text'
>>> st_hex = st_utf8.encode('hex')
>>> st_hex
'7468697320697320612070797468c3b86e20756e69636f64e282ac20737472696e672f74657874'
>>> st == st_hex.decode('hex').decode('utf-8')
True
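The 'hex' string codec does not exist in Python 3, but bytes objects have hex() and fromhex() methods that do the same job. A sketch of the same round trip, assuming Python 3.5 or later:

```python
text = "this is a pythøn unicod€ string/text"

# text -> utf-8 bytes -> hex digits (only 0-9 and a-f, safe in a filename)
hex_name = text.encode("utf-8").hex()

# hex digits -> utf-8 bytes -> text
restored = bytes.fromhex(hex_name).decode("utf-8")
```

The hex form is twice as long as the utf-8 bytes, so for long names watch out for filesystem limits on filename length.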


Unicode in emails



Once upon a time when Bill Gates invented email he thought "7 bits is more than enough for every character." Well ok. Maybe it was not Bill Gates. But someone must have thought it. As that is the foundation that email is built upon.

Body Text



Actually, email messages can contain 8 bit characters. But SMTP can only transmit 7 bit messages. And as Dolly Parton once said: you cannot put 10 pounds of potatoes into a five pound sack. So if you are sending your email over SMTP, like everyone is, you must convert your content to 7 bits. This is called content transfer encoding.

First you make the body text in unicode.


>>> st = u'this is a pythøn unicode string'


Then you must convert it to the string encoding you want to send it as. Like latin-1, utf-8 etc.


>>> st_latin1 = st.encode('iso-8859-1')


Python recognises 'latin-1' as an encoding. Mail systems do not, so it is safer to use 'iso-8859-1'. Python will translate it automatically in the email module, but I have been bitten when composing emails without it, so I have made it a habit to always use the long form.

Latin-1 is still an 8 bit string. So we must content transfer encode that to a 7 bit ascii string.

A simple email message is made like this. Note that the set_payload() method does not do any character encoding. You merely tell it what encoding your string is already in. The content transfer encoding (quoted-printable here) is then applied for you:


>>> st = u'this is a pythøn unicode string'
>>> st_latin1 = st.encode('iso-8859-1')
>>> from email.Message import Message
>>> msg = Message()
>>> msg.set_payload(st_latin1, 'iso-8859-1')
>>> str(msg)
'From nobody Sun Mar 30 15:13:31 2008\nMIME-Version: 1.0\nContent-Type: text/plain; charset="iso-8859-1"\nContent-Transfer-Encoding: quoted-printable\n\nthis is a pyth=F8n unicode string'


The email module can also encode as base64, but that is mostly used for 8 bit binary content as it makes the files smaller than quoted-printable would.

You could use base64 for all email content types, but keeping text messages human readable is simply more practical.
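In Python 3 the module names are lowercase, and the MIMEText class accepts the text and the charset in one go. The following is a sketch under that assumption, not a drop-in replacement for the console session above:

```python
from email.mime.text import MIMEText

text = "this is a pythøn unicode string"

# MIMEText encodes the text with the given charset and then applies
# a content transfer encoding so the body is 7 bit clean.
msg = MIMEText(text, "plain", "iso-8859-1")

transfer_encoding = msg["Content-Transfer-Encoding"]
wire_body = msg.get_payload()  # the 7 bit form that goes over SMTP

# decode=True undoes the transfer encoding and returns latin-1 bytes.
decoded = msg.get_payload(decode=True).decode("iso-8859-1")
```

Two layers are at work here: the charset (iso-8859-1) turns characters into bytes, and the transfer encoding (quoted-printable) turns those 8 bit bytes into 7 bit text.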

Sending HTML



Sending HTML works exactly like plain text from a unicode point of view. HTML files can also be made as unicode and then encoded as utf-8 or latin-1 etc. Just set the content type:


>>> del msg['Content-Type']
>>> msg.add_header('Content-Type', 'text/html', charset='iso-8859-1')
>>> str(msg)
'From nobody Sun Mar 30 15:23:14 2008\nMIME-Version: 1.0\nContent-Transfer-Encoding: quoted-printable\nContent-Transfer-Encoding: base64\nContent-Type: text/html; charset="iso-8859-1"\n\ndGhpcyBpcyBhIHB5dGg9RjhuIHVuaWNvZGUgc3RyaW5n'


Based on the content type the email module chooses to convert html to base64 for you.

Email Headers



Email headers are special, and especially ugly. You press send on your new maillist software that has worked correctly during testing. Then all of a sudden there is one of them nasty international characters in the subject.

You did set the charset to iso-8859-1 in the add_header() method. So why does it fail?

Well because each header field needs to be set individually. The content of the headers has nothing to do with the message content of the text (payload).

You can set it like this:


>>> from email.Header import Header
>>> from email.Message import Message
>>> msg = Message()
>>> subject = u'This is a pythøn subject'
>>> subject_latin1 = subject.encode('latin-1')
>>> h = Header(subject_latin1, 'iso-8859-1')
>>> msg['Subject'] = h
>>> msg.as_string()
'Subject: =?iso-8859-1?q?This_is_a_pyth=F8n_subject?=\n\n'
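To check that an encoded header really carries the original text, you can decode it again with decode_header(). A sketch assuming Python 3, where Header accepts the unicode text directly:

```python
from email.header import Header, decode_header

subject = "This is a pythøn subject"

# Header takes the text and the charset to encode it with.
encoded = Header(subject, "iso-8859-1").encode()

# decode_header returns (bytes, charset) pairs for encoded words.
raw, charset = decode_header(encoded)[0]
restored = raw.decode(charset)
```

The encoded form is pure ascii, which is what lets it pass through old 7 bit mail systems untouched.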


The email headers like To, From etc. are a little bit tricky.


>>> from email.Utils import formataddr
>>> email = u'maxm@mxm.dk'.encode('latin-1')
>>> name = u'max møller'.encode('latin-1')
>>> From = formataddr( (name, email) )
>>> From
'max m\xf8ller <maxm@mxm.dk>'
>>> from email.Header import Header
>>> h = Header(From, 'iso-8859-1')
>>> str(h)
'=?iso-8859-1?q?max_m=F8ller_=3Cmaxm=40mxm=2Edk=3E?='
>>> msg = Message()
>>> msg['From'] = From
>>> msg.as_string()
'From: max m\xf8ller <maxm@mxm.dk>\n\n'
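Part of the trickiness is that only the display name should be encoded, while the address itself stays plain ascii. In recent Python 3 versions formataddr() accepts a charset argument that does this. A sketch under that assumption:

```python
from email.utils import formataddr, parseaddr

name = "max møller"
address = "maxm@mxm.dk"

# Only the display name is encoded; the address stays readable.
from_header = formataddr((name, address), charset="iso-8859-1")

encoded_name, parsed_address = parseaddr(from_header)
```

Compare this with the str(h) output above, where the whole value, angle brackets and @ included, was wrapped into one encoded word.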


UTF-7 and IMAP



If you are writing an email client in Python you will most likely need to support IMAP. IMAP has folders, and their names use a special encoding not seen anywhere else. It is an encoding based on utf-7. It is not a common problem so I will not use much space on it here. But I have written an encoder for it that you can get here:

http://svn.plone.org/svn/collective/mxmImapClient/trunk/imapUTF7.py


>>> from imapUTF7 import imapUTF7Encode
>>> st_utf7_imap = imapUTF7Encode(st)


More about the issue can be found here:

5.1.3. Mailbox International Naming Convention
http://www.faqs.org/rfcs/rfc2060.html
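To give a feel for what that naming convention involves, here is a rough Python 3 sketch of an encoder. It is an illustration written from the rules in the RFC section above, not the imapUTF7 module linked earlier, and the function name is made up.

```python
import base64


def imap_utf7_encode(name):
    """Encode a mailbox name using IMAP's modified UTF-7 (sketch)."""
    out = []
    pending = []  # non-ascii characters waiting to be base64 encoded

    def flush():
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii")
            # Modified base64: no "=" padding, and "," instead of "/".
            out.append("&" + b64.rstrip("=").replace("/", ",") + "-")
            del pending[:]

    for ch in name:
        if ch == "&":
            flush()
            out.append("&-")  # a literal ampersand is escaped as "&-"
        elif " " <= ch <= "~":
            flush()
            out.append(ch)    # printable ascii is passed through as-is
        else:
            pending.append(ch)
    flush()
    return "".join(out)


folder = imap_utf7_encode("pythøn")
```

Runs of non-ascii characters are collected and encoded together, which is why the helper buffers them instead of encoding one character at a time.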



More info and links



http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

http://evanjones.ca/python-utf8.html