Mental Jetsam

By Peter Finch

Archive for the ‘Python’ Category

Using Unicode characters in Python

Posted by pcfinch on June 7, 2007

It’s really easy to work with strings in Python, but when it comes to handling Unicode there are a few issues that you may have to deal with. The main problem is using Unicode characters with devices (consoles) or databases that do not support Unicode. If you have tried printing a Unicode string and got the following message, then you have experienced the issue.

>>> string = u'\u7279\u6b8a Unicode'
>>> print string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)

The issue is that Python is unable to convert the Unicode string into the encoding of the current terminal. A similar problem can also happen when trying to put Unicode data into a database that has not been configured to accept Unicode characters, or when sending it via E-Mail. A couple of simple tricks using the very powerful “encode()” method can make your code more resilient.
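The same failure can be reproduced without a terminal by calling encode() directly. This is only a rough sketch, written for Python 3 (where every str is already Unicode and the u'' prefix is optional) rather than the Python 2 used in the session above; the helper name try_encode is made up for illustration.

```python
# Python 3 sketch: str is already Unicode, so no u'' prefix is needed.
text = '\u7279\u6b8a Unicode'

def try_encode(value, encoding):
    """Return the encoded bytes, or None if the codec cannot represent value.

    try_encode is a hypothetical helper, not part of the standard library.
    """
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None

# latin-1 only covers code points below 256, so the two CJK characters fail.
latin1_result = try_encode(text, 'latin-1')

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_result = try_encode(text, 'utf-8')
```

Here latin1_result is None while utf8_result holds the encoded bytes, which is the difference between a terminal that is limited to latin-1 and one that speaks UTF-8.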

To convert a Unicode string so it can be displayed on an ASCII screen:

>>> print string.encode('ascii','replace')
?? Unicode
>>> print string.encode('ascii','ignore')
 Unicode

A more useful approach is to escape the characters that the target encoding cannot represent instead of throwing them away. My favorite is to use XML character entities. The string can then be safely put into a database field, sent out in E-Mails or placed in an HTML page.

>>> string.encode('ascii', 'xmlcharrefreplace')
'&#29305;&#27530; Unicode'
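For what it's worth, the escaped form can also be turned back into the original characters. The sketch below assumes Python 3, where encode() returns bytes and the standard html module provides unescape(); the function name to_ascii_safe is just an illustrative choice.

```python
import html

def to_ascii_safe(value):
    """Encode value as ASCII bytes, swapping any other character
    for an XML numeric character reference.

    to_ascii_safe is a hypothetical name used for this example.
    """
    return value.encode('ascii', 'xmlcharrefreplace')

text = '\u7279\u6b8a Unicode'

# The two CJK characters become numeric references; ASCII passes through.
safe = to_ascii_safe(text)

# html.unescape() resolves the references back into characters.
restored = html.unescape(safe.decode('ascii'))
```

One caveat: xmlcharrefreplace does not escape a literal “&” that is already in the text, and html.unescape() will also resolve named entities such as “&amp;amp;”, so the round trip is only exact when the original text contains nothing that already looks like an entity.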



Unescape a Python “escaped” string

Posted by pcfinch on April 13, 2007

Python has a very useful regular expression function, re.escape(), to escape special characters in a string. Oddly, there is no reverse function. Note that Python itself will automatically escape the backslash when it echoes the string at the interactive prompt, e.g.

>>> import re
>>> a = re.escape('Special \\#`1\\')
>>> a
'Special\\ \\\\\\#\\`1\\\\'

A simple way to “unescape” the string is to use a regular expression again. The following RE searches the string for a backslash followed by any character and replaces the pair with that character. The RE captures the character in a group, (.), and then refers to that group in the substitution string as \1.

>>> z = re.sub(r'\\(.)', r'\1', a)
>>> z
'Special \\#`1\\'
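Wrapped up as a function, the idea might look something like the sketch below. The name unescape is my own, and the re.DOTALL flag is an addition to the one-liner above: without it “.” does not match a newline, and re.escape() can put a backslash in front of a newline too. It is written for Python 3, where re.escape() escapes fewer characters than it did in Python 2, but the round trip works the same way.

```python
import re

def unescape(escaped):
    """Undo re.escape() by dropping the backslash in front of each
    escaped character.

    unescape is a hypothetical helper, not a function in the re module.
    """
    # re.DOTALL lets '.' match a newline, which re.escape() may also escape.
    return re.sub(r'\\(.)', r'\1', escaped, flags=re.DOTALL)

original = 'Special \\#`1\\'

# Escaping and then unescaping should give back the original string.
round_trip = unescape(re.escape(original))
```

With this, round_trip compares equal to original, including the trailing backslash.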

The trick here is the back reference to the character following the escaping “\”. You may think that all you need to do is replace every “\” in the escaped string with nothing (“”). Unfortunately, that doesn’t work correctly when the original string itself contains a “\”, which re.escape() turns into two backslashes … confused? A simple replace removes both of them, so the backslash that belonged to the original string is lost. A string made up only of backslashes ends up as an empty string.
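A small side-by-side comparison may make the difference easier to see. This is a Python 3 sketch and the variable names are only there for illustration.

```python
import re

original = 'a\\b'              # three characters: a, backslash, b
escaped = re.escape(original)  # four characters: a, backslash, backslash, b

# Naive approach: delete every backslash. Both are removed,
# so the one that was part of the original string is lost.
naive = escaped.replace('\\', '')

# Back reference approach: each backslash-plus-character pair is
# replaced by the character, so one backslash survives.
with_backref = re.sub(r'\\(.)', r'\1', escaped)

# Extreme case: a lone backslash is reduced to an empty string.
lone_backslash = re.escape('\\').replace('\\', '')
```

Only with_backref matches original here; naive comes out as “ab” and lone_backslash is empty.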
