Unobfuscating Unicode ubiquity: a practical guide to Unicode and UTF-8

December 2, 2015

Everybody’s probably seen it at least once:

“Viva Workiva!” Push

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)

Or this:

“Über Data!” Save

UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 0: ordinal not in range(128)

Unicode errors. They’re the hobgoblins of all errors, and they tend to travel in packs. Thankfully, stomping them out is usually pretty easy, but it’s even better to be able to identify and prevent these errors from cropping up in the first place. But how?

The goal of this document is to discuss two things:

  1. What Unicode and UTF-8 are and why they’re important.
  2. How to take them into account when developing so as to avoid errors when attempting to transcode text.

Let’s begin!

Unicode: A new mindset

“Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.”

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Before reading any further, if you haven’t read it yet, then I strongly recommend reading that article by Joel Spolsky. He does an excellent job of explaining where the motivations for Unicode came from (spoiler: accessibility, internationalization, and this weird thing called the internet…) and why the Unicode encoding known as UTF-8 is the definitive way of the future. Anything else should be considered old and broken, especially ASCII, which is only allowed to live on because UTF-8 is fully backwards-compatible with it: that is to say, every ASCII document is also a UTF-8 encoded Unicode document by design, which makes migration easy because you don’t actually have to do anything.
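A quick Python sanity check of that backwards compatibility (any ASCII-only bytes will do):

>>> ascii_bytes = b'Hello, world!'
>>> ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')  # same text either way
True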

Next, take a look at Tom Christiansen’s “The Good, the Bad, and the (mostly) Ugly.” It’s a brief presentation covering the state of Unicode support in various programming languages and environments. Tom highlights common pitfalls, and even some outright impossible edge cases, caused by incomplete Unicode implementations (or just poor language design) that developers working in those languages should be aware of (particularly those needing to parse Unicode-laden strings with Unicode-laden regular expressions).

Okay, now that those are out of the way, we can continue. I reference those documents because I want to draw out three points from them that I think will be of help going forward:

First, as mentioned in the “Absolute Minimum” article,

“There Ain’t No Such Thing As Plain Text.”

—Joel Spolsky

We often take it for granted that text—whether it’s Unicode or not—is text, and binary data is binary data. However, that’s not actually true: text is binary data, too, and unless you’re talking about cryptography, there’s really no such thing as plain text. Forget that plain text was ever a thing. As long as we’re talking about computers here, text is always going to be interpreted using some kind of encode/decode process. The question is, “Which kind?” The answer to that question used to be, “Well, that depends. What compiler are you using? What time period/nationality do you hail from? Who is the manufacturer of your terminal?” Then it became, at least for English speakers, “Eh, probably ASCII…? Try UCS-2 little-endian if not.” Since 1998, however, the answer has been a definitive and resounding “UTF-8.” Always.
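To see why the “Which kind?” question matters, here’s a quick Python 3 sketch decoding the very same bytes two different ways:

>>> raw = b'\xe2\x98\x83'   # three bytes of "text"
>>> raw.decode('utf-8')     # interpreted as UTF-8: one snowman
'☃'
>>> raw.decode('latin-1')   # interpreted as Latin-1: an accented a plus two control characters
'â\x98\x83'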

Second, when dealing with any kind of text—I mean, binary data—Unicode or not, it can be tempting to think of characters in a string, but when dealing with Unicode, this isn’t exactly accurate. With Unicode, it’s better to think of code points in an object. In Unicode, one or more bytes make up a single code point, and while a single code point usually maps to a single character, this isn’t always the case. It’s not behavior that most English speakers usually get a chance to encounter, as it’s primarily used for things like accents and diacritics, alternative letter formats, or even—if you’re using UTF-16 encoding—a Unicode feature known as surrogate pairing. In any event, it’s good to know about this subtle difference to avoid tripping yourself up: code points—not characters, and definitely not bytes—are the things most languages will iterate over when dealing with Unicode objects.
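To make the “code points, not characters” distinction concrete, here’s a quick Python 3 sketch in which a single rendered character is built from two code points (a letter plus a combining accent):

>>> s = 'e\u0301'  # LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT
>>> print(s)
é
>>> len(s)  # two code points, even though it renders as one character
2
>>> [hex(ord(c)) for c in s]  # iterating yields code points
['0x65', '0x301']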

Third, when it comes to higher-level programming languages, strings are really nothing more than objects of a special type. In some cases, their in-memory binary representations—the zeroes-and-ones that describe them—may be the same as what gets written out to disk in the end, but in other cases, they may not be the same at all. Whether they are the same or not depends not only on the language you’re using (e.g., Python versus JavaScript versus Dart versus Go, etc.) but also on the language implementation (e.g., CPython—which uses UTF-16 internally—versus IronPython versus Jython, or Node.js/V8—which use UCS-2—versus SpiderMonkey versus Chakra, etc.). See Christiansen’s “The Good, the Bad, and the (mostly) Ugly” paper for more information on this point. In the end, it’s better to think of any blob of text as an abstract Unicode object if it’s in memory, and as a serialized Unicode object only if it’s being transmitted or resides on disk.

To sum up:

  1. Unicode strings are really just objects consisting of many code points, which we can iterate over
  2. Objects must be serialized (encoded) before they can be written/stored/transmitted, and Unicode objects are no exception
  3. Though many encodings for Unicode objects exist, the standard encoding format for Unicode objects is UTF-8

Got all that? Good, we can proceed.

Q. “But wait! Is all this really necessary? What if I don’t need all that extra character support? All I want to do is write out a simple ASCII string without it erroring out!”

A. Is that "ASCII" string you want to output a hard-coded, literal string? If so, then any old print statement should do you just fine; you shouldn’t need to worry about this too much. (I put “ASCII” in quotes because, technically, it’s all UTF-8 now. You hear me? ASCII is dead.)

Wait, you say you got that string from a user? (Even if that user is you?) How? Through some kind of input box or command-line argument, you say…?

In that case, treat that user with some respect!

Give his or her textual data the full-citizenship rights it deserves by interpreting it as UTF-8!

UTF-8 is fully backwards-compatible with ASCII—so there’s no reason not to use it—and there is plenty of room for future growth in both the Unicode and UTF-8 specs—so UTF-8 is going to be around for a long, long time to come.

In short, the world is moving on, and failing to take into account the possibility that your endpoint or service or tool may be sent Unicode data someday will doom it to bugs at best, and premature obsolescence at worst. I’m not kidding.

[Figure: growth of Unicode adoption over time]

Seriously though, Unicode ain’t actually that bad. It’s quite ingenious in both its simplicity and its scalability—and thankfully, it’s not that hard to take it into account. Let’s find out how.

The rule of thumb: ACE 👍

Without further ado, the following three points together form a general rule of thumb for dealing with any kind of textual data, in or out:

  1. Assume all inbound text is UTF-8 — unless it’s specifically documented as being otherwise.

  2. Create all strings in memory as Unicode objects — these “Unicode objects” may or may not use UTF-8 encoding behind-the-scenes, but you don’t need to worry about that: just let the language interpreter you’re using do its thing.

  3. Ensure all outbound text is UTF-8 — unless the external service you’re talking to explicitly requires some other form of textual encoding.

Text that is inbound/outbound can be coming from/going to a file on disk, an HTTP service or endpoint, a blob of textual data in a database, etc. It includes pretty much anything an end-user can supply or consume. It does not mean any old API that happens to be external to you or your team. Generally speaking, if you’re passing around objects in memory within the same language or environment, do so with Unicode objects, not UTF-8 encoded binary data (aka buffers or bytestrings).
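As a rough end-to-end illustration of all three points (Python 3 here; the file names and the transformation are placeholders):

# 1. Assume: inbound bytes are UTF-8 unless documented otherwise.
with open('inbound.txt', 'rb') as f:
    text = f.read().decode('utf-8')

# 2. Create: do all in-memory work on Unicode objects.
text = text.upper()

# 3. Ensure: serialize back to UTF-8 on the way out.
with open('outbound.txt', 'wb') as f:
    f.write(text.encode('utf-8'))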

The more closely these rules are adhered to in your code base—whether manually by you, the programmer, or automatically by your language’s built-ins and other APIs—the less often you will find yourself encountering the dreaded Unicode(De|En)codeError.

Practical applications 💡 🎥

Listed in alphabetical order.

Dart 💘

Dart’s String API is already Unicode-aware. Huzzah!

Go 🚦

Go has a unicode package in its standard library that assists in dealing with Unicode characters.

HTML5

Always send your HTML output using UTF-8 encoding.

Never include a BOM (byte order mark) in your rendered HTML output.

Always define the character set in the first <meta> element within the <head> element, to ensure that it is within the first 512 bytes of the page, as some browsers only look at these first bytes before choosing a character set for the page:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    […]

Note that some browsers will prefer the charset attribute of the HTTP Content-Type header over the <meta> element if both are defined and they diverge from one another.

Always verify that your documents and/or web server are actually using UTF-8 encoding…

HTTP 🌐 💻

Always set the Content-Type header to the correct MIME type, making sure to define a charset of UTF-8 when referencing any kind of textual data:

Content-Type: text/html; charset=UTF-8

“…New subtypes of the ‘text’ media type SHOULD NOT define a default ‘charset’ value. If there is a strong reason to do so despite this advice, they SHOULD use the ‘UTF-8’ [RFC3629] charset as the default.”

RFC 6657 § 3

JavaScript (Node/server-side) ☕ 📜

Node is smart, and most of its APIs use the 'utf8' codec by default. In the few cases where they don’t (e.g., fs.readFile()), they’re still pretty smart: they return a buffer object of binary bytes, leaving it up to the developer to choose what encoding to interpret the buffer as, rather than trying to stringify it using something horrible and ancient, like 'ascii'. Even so, because of the minor inconsistency here, when reading or writing text it’s usually good to be explicit and specifically state that you want 'utf8'.

Reading

Always read as UTF-8. (If no encoding is specified, then a raw buffer is returned.)

var fs = require('fs');
var contents = fs.readFileSync('file.txt', {encoding: 'utf8'});

 

Writing

Always write as UTF-8. (In this case, 'utf8' is the default, but it’s good to be explicit anyway.)

fs.writeFileSync('file.txt', contents, {encoding: 'utf8'});

 

Gotchas

Unfortunately, many of the same string-handling gotchas that afflict client-side JavaScript apply to Node as well, since V8 uses the same UCS-2-based internals. See “JavaScript (browser/client-side)” below, and Christiansen’s “The Good, the Bad, and the (mostly) Ugly” paper for more information.

JavaScript (browser/client-side) ☕ 📜

ECMAScript5 (and thus JavaScript) is a black sheep when it comes to dealing with Unicode and UTF-8. It certainly supports Unicode. Kind of. At least, it’s got excellent support for code points in Plane 0—also known as the Basic Multilingual Plane—which is home to over 150 of the world’s languages, symbols, and punctuation. Internally, most JavaScript engines use a hacky style of the constant-width UCS-2 encoding, with support for UTF-16’s surrogate pairing tacked on almost as an afterthought in an effort to support code points in the other sixteen planes currently defined by the Unicode standard.

Without surrogate pairing, ECMAScript5 would be limited to only the first 65,536 (2¹⁶) code points of Unicode—or Plane 0—due to UCS-2’s only allowing up to sixteen bits for code point representation. (UTF-16, in spite of the name, is actually a variable-width codec: it uses a single 16-bit code unit for Plane 0 code points, and a pair of 16-bit code units, known as a surrogate pair, for code points from any other plane.) But there are an additional sixteen planes beyond Plane 0, consisting of 65,536 code points each, all of which need to be referenceable in order for a language to claim full support for Unicode. JavaScript only barely manages to accomplish this.

As an example, in order to render a fish (“🐟”), which is defined in Plane 1—the Supplementary Multilingual Plane—specifically in the Miscellaneous Symbols and Pictographs range—at code point U+1F41F, and which has a surrogate pair of code points in Plane 0 at U+D83D and U+DC1F, it is not the original Plane 1 code point which must be used to define it, but the surrogate pair in Plane 0:

> var fish = '\uD83D\uDC1F'
> fish
'🐟'

Attempting to use the definitive Plane 1 code point results in gibberish, because only the first four hexadecimal digits following \u are actually parsed:

> '\u1F41F'
'ὁF'

Not surprisingly, this can make tasks like iterating and looping difficult, because you can’t get an accurate string length:

> fish.length
2

It also means you can’t really trust the .charAt() method for any Unicode-laden string whatsoever:

> fish.charAt(0)
'�'
> fish.charAt(1)
'�'
> fish.charAt(2)
''

…or the .codePointAt() method, at least not for strings containing Unicode code points beyond Plane Zero:

> fish.codePointAt(0)
128031  // Decimal of 0x1F41F, the definitive Plane 1 code-point
> fish.codePointAt(1)
56351  // Decimal of 0xDC1F, the second Plane 0 surrogate
> fish.codePointAt(2)
undefined

Instead, the .charCodeAt() method should probably be used, since it’s a bit more consistent in its behavior:

> fish.charCodeAt(0)
55357  // Decimal of 0xD83D, the first Plane 0 surrogate
> fish.charCodeAt(1)
56351  // Decimal of 0xDC1F, the second Plane 0 surrogate
> fish.charCodeAt(2)
NaN

These surrogate pairs can then be recombined using String.fromCharCode():

> String.fromCharCode(fish.charCodeAt(0)) + String.fromCharCode(fish.charCodeAt(1))
'🐟'

However, if you really need to use the original code point to define a Unicode character outside of Plane 0, you can kind of cheat by using String.fromCodePoint():

> String.fromCodePoint(0x1F41F)
'🐟'

It’s ugly. Again, if you haven’t done it yet, take a look at the JavaScript section of Christiansen's “The Good, the Bad, and the (mostly) Ugly” paper. The basic gist of the story is, as long as you’re dealing in Plane 0, you should be fine. But be careful if you venture beyond U+FFFF.

Python 🐍

Reading

Always read as binary and immediately decode, assuming UTF-8.

with open('file.txt', 'rb') as f:
    contents = f.read().decode('utf-8')

[…]

r = requests.get(url)
contents = r.content.decode('utf-8')

[…]

contents = sys.stdin.read().decode('utf-8')

 

Writing

Always encode as UTF-8, and write as binary.

with open('file.txt', 'wb') as f:
    f.write(contents.encode('utf-8'))

[…]

r = requests.post(url, data=contents.encode('utf-8'))

[…]

sys.stdout.write(contents.encode('utf-8'))

 

Gotchas

This seems all well and good, but this post wouldn’t be complete without at least attempting to cover the massive topic of Python’s many gotchas when it comes to dealing with Unicode and UTF-8.

Okay, first things first: some of the official Python binaries share the same issue that JavaScript has, in that they can’t easily reference code points beyond Plane 0. Only wide builds are capable of referencing the higher Unicode code points (cf. PEP 261). To check whether your build of Python is a wide or a narrow build, try the following:

>>> unichr(2**16)

If it works, you’ve got a wide Python build. If it doesn’t…

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

…it means that you may have difficulty when working with code points beyond U+FFFF. You can still construct and reference them with the capital \U prefix, however:

>>> print u'\U0001F41F'
🐟

But be aware that many of the issues that plague JavaScript-land will also plague users of narrow Python builds because on narrow builds, Python has to fall back to surrogate pairing:

>>> len(u'\U0001F41F')
2
>>> u'\U0001F41F'[0]
u'\ud83d'
>>> u'\U0001F41F'[1]
u'\udc1f'

Most of this has already been covered in the “JavaScript (browser/client-side)” section, so I don’t want to spend too much time on it. Fortunately, the issue has been fixed (cf. PEP 393), but only for Python version 3.3 and above. Python has a decently sized chunk of other gotchas and nuanced behaviors that we still need to cover, so let’s not dwell any longer on the topic of narrow and wide builds.
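Before moving on, here’s the fish one last time on a Python 3.3+ build, where the PEP 393 fix applies and everything behaves the way you’d hope:

>>> len('\U0001F41F')
1
>>> '\U0001F41F'[0]
'🐟'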

As most seasoned Python devs know, a huge shift occurred from Python 2 to Python 3 regarding how strings are interpreted. In Python 2, the term string was a bit ambiguous: it could mean either an 8-bit bytestring, aka a str-type object—which was the default—or a Unicode string, aka a unicode-type object, which was represented in memory using UTF-16 encoding (at least when using the standard Python language interpreter, CPython), but which could be re-encoded and written out using any supported textual codec. These UTF-16 Unicode objects identified themselves in the interactive console with a u prefix, e.g.:

>>> my_unicode_str
u'Hello, world!'

In Python 3, however, all strings—aka strs—are Unicode objects by default, thus removing the need for a unicode type at all. One of the goals of this switch was to change the way developers thought about textual data: instead of the divide being put up between 8-bit strings and Unicode strings, it is now a distinction between simply strings and any other kind of binary data. These purely binary objects—which may or may not be textual data at all—get represented in the interactive console with a b prefix, e.g.:

>>> my_binary_data
b'\x00\x01\x02Hello, world!\x00String of Strings\x00\x01\x02'

This behavior is much closer in line with the Unicode Mindset outlined above, and it generally makes dealing with all kinds of data—textual or not—a whole lot easier.

However, for your average Python 2 dev who works with Google App Engine to any extent, the benefits of this change aren’t necessarily as apparent or as quickly realized. If nothing else, the divide makes it that much more difficult to write code that is both Unicode-aware and functional in both Python 2 and 3 environments (yes, we do have a few projects that have this requirement).

Therefore, here is a list of some common dos and don’ts for those who wish to write Unicode-friendly Python in a widely-compatible and future-friendly way (note: all examples are Python 2, unless specifically stated otherwise):

  • DON’T use unicode() to convert str to unicode 

    It looks friendly…

    >>> print s1
    I'm a UTF-8 string!
    >>> type(s1)
    <type 'str'>
    >>> s2 = unicode(s1)
    >>> type(s2)
    <type 'unicode'>
    

    …but it’s a trap! 

    >>> print s1
    I’m also a UTF-8 string!
    >>> type(s1)
    <type 'str'>
    >>> s2 = unicode(s1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
    

    Why does this happen? Pay close attention to the second character. In the first example, it’s one of those abbreviations for measurements in imperial feet that have been masquerading as apostrophes for the last two hundred years (that is, since typewriters were invented). In the second example, it’s an apostrophe proper, which is not actually defined in ASCII, but which is defined in Unicode (specifically as code point U+2019, which can be typed out as ALT+0146 on Windows and OPT+SHIFT+] on Mac), and was represented in the string data above using UTF-8 encoding:

    >>> print repr(s1)
    'I\xe2\x80\x99m also a UTF-8 string!'
    

    The real issue is that unicode() is more of a type caster than it is a type converter. Thus, while it will convert your single-byte bytestring objects into multi-byte UTF-16 Unicode objects in memory, it only works with ASCII-compatible bytestrings: if unicode() encounters any bytes with the high bit set, it completely freaks out, because without an explicit encoding argument it falls back to the ASCII codec.

    What’s more, if you need your code to work in Python 3, unicode is a no-go from the start, because as mentioned earlier, it doesn’t even exist as a type:

    >>> unicode
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'unicode' is not defined
    

 

  • DO use .decode('utf-8') to convert str to unicode 

    Continuing with the example above, everything would have been fine (in both cases) if we had just done this from the get-go:

    >>> type(s1)
    <type 'str'>
    >>> s2 = s1.decode('utf-8')
    >>> type(s2)
    <type 'unicode'>
    

    In Python 3, the types will look a bit different, but the end result is the same:

    >>> type(s1)
    <class 'bytes'>
    >>> s2 = s1.decode('utf-8')
    >>> type(s2)
    <class 'str'>
    

 

  • DON’T use str() to serialize unicode to str 

    The reason as to “Why not?” is pretty much the same as the unicode() issue above. It seems to work…

    >>> s1
    u"I'm a UTF-8 string!"
    >>> str(s1)
    "I'm a UTF-8 string!"
    

    …until it doesn’t:

    >>> s2
    u'I\u2019m also a UTF-8 string!'
    >>> str(s2)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1: ordinal not in range(128)
    

 

  • DO use .encode('utf-8') to serialize unicode to str 

    Works in Python 2…

    >>> type(s1)
    <type 'unicode'>
    >>> s2 = s1.encode('utf-8')
    >>> type(s2)
    <type 'str'>
    

    …and in Python 3:

    >>> type(s1)
    <class 'str'>
    >>> s2 = s1.encode('utf-8')
    >>> type(s2)
    <class 'bytes'>
    

 

  • DON’T use .encode() on bytestring objects 

    Unfortunately, .encode() isn’t a cure-all: you have to know when to use it. In Python 2, you can sometimes get away with using it excessively, because for single-byte, ASCII-compatible UTF-8 data, it acts as a no-op…

    >>> s1 = u'Hello, world!'
    >>> s1.encode('utf-8')
    'Hello, world!'
    >>> s1.encode('utf-8').encode('utf-8').encode('utf-8')
    'Hello, world!'
    

    …but for string objects containing multi-byte UTF-8 data, not so much:

    >>> s1 = u'Hello, snowman! ☃'
    >>> s1.encode('utf-8')
    'Hello, snowman! \xe2\x98\x83'
    >>> s1.encode('utf-8').encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 16: ordinal not in range(128)
    

    Python 3 tries to avoid tricking developers like this by simply removing the no-op behavior entirely: instead, Python 3 outright refuses to re-encode an already-encoded binary object (the .encode() method doesn’t exist for the bytes type):

    >>> s1
    b'Hello, world!'
    >>> s1.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'encode'
    

    Again, that little b prefix is Python 3’s way of indicating that the object is already straight up serialized binary (pay no attention to the fact that it appears to be displaying legible text—it could be anything). Since it’s already encoded binary, you can’t re-encode it unless you first decode it using the correct deserialization method. In this case, we know that this particular binary object contains only ASCII data, and because all ASCII data is also valid UTF-8 data by definition, in order to make it into a proper Unicode object for manipulation, all we have to do is to interpret it as UTF-8:

    >>> s2 = s1.decode('utf-8')
    >>> s2
    'Hello, world!'
    >>> s3 = s2 + ' ☃'
    >>> s3.encode('utf-8')
    b'Hello, world! \xe2\x98\x83'
    

 

  • DON’T use .decode() on Unicode objects 

    This situation is similar to .encode() above. Python 2’s .decode() tends to be really forgiving with single-byte, ASCII-compatible UTF-8 data…

    >>> s1 = 'Hello, world!'
    >>> s1.decode('utf-8')
    u'Hello, world!'
    >>> s1.decode('utf-8').decode('utf-8').decode('utf-8')
    u'Hello, world!'
    

    …but as soon as a multi-byte UTF-8 code point enters the picture, things can go haywire with overuse of .decode():

    >>> s1
    'Hello, snowman! \xe2\x98\x83'
    >>> s1.decode('utf-8')
    u'Hello, snowman! \u2603'
    >>> s1.decode('utf-8').decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 16: ordinal not in range(128)
    

    Again, Python 3 removes this subtle potential for confusion by simply refusing to allow you to re-interpret (re-decode()) the data for an already properly initialized Unicode object (the .decode() method doesn’t exist for the str type):

    >>> s1
    'Hello, snowman! ☃'
    >>> s1.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'decode'
    

    So, just like it’s best practice to avoid doubling-down with encode(), the same is true for decode().

    Essentially, Python 3 just does a better job of enforcing what is already a best practice for both versions of the language.

 

  • DON’T use .format() on bytestring objects 

    Q. “But why not? It works great!”

    >>> s = 'Hello, {}!'
    >>> s.format('world')
    'Hello, world!'
    

    A. In Python 2, the .format() function usually works great on both the str and the unicode types, so long as the types are the same…

    >>> print 'Hello, {}!'.format('world')
    Hello, world!
    >>> print u'\uD83D{}'.format(u'\uDC1F')  # Surrogate pair
    🐟
    

    …sometimes it even works when you mix and match, but only for ASCII-friendly Unicode objects…

    >>> print 'Hello, {}!'.format(u'world')
    Hello, world!
    

    however, as soon as you try to mix-and-match bytestrings and Unicode objects containing multi-byte code points, things blow up:

    >>> print 'Hello, {}!'.format(u'snowman \u2603')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 8: ordinal not in range(128)
    

    To remove the potential for this kind of super annoying sometimes-works-sometimes-doesn’t confusion, Python 3 has removed the .format() function from the bytes type entirely (aka the str type of Python 2):

    >>> b'Hello, {}!'.format('world')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'format'
    

    Thus, once again, Python 3 simply does a better job of enforcing what is already a best practice for both versions of the language: don’t use .format() with binary data. (Note, the rule here is not to avoid using .format() entirely—just avoid using it on str types, aka the bytes types of Python 3.)

 

  • DO use string concatenation with + for simple cases, maybe use string formatting with % when needed, but definitely use string formatting with .format() for complex cases, provided you know the types you’re using it with are indeed Unicode objects

    The + concatenation operator tends to behave much better than .format() when dealing with mixed data types:

    >>> print 'Hello, ' + u'snowman \u2603!'
    Hello, snowman ☃!
    

    It even up-converts your strings from str type to unicode type, which is exactly what we want according to Point 2 of the rule of thumb outlined above:

    >>> s = 'Hello, ' + u'snowman \u2603!'
    >>> type(s)
    <type 'unicode'>
    

    For long strings, it can get a bit unwieldy, and is therefore not advisable for complex string templates…

    >>> s = 'The machine named "' + machine + '" at IP ' + ip_addr + ' responded with "' + response + '"'
    

    …but for simple cases, the + operator actually works quite well, and is therefore preferable to something like this:

    >>> data = 'Message: {}'.format(msg)  # No...
    >>> data = 'Message: ' + msg          # Yes!
    

    As it turns out, string formatting with % behaves equally well with mixed input types:

    >>> s = 'Hello, %s!' % u'snowman \u2603'
    >>> print s
    Hello, snowman ☃!
    >>> type(s)
    <type 'unicode'>
    

    In fact, the only thing that can possibly break either of these is mixing Unicode objects with UTF-8 encoded binary, but that shouldn’t ever be done anyway, because that kind of operation breaks literally everything. (Incidentally, this is exactly why Point 2 of the rule of thumb exists: UTF-8 binary is meant for inputs and outputs, not in-memory transformations. When working with strings in memory, use only Unicode objects.)

    >>> unistr = u'Unicode string \u2603'
    >>> unitmp = u'Unicode template \u2603 [%s] [{}]...'
    >>> utf8str = 'UTF-8 string \xef\xbf\xbd'
    >>> utf8tmp = 'UTF-8 template \xef\xbf\xbd [%s] [{}]...'
    >>> unitmp % utf8str        # UnicodeDecodeError!!
    >>> unitmp + utf8str        # UnicodeDecodeError!!
    >>> unitmp.format(utf8str)  # UnicodeDecodeError!!
    >>> utf8tmp % unistr        # UnicodeDecodeError!!
    >>> utf8tmp + unistr        # UnicodeDecodeError!!
    >>> utf8tmp.format(unistr)  # UnicodeDecodeError!!
    

    However, this would be an excellent time to bring up PEP 3101:

    “This PEP proposes a new system for built-in string formatting operations, intended as a replacement for the existing ‘%’ string formatting operator.”

    PEP 3101

    …as well as this email from Guido:

    “…If s.format() were available, I’d use it in preference over s%x, just like I already use repr(x) in favor of `x`. And just like `x` is slated for removal in Python 3000, we might consider removing using % for formatting.”

    Guido van Rossum, “String formating operations in python 3k”

    In other words, % can be used, and may even be preferable to + for all but the simplest of cases, and may also be preferable to .format() in some situations because of its better behavior—but only for Python 2 projects that will never run on any other version of Python!

    If you need future compatibility, then for simple cases prefer string concatenation with + (because it shares the good behavior of %); otherwise, use .format(), keeping in mind the caveat noted above: use it only on real Unicode objects.

  • DON’T use the u prefix to denote Unicode literals if you need to support Python 3.0, 3.1, or 3.2. 

    Because of the need to use .format() only with Unicode objects, it’s been a fairly common practice among our team to do something like this:

    >>> u'template: {x} {y} {z}'.format(x=x, y=y, z=z)
    

    But this is far from a foolproof fix: developers should be aware that it actually breaks compatibility with some versions of Python 3:

    >>> u'asdf'
      File "<stdin>", line 1
        u'asdf'
              ^
    SyntaxError: invalid syntax
    

    Fortunately, the u prefix was eventually reintroduced in Python 3.3 (cf. PEP 414). The gist is that if you need to support Python 3.0, 3.1, or 3.2, then stay away from it, and instead use something like six.u() or __future__.unicode_literals, as outlined below.
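    For reference, here’s a minimal sketch of the six.u() approach (assuming the six package is installed). On Python 2 it yields a unicode object, and on Python 3 a str; note that \u escapes inside the literal are expanded on both:

    >>> import six
    >>> snowman = six.u('snowman \u2603')
    >>> type(snowman)  # Python 2 shown here
    <type 'unicode'>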

  • DO use from __future__ import unicode_literals 

    If you want your Unicode-aware Python 2 projects to behave better, consider throwing a from __future__ import unicode_literals at the top of any Python file that makes use of 'literal strings'.

    This will help you in three ways:

    1. Your 'literal strings' will be Unicode objects by default—in both Python 2 and Python 3—even without the u prefix. This will enable you to use things like .format() and .encode() to your heart’s content.

    2. Any str-type string objects residing in variables that are concatenated with a 'literal string' will be automatically up-converted to unicode-type string objects, which, when manipulating strings in memory, is exactly what we want according to Point 2 of the rule of thumb outlined above.

    For example:

    >>> bytestr = 'hello'
    >>> from __future__ import unicode_literals
    >>> unistr = 'world'
    >>> bytestr
    'hello'
    >>> unistr
    u'world'
    >>> type(bytestr)
    <type 'str'>
    >>> type(unistr)
    <type 'unicode'>
    >>> newstr = bytestr + ' ' + unistr
    >>> newstr
    u'hello world'
    >>> type(newstr)
    <type 'unicode'>
    

    But beware! If there are parts of your application which violate Point 2 by passing around multi-byte binary encodings in memory instead of using Unicode objects, this could actually give you a bit of a pain-point if your code tries to concatenate binary data containing multi-byte code points with a Unicode object:

    >>> bytestr = 'uh oh \ufffd'.encode('utf-8')
    >>> bytestr
    'uh oh \xef\xbf\xbd'
    >>> type(bytestr)
    <type 'str'>
    >>> newstr = bytestr + ' ' + unistr
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)
    

    However, this might actually be a good thing in the long run, as noted in the next point…

    3. This will help you to identify those portions of your application which may not be as Unicode aware as they should be. In general, if your application is passing around multi-byte encodings in memory rather than using Unicode objects, it may not be aware that it’s doing so, and you are probably avoiding Unicode errors more by luck than by design.

    As a final note, you might consider a third-party helper like six.u() instead of __future__.unicode_literals if you would prefer a more incremental approach to making your modules Unicode-aware.

  • DON’T raise exceptions with UTF-8 encoded binary bytestrings 

    While this may work, it’s actually a bad idea:

    >>> print branch_name
    pro_tips™
    >>> type(branch_name)
    <type 'unicode'>
    >>> raise SomeException('Invalid branch: ' + branch_name.encode('utf-8'))
    

    Why? Consider the people who will need to catch and handle your exception:

    >>> try:
    ...     <code that raises an exception>
    ... except SomeException as e:
    ...     warn_devs(
    ...         'Caught exception:\n' + e.message + '\n\nMitigating...')
    ...     <code that mitigates exception>
    

    Looks okay though, right? Wrong:

    def warn_devs(msg):
        requests.post('/warnings/', data=msg.encode('utf-8'))
    

    Why is this bad? Remember that msg is already encoded binary in Python 2!

    >>> type(e.message)
    <type 'str'>
    >>> e.message
    'Invalid branch: pro_tips\xe2\x84\xa2'
    

    It’s bad practice to try re-encoding it, since it won’t always work, and instead of getting the message you expected, you’ll just end up masking the issue with a Unicode error. Okay, so we just fix it by ceasing to re-encode, and document that the expected input type is a str rather than a unicode, right?

    def warn_devs(msg):
        """
        Warn devs that an exception was caught but was handled.
    
        @param msg: The message to send.
        @type msg: str
        """
        requests.post('/warnings/', data=msg)
    

    Wrong. This goes against Points 2 and 3 of the rule of thumb: namely, that all string objects in memory must be Unicode objects (using whatever form of representation the language interpreter prefers), and that they should only be serialized as UTF-8 when you’re actually ready to transmit or send them off somewhere. Essentially, writing an API that takes in a bytestring like this has the potential to give you many headaches down the road, because you will be required to remember to .encode('utf-8') all your data before you use this kind of API. Turns out, that’s a really easy thing to forget to do. And when you forget it, those Unicode errors are ready and willing to start filling up your logs. Better to .encode('utf-8') in one place—the warn_devs() function that’s actually responsible for sending out the data—and just require all consumers of the function to use Unicode objects instead of bytestrings (which they should be doing anyway, if they’re adhering to Points 1 and 2 of the rule of thumb).

    def warn_devs(msg):
        """
        Warn devs that an exception was caught but was handled.
    
        @param msg: The message to send.
        @type msg: unicode
        """
        requests.post('/warnings/', data=msg.encode('utf-8'))
    

    So, what’s a proper fix? Simple: just raise your exceptions with Unicode objects, not bytestrings.

    >>> e = SomeException('Invalid branch: ' + branch_name)
    >>> type(e.message)
    <type 'unicode'>
    >>> e.message
    u'Invalid branch: pro_tips\u2122'
    

 

  • DON’T use Exception('...').message, or str(Exception('...')), or unicode(Exception('...'))

    Okay, this is admittedly less of a Unicode and UTF-8 issue, and more of an issue of divergence between Python 2 and Python 3, but it’s still relevant because the commonly-employed solutions to this problem generally involve violating the rule of thumb in one way or another and end up causing Unicode errors as a result, so it’s a good thing to be aware of.

    Continuing with the example above:

    >>> try:
    ...     <code that raises an exception>
    ... except SomeException as e:
    ...     warn_devs(
    ...         'Caught exception:\n' + e.message + '\n\nMitigating...')
    ...     <code that mitigates exception>
    

    This works fine in Python 2, but if you need to support both Python 2 and Python 3, this isn’t usually an option. Why? Because a huge chunk of the built-in Exception types in Python 3 no longer have a message attribute:

    >>> e = Exception('heyyy')
    >>> e.message
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'Exception' object has no attribute 'message'
    

    Oftentimes, the solution to make this work in both Python 2 and Python 3 is to str()-ify the exception object itself—which just calls e.__str__() in the background—which seems to work great:

    >>> str(e)
    'heyyy'
    

    Problem: This breaks Unicode awareness in Python 2.

    >>> e = Exception(u'Erroneous snowman ☃')
    >>> str(e)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 18: ordinal not in range(128)
    

    One solution would be to use unicode() instead…

    >>> unicode(e)
    u'Erroneous snowman \u2603'
    

    …except that this breaks entirely in Python 3, because, as mentioned earlier, unicode isn’t even a valid type there. What’s more, this solution isn’t all that reliable in Python 2 itself anyway, because if the exception message happens to consist of UTF-8 encoded binary data (which, as mentioned earlier, is a bad way to raise exceptions for this very reason, but hey, it happens), everything explodes:

    >>> s
    'uh oh \xef\xbf\xbd'
    >>> e = Exception(s)
    >>> unicode(e)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)
    

 

  • DO use Exception('...').message, and Exception('...').args, and repr(Exception('...')) 

    The previous example explained why using Exception('...').message is a bad idea. As it turns out, it’s only a bad idea if you need to support both Python 2 and Python 3. If you only care about Python 2, then Exception('...').message is actually one of the better solutions available, since most of the time it will give you direct access to the data you’re after without any hassle (unless the data is UTF-8 encoded binary, but I’d be beating a dead horse if I said yet again that it’s a bad idea to raise exceptions like that anyway).

    If Exception('...').message isn’t to your liking, consider Exception('...').args or repr(Exception('...')):

    >>> e1, e2, e3 = Exception('ASCII'), Exception(u'Unicode \ufffd'), Exception(u'UTF-8 \ufffd'.encode('utf-8'))
    >>> e1.args, e2.args, e3.args
    (('ASCII',), (u'Unicode \ufffd',), ('UTF-8 \xef\xbf\xbd',))
    >>> repr(e1), repr(e2), repr(e3)
    ("Exception('ASCII',)", "Exception(u'Unicode \\ufffd',)", "Exception('UTF-8 \\xef\\xbf\\xbd',)")
    

    They work in Python 3, too:

    >>> e1, e2, e3 = Exception('ASCII'.encode('utf-8')), Exception('Unicode \ufffd'), Exception('UTF-8 \ufffd'.encode('utf-8'))
    >>> e1.args, e2.args, e3.args
    ((b'ASCII',), ('Unicode �',), (b'UTF-8 \xef\xbf\xbd',))
    >>> repr(e1), repr(e2), repr(e3)
    ("Exception(b'ASCII',)", "Exception('Unicode �',)", "Exception(b'UTF-8 \\xef\\xbf\\xbd',)")
    

    Both e.args and repr(e) generally work great, with one caveat: they don’t always provide useful data! They won’t error out on you though, so that’s nice, but they won’t always provide you with that juicy debugging information you’ve just spent the afternoon hunting down, either:

    >>> import urllib2
    >>> try:
    ...     urllib2.urlopen('http://httpbin.org/status/500')
    ... except Exception as e:
    ...     pass
    ...
    >>> e.args
    ()
    >>> repr(e)
    'HTTPError()'
    >>> str(e)  # THIS is the stuff we were after, but don't actually do this...
    'HTTP Error 500: INTERNAL SERVER ERROR'
    

    Catching and logging exceptions in a Unicode-friendly way is probably one of the hardest things to get right when it comes to writing Unicode-aware Python applications, because there are so many exception types out there, and you can’t really guarantee that all raisers of exceptions will behave properly and raise them with Unicode messages and not bytestrings. The only real way to get this right is to know what exceptions you want to catch—e.g., don’t except on Exception; it’s too broad—and when you want to catch them—i.e., know under what circumstances they will be raised and what kind of data they will contain when they’re thrown, so that you can get the data you want out of them.

    However, for those situations in which you simply must do something drastic—like except on Exception—where it’s just not possible to be fully aware of exactly what exception subclasses might come through your exception-handling logic, fear not: if robustness is what you’re after, there are yet two more foolproof ways of obtaining exception data in a Unicode-friendly way that won’t result in an error, regardless of whether the data contains Unicode or bytestrings.

  • DO use Python’s standard traceback module

    The standard traceback module has a number of methods that Unicode-aware Pythonistas would do well to make use of. We’re only going to look at three today: format_exc(), format_exception(), and format_exception_only(). Let’s look at the last one first, as it’s the least nuanced, and work our way backward:

    >>> import sys, traceback
    >>> e1, e2, e3 = Exception('ASCII'), Exception(u'Unicode \ufffd'), Exception(u'UTF-8 \ufffd'.encode('utf-8'))
    >>> for e in [e1, e2, e3]:
    ...     traceback.format_exception_only(type(e), e)
    ...
    ['Exception: ASCII\n']
    ['Exception: Unicode \\ufffd\n']
    ['Exception: UTF-8 \xef\xbf\xbd\n']
    

    Since format_exception_only() returns a list of bytestrings, in order to get a proper Unicode object, you have only to join them and then decode the result as UTF-8. Voilà!

    >>> for e in [e1, e2, e3]:
    ...     lines = traceback.format_exception_only(type(e), e)
    ...     msg = b''.join(lines).decode('utf-8')
    ...     print msg
    ...
    Exception: ASCII
    Exception: Unicode \ufffd
    Exception: UTF-8 �
    

    Bingo! Safe and easy-to-consume exception messages, regardless of either exception or message type.

    But what about when you want to include a traceback for even more debugging data? That’s where format_exception() comes in:

    >>> for e in [e1, e2, e3]:
    ...     try:
    ...         raise e
    ...     except Exception:
    ...         lines = traceback.format_exception(*sys.exc_info())
    ...         msg = b''.join(lines).decode('utf-8')
    ...         print msg
    
    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    Exception: ASCII
    
    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    Exception: Unicode \ufffd
    
    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    Exception: UTF-8 �
    

    format_exception() is really only useful when given three items, all of which are returned by a single call to sys.exc_info(), which produces an exception-context tuple that looks something like this:

    (<type 'exceptions.Exception'>, Exception('whoa',), <traceback object>)
    

    The first two items in this tuple are the same arguments that are passed into format_exception_only(), but the last one is what is responsible for the inclusion of the traceback information. Without it, the behavior of format_exception() and format_exception_only() is more or less the same. These traceback objects can only be generated while the Python interpreter has a genuine exception context (i.e., inside an except block). When handling them, however, it is a good idea to avoid assigning them to local variables, lest you risk creating a circular reference in memory. Since no reference to the output of sys.exc_info() is stored in the example above, the traceback object is always successfully garbage-collected as soon as the useful information has been extracted from it.
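    If you do need to hold the tuple in local variables for a moment, one defensive pattern (a sketch; it must run inside an except block to mean anything) is to delete the traceback reference explicitly once you’re done with it:

    >>> exc_type, exc_value, exc_tb = sys.exc_info()
    >>> try:
    ...     lines = traceback.format_exception(exc_type, exc_value, exc_tb)
    ... finally:
    ...     del exc_tb  # break the potential circular reference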

    Alternatively, we could just let format_exc() do most of the heavy lifting for us (namely, generating the exception context tuple and joining the bytestrings):

    >>> for e in [e1, e2, e3]:
    ...     try:
    ...         raise e
    ...     except Exception:
    ...         print traceback.format_exc().decode('utf-8')
    ... 
    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    Exception: ASCII
    
    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    Exception: Unicode \ufffd
    
    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    Exception: UTF-8 �
    

    Once obtained, these Unicode objects can be digested and consumed however you need: automated email/chat notifications, exception-handling routines, you name it. But what if all you’re interested in is logging? In that case, a yet simpler solution remains: The logging module’s exception() API—which is really just shorthand for log.error('...', exc_info=True).

  • DO use Python’s standard logging module 

    The Python logging module is amazing, really amazing:

    >>> import logging
    >>> logging.warn('ASCII works :)')
    WARNING:root:ASCII works :)
    >>> logging.warn(u'Unicode works \u263A')
    WARNING:root:Unicode works ☺
    >>> logging.warn(u'UTF-8 works \u263A'.encode('utf-8'))
    WARNING:root:UTF-8 works ☺
    

    However, configuring it to be a bit more informative can be a bit of a chore due to its very Java-like interface. Even getting it to display timestamps in UTC can take a bit of finagling. But once done (preferably early in your code), it stays done:

    import logging
    import time

    # See google_appserver/google/appengine/tools/dev_appserver_main.py
    GAE_PREFIX = '%(levelname)-8s %(asctime)s,%(msecs)03d %(filename)s:%(lineno)s]'
    LOG_FORMAT = GAE_PREFIX + ' %(name)s: %(message)s'
    ISO_FORMAT = '%Y-%m-%d %H:%M:%S'
    
    # https://docs.python.org/2/library/logging.html#logging.Formatter.formatTime
    logging.Formatter.converter = time.gmtime  # Use UTC time instead of local
    root_formatter = logging.Formatter(fmt=LOG_FORMAT, datefmt=ISO_FORMAT)
    
    # Create the root handler
    root_handler = logging.StreamHandler()
    root_handler.setFormatter(root_formatter)
    
    # Configure the root logger, from which all other loggers will inherit
    logging.root.handlers = [root_handler]
    logging.root.setLevel(logging.DEBUG)
    

    From that point on, using the logging package properly in any of your modules is simply a matter of:

    import logging
    logger = logging.getLogger(__name__)
    logger.info(ANY_FLAVOR_OF_STRING_CHEESE_YOU_LIKE)
    

    Which should produce something like:

    INFO     2015-08-19 15:00:00,413 mymodule.py:4] mylogger: Hello, world! ☺
    

    Each logger you create (including the root logger) can have multiple handlers attached to it. If you ever need to create additional handlers and want to keep the same formatting across the board, just reference the root logger’s formatter:

    root_formatter = logging.root.handlers[0].formatter
    new_handler = logging.FileHandler('file.log', encoding='utf-8')
    new_handler.setFormatter(root_formatter)
    logger.addHandler(new_handler)
    

 

  • DON’T pass Exception instances into logging 

    This used to be a fairly common practice among our team:

    >>> try:
    ...     <code that raises an exception>
    ... except SomeException as e:
    ...     logger.warn(e)
    

    And at first glance, it seems to work just fine…

    >>> e = Exception('hi there!')
    >>> logger.warn(e)
    WARNING:logger:hi there!
    

    …even with UTF-8 data…

    >>> e = Exception('\xe2\x98\x83')
    >>> logger.warn(e)
    WARNING:logger:☃
    

    …but it does not work when the exception message is a Unicode object!

    >>> e = Exception(u'\u2603')
    >>> logger.warn(e)
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/logging/__init__.py", line 859, in emit
        msg = self.format(record)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/logging/__init__.py", line 732, in format
        return fmt.format(record)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/logging/__init__.py", line 474, in format
        s = self._fmt % record.__dict__
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)
    Logged from file <stdin>, line 1
    

    The difficulties incurred when raising exceptions with UTF-8 encoded binary data instead of Unicode objects have already been discussed, so that’s clearly not an acceptable solution to this problem.

  • DO use the logging module’s exception() API

    The exception() API is great at bulletproofing your exception logging:

    >>> try:
    ...     raise Exception(u'\u2603')
    ... except Exception:
    ...     logger.exception('Wild snowman appeared!')
    ...
    ERROR:logger:Wild snowman appeared!
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    Exception: \u2603
    

    What’s more, exception() is really just a shorthand way of writing logger.error('...', exc_info=True), which means you can use this functionality at any level you want because all logging level methods accept the exc_info argument. (If you’ve been reading straight through this document, that exc_info argument should look familiar to you.)

    It works with exceptions containing nasty UTF-8 encoded bytestring messages, too!
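    For instance, here’s a minimal sketch of the longhand form at the warning level (assuming the same logger as above):

    >>> try:
    ...     raise Exception(u'\u2603')
    ... except Exception:
    ...     logger.warning('Another wild snowman appeared!', exc_info=True)
    ...
    WARNING:logger:Another wild snowman appeared!
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    Exception: \u2603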

  • DON’T use APIs blindly 

    As much as Points 2 and 3 of the rule of thumb above are good to adhere to in general—namely, the rules recommending the use of Unicode objects everywhere in your code, and that they should only be serialized before they’re sent off somewhere else—there are some cases where you may need to serialize (.encode()) your Unicode objects a bit early, due to brain-dead APIs that don’t respect them:

    >>> from urllib import urlencode
    >>> urlencode({'ascii': 'abc'})
    'ascii=abc'
    >>> urlencode({'utf8': u'\ufffd'.encode('utf-8')})
    'utf8=%EF%BF%BD'
    >>> urlencode({'unicode': u'\ufffd'})
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1332, in urlencode
        v = quote_plus(str(v))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
    

    Fortunately, things got a lot better with Python 3 in this regard (at least as far as concerns urlencode())…

    >>> from urllib.parse import urlencode
    >>> urlencode({'ascii': 'abc', 'unicode': '\ufffd', 'utf-8': b'\xef\xbf\xbd'})
    'unicode=%EF%BF%BD&ascii=abc&utf-8=%EF%BF%BD'
    

    …but Python 2 devs still need to be careful.
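    One way to be careful, sketched here with a hypothetical helper, is to serialize each value at the last possible moment, right before it hits the brain-dead API:

    >>> def encode_params(params):
    ...     """Serialize unicode values as UTF-8 just before urlencode()."""
    ...     return {k: v.encode('utf-8') if isinstance(v, unicode) else v
    ...             for k, v in params.items()}
    ...
    >>> urlencode(encode_params({'unicode': u'\ufffd'}))
    'unicode=%EF%BF%BD'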

  • DO use APIs intelligently

    Sometimes APIs even get things a little too right, proving once again that Postel’s Law (be liberal in what you accept, and conservative in what you send) is a guiding principle, not a hard-and-fast rule. Why? Well, for one thing, it simply isn’t always as useful as it might sound at first. For example, take the json API in Python 2.7: it accepts almost anything and dumps it without complaining, but it only outputs Unicode objects when reading previously dumped data back in. Sounds like exactly what we want, right?

    >>> import json
    >>> d = {'ascii': 'abc', 'unicode': u'\ufffd', 'utf-8': '\xef\xbf\xbd'}
    >>> s = json.dumps(d)
    >>> s
    '{"ascii": "abc", "utf-8": "\\ufffd", "unicode": "\\ufffd"}'
    >>> j = json.loads(s)
    >>> j
    {u'ascii': u'abc', u'utf-8': u'\ufffd', u'unicode': u'\ufffd'}
    

    Most of the time, yes, absolutely. Sometimes, though, it would be nice to have the option of getting the data back in a different form (especially from an API whose internals involve a lot of recursive logic that would be difficult to duplicate), if for no other reason than to make it, well, just a little bit more consumable:

    >>> urlencode(j)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1332, in urlencode
        v = quote_plus(str(v))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
    

    Granted, this behavior is more a fault of the urlencode() function than it is of the json API, but even so, it would have been helpful if the json API allowed a little more freedom as to what kinds of data it yields. Unfortunately, there’s not really an easy way to get it to behave differently, outside of resorting to custom helper functions:

    >>> def transcode(encoding='utf-8'):
    ...     def encoder(data):
    ...         if isinstance(data, dict):
    ...             return dict(map(encoder, data.items()))
    ...         elif hasattr(data, '__iter__'):
    ...             return type(data)(map(encoder, data))
    ...         elif isinstance(data, basestring):
    ...             return data.encode(encoding)
    ...         else:
    ...             return data
    ...     return encoder
    ...
    >>> json.loads(s, object_hook=transcode('utf-8'))
    {'ascii': 'abc', 'utf-8': '\xef\xbf\xbd', 'unicode': '\xef\xbf\xbd'}
    >>> urlencode(_)
    'ascii=abc&utf-8=%EF%BF%BD&unicode=%EF%BF%BD'
    

    Fortunately, the other side of the json API—the .dumps() side—isn’t quite so stringent, and it doesn’t require nearly as much elbow grease to get it to behave differently, depending upon your situation:

    >>> json.dumps({'unicode': u'\ufffd'}, ensure_ascii=False).encode('utf-8')
    '{"unicode": "\xef\xbf\xbd"}'
    >>> json.dumps({'utf-8': '\xef\xbf\xbd'}, ensure_ascii=False)
    '{"utf-8": "\xef\xbf\xbd"}'
    

    However, when using the ensure_ascii=False argument, the json.dumps() API is not without its own fair share of error-throwing behaviors, as mixing-and-matching input data plainly shows (proving yet again that it’s almost always a bad idea to pass around binary strings in memory: stick to Unicode objects):

    >>> json.dumps({'unicode': u'\ufffd', 'utf-8': u'\ufffd'.encode('utf-8')}, ensure_ascii=False)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 250, in dumps
        sort_keys=sort_keys, **kw).encode(obj)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 210, in encode
        return ''.join(chunks)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
    

 

  • Above all else, DO test your code!

    Unicode and UTF-8 themselves are easy. But getting APIs (even built-ins) to behave can be tricky, especially with languages like Python 2 that were crafted years ago, during the infancy of the transition from ASCII to UTF-8.

    At the end of the day, the best thing you can do to avoid these kinds of scenarios is to consider your data sources (remember: they’re all UTF-8 now, unless specifically documented as being otherwise), and thoroughly test all your edge cases! Every single one of these errors is something that can be caught and accounted for with a sufficient amount of unit testing. It doesn’t even take a Herculean effort to guard against them; you shouldn’t have to add any extra tests (unless your test count is, well, zero). Just take all of your regular tests that pass around textual data, and make sure they do two things:

    1. They pass around all their textual data as Unicode objects.

    2. Their Unicode objects all contain ASCII-unfriendly characters.

    Seriously, even Google does it:

    [Image: a Google search for the snowman character]

    Oh, hey, look, it’s our friend the Snowman! ☃

    If you truly want to go above and beyond, try testing with characters beyond U+FFFF (Plane 0), but even just testing with something—anything—beyond U+007F (ASCII) is usually sufficient.
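    Here’s a minimal sketch of what that looks like in practice (the greet() function is hypothetical; it stands in for whatever code of yours passes text around):

    # -*- coding: utf-8 -*-
    import unittest

    def greet(name):
        # Hypothetical function under test; it works purely with Unicode objects.
        return u'Hello, ' + name + u'!'

    class GreetingTests(unittest.TestCase):
        def test_greet_handles_non_ascii(self):
            # 1. Pass textual data around as Unicode objects...
            # 2. ...and make sure they contain ASCII-unfriendly characters.
            self.assertEqual(greet(u'snowman \u2603'), u'Hello, snowman \u2603!')

    if __name__ == '__main__':
        unittest.main()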

Wrapping up 🎁

Well, that’s about all I’ve got for now. Hopefully, it will be useful.

If you have a better experience with some of these languages, or if you know of some gotchas, caveats, or best practices for Unicode and UTF-8 that have thus far gone unmentioned or undocumented, feel free to comment.

Good luck and Godspeed!

Happy unicoding!
