SQLite, Python, Unicode, and non-UTF data

I’m still ignorant of whether there is a way to correctly convert ‘ó’ from latin-1 to utf-8 and not mangle it

repr() and unicodedata.name() are your friends when it comes to debugging such problems:

>>> oacute_latin1 = "\xF3"
>>> oacute_unicode = oacute_latin1.decode('latin1')
>>> oacute_utf8 = oacute_unicode.encode('utf8')
>>> print repr(oacute_latin1)
'\xf3'
>>> print repr(oacute_unicode)
u'\xf3'
>>> import unicodedata
>>> unicodedata.name(oacute_unicode)
'LATIN SMALL LETTER O WITH ACUTE'
>>> print repr(oacute_utf8)
'\xc3\xb3'
>>>

If you send oacute_utf8 to a terminal that is set up for latin1, you will get A-tilde followed by superscript-3.
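You can reproduce that effect without a terminal. This is a minimal sketch, written for Python 3 (where the session above would use bytes and str instead of str and unicode); the variable names are only illustrative:

```python
import unicodedata

# Python 3 sketch: what a latin1 terminal makes of UTF-8 bytes.
oacute_utf8 = "\u00f3".encode("utf-8")   # the two bytes C3 B3

# A latin1 terminal treats each byte as one character.
seen_on_latin1_terminal = oacute_utf8.decode("latin-1")

# Name each character the terminal would display.
names = [unicodedata.name(ch) for ch in seen_on_latin1_terminal]
print(ascii(seen_on_latin1_terminal), names)
```

One character went in, two came out, and no exception was raised anywhere.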

I switched to Unicode strings.

What are you calling Unicode strings? UTF-16?

What gives? After reading this, describing exactly the same situation I’m in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all.

I can’t imagine how it seems so to you. The story that was being conveyed was that unicode objects in Python and UTF-8 encoding in the database were the way to go. However Martin answered the original question, giving a method (“text factory”) for the OP to be able to use latin1 — this did NOT constitute a recommendation!
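For reference, this is roughly what a text factory looks like. It is not Martin's code, just a sketch assuming Python 3's sqlite3 module, where text_factory is a callable that receives the raw bytes of a TEXT value. The table name t and the CAST trick used to plant a latin1 byte in a TEXT column are made up for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")  # hypothetical table

# Simulate legacy data: store the single latin1 byte 0xF3 as TEXT.
conn.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", (b"\xf3",))

# Tell sqlite3 how to turn the stored bytes into a str.
conn.text_factory = lambda raw: raw.decode("latin-1")

(name,) = conn.execute("SELECT name FROM t").fetchone()
print(ascii(name))
conn.close()
```

This only gets you reading latin1 out of an existing database. It is still not a recommendation to store latin1 in the first place.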

Update in response to these further questions raised in a comment:

I didn’t understand that the unicode characters still contained an implicit encoding. Am I saying that right?

No. An encoding is a mapping between Unicode and something else, and vice versa. A Unicode character doesn’t have an encoding, implicit or otherwise.
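To make that concrete, here is a small sketch (Python 3 syntax; the variable names are only illustrative) showing one Unicode character mapped to three different byte sequences by three encodings:

```python
oacute = "\u00f3"  # one Unicode character: LATIN SMALL LETTER O WITH ACUTE

# Three encodings, three different byte sequences for the same character.
as_latin1 = oacute.encode("latin-1")
as_utf8 = oacute.encode("utf-8")
as_utf16le = oacute.encode("utf-16-le")

# Decoding each with the matching encoding gives back the same character.
round_trips = [
    as_latin1.decode("latin-1"),
    as_utf8.decode("utf-8"),
    as_utf16le.decode("utf-16-le"),
]
print(as_latin1, as_utf8, as_utf16le)
```

The character itself carries no trace of which byte sequence it came from.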

It looks to me like unicode("\xF3") and "\xF3".decode('latin1') are the same when evaluated with repr().

Say what? It doesn’t look like it to me:

>>> unicode("\xF3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal not in range(128)
>>> "\xF3".decode('latin1')
u'\xf3'
>>>

Perhaps you meant: u'\xf3' == '\xF3'.decode('latin1') … this is certainly true.

It is also true that unicode(str_object, encoding) does the same as str_object.decode(encoding) … including blowing up when an inappropriate encoding is supplied.
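The Python 3 counterpart is str(bytes_object, encoding) versus bytes_object.decode(encoding). This is a sketch under that assumption; blows_up is a made-up helper name. Note that in Python 3, str(b"\xf3") with no encoding argument does not decode at all; it just returns the repr of the bytes object.

```python
raw = b"\xf3"

# Both spellings decode the same way.
via_constructor = str(raw, "latin-1")
via_method = raw.decode("latin-1")

def blows_up(encoding):
    """Return True if decoding raw with this encoding raises."""
    try:
        str(raw, encoding)
    except UnicodeDecodeError:
        return True
    return False

print(via_constructor == via_method, blows_up("ascii"), blows_up("utf-8"))
```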

Is that a happy circumstance

That the first 256 characters in Unicode are the same, code for code, as the 256 characters in latin1 is a good idea. Because all 256 possible latin1 characters are mapped to Unicode, it means that ANY 8-bit byte, ANY Python str object can be decoded into unicode without an exception being raised. This is as it should be.

However there exist certain persons who confuse two quite separate concepts: “my script runs to completion without any exceptions being raised” and “my script is error-free”. To them, latin1 is “a snare and a delusion”.

In other words, if you have a file that’s actually encoded in cp1252 or gbk or koi8-u or whatever and you decode it using latin1, the resulting Unicode will be utter rubbish and Python (or any other language) will not flag an error; it has no way of knowing that you have committed a silliness.
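A sketch of both halves of that point, in Python 3 syntax. The sample word and the variable names are just for illustration:

```python
# Every possible byte value decodes under latin1, code for code.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Text that was really written in koi8-u ...
original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian greeting
raw = original.encode("koi8_u")

# ... decodes "successfully" as latin1, into rubbish.
wrong = raw.decode("latin-1")   # no exception raised
right = raw.decode("koi8_u")    # the actual text

print(ascii(wrong), ascii(right))
```

The script runs to completion either way; only one of the two results is correct.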

or is unicode("str") going to always return the correct decoding?

Just like that, with the default encoding being ascii, it will return the correct unicode if the file is actually encoded in ASCII. Otherwise, it’ll blow up.

Similarly, if you specify the correct encoding, or one that’s a superset of the correct encoding, you’ll get the correct result. Otherwise you’ll get gibberish or an exception.

In short: the answer is no.

If not, when I receive a python str that has any possible character set in it, how do I know how to decode it?

If the str object is a valid XML document, the encoding will be specified up front in the XML declaration. Default is UTF-8.
If it’s a properly constructed web page, it should be specified up front (look for “charset”). Unfortunately many writers of web pages lie through their teeth: a page declared as ISO-8859-1 (aka latin1) is often really Windows-1252 (aka cp1252), and for a page declared as gb2312, don’t waste resources trying to decode it as gb2312; use gbk instead. You can get clues from the nationality/language of the website.

UTF-8 is always worth trying. If the data is ascii, it’ll work fine, because ascii is a subset of utf8. A string of text that has been written using non-ascii characters and has been encoded in an encoding other than utf8 will almost certainly fail with an exception if you try to decode it as utf8.
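A minimal sketch of "try UTF-8 first", in Python 3 syntax. try_utf8 is a hypothetical helper name, not something from a library:

```python
def try_utf8(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

ascii_result = try_utf8(b"plain ascii")                    # ASCII is a subset
utf8_result = try_utf8("caf\u00e9".encode("utf-8"))        # genuine UTF-8
latin1_result = try_utf8("caf\u00e9".encode("latin-1"))    # not UTF-8
print(ascii_result, ascii(utf8_result), latin1_result)
```

A None here tells you only that the data is not UTF-8. It says nothing about what the data actually is.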

All of the above heuristics and more and a lot of statistics are encapsulated in chardet, a module for guessing the encoding of arbitrary files. It usually works well. However you can’t make software idiot-proof. For example, if you concatenate data files written some with encoding A and some with encoding B, and feed the result to chardet, the answer is likely to be encoding C with a reduced level of confidence e.g. 0.8. Always check the confidence part of the answer.
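Checking the confidence can be as simple as the sketch below. It assumes the guess is a dict shaped like the one chardet.detect() returns, with "encoding" and "confidence" keys. The helper name accept_guess and the 0.9 threshold are my own assumptions, not part of chardet:

```python
def accept_guess(guess, min_confidence=0.9):
    """Return the guessed encoding name, or None if the guess is too weak.

    `guess` is assumed to look like the result of chardet.detect(),
    e.g. {"encoding": "utf-8", "confidence": 0.99}.
    """
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding

strong = accept_guess({"encoding": "utf-8", "confidence": 0.99})
weak = accept_guess({"encoding": "ISO-8859-2", "confidence": 0.8})
print(strong, weak)
```

With chardet installed you would call it as accept_guess(chardet.detect(your_data)), and treat a None result as "go and look at the data yourself".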

If all else fails:

(1) Try asking here, with a small sample from the front of your data … print repr(your_data[:400]) … and whatever collateral info about its provenance that you have.

(2) Recent Russian research into techniques for recovering forgotten passwords appears to be quite applicable to deducing unknown encodings.

Update 2 BTW, isn’t it about time you opened up another question ?-)

One more thing: there are apparently characters that Windows uses as Unicode for certain characters that aren’t the correct Unicode for that character, so you may have to map those characters to the correct ones if you want to use them in other programs that are expecting those characters in the right spot.

It’s not Windows that’s doing it; it’s a bunch of crazy application developers. It would have been clearer had you quoted, rather than paraphrased, the opening paragraph of the effbot article that you referred to:

Some applications add CP1252 (Windows, Western Europe) characters to documents marked up as ISO 8859-1 (Latin 1) or other encodings. These characters are not valid ISO-8859-1 characters, and may cause all sorts of problems in processing and display applications.

Background:

The range U+0000 to U+001F inclusive is designated in Unicode as “C0 Control Characters”. These exist also in ASCII and latin1, with the same meanings. They include such familiar things as carriage return, line feed, bell, backspace, tab, and others that are used rarely.

The range U+0080 to U+009F inclusive is designated in Unicode as “C1 Control Characters”. These exist also in latin1, and include 32 characters that nobody outside unicode.org can imagine any possible use for.

Consequently, if you run a character frequency count on your unicode or latin1 data, and you find any characters in that range, your data is corrupt. There is no universal solution; it depends on how it became corrupted. The characters may have the same meaning as the cp1252 characters at the same positions, and thus the effbot’s solution will work. In another case that I’ve been looking at recently, the dodgy characters appear to have been caused by concatenating text files encoded in UTF-8 and another encoding which needed to be deduced based on letter frequencies in the (human) language the files were written in.
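Here is a sketch of the frequency count, plus a repair for the one case where the dodgy characters really are cp1252 bytes that were decoded as latin1. This is not the effbot's code; the function names are made up, it is written for Python 3, and it leaves alone the few byte values that Python's cp1252 codec treats as undefined:

```python
from collections import Counter

def c1_counts(text):
    """Count characters in the C1 control range U+0080 to U+009F."""
    return Counter(ch for ch in text if "\x80" <= ch <= "\x9f")

def c1_as_cp1252(text):
    """Reinterpret C1 control characters as the cp1252 characters
    at the same positions; leave everything else untouched."""
    out = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # position undefined in cp1252: keep the original
        out.append(ch)
    return "".join(out)

suspect = "\x93hi\x94 \x96 caf\xe9"   # cp1252 data decoded as latin1
counts = c1_counts(suspect)
repaired = c1_as_cp1252(suspect)
print(dict(counts), ascii(repaired))
```

Run the count first. Apply the repair only if the characters it turns up make sense as cp1252 (curly quotes, dashes, the euro sign and so on) in context.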
