Reading Unicode file data with BOM chars in Python

There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see utf-8-sig correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and not worry about it

Leave a Comment