UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)

Since you state: "All of my (text) files are currently stored in UTF-8 with the BOM", use the 'utf-8-sig' codec to decode them:

    >>> s = u'Hello, world!'.encode('utf-8-sig')
    >>> s
    '\xef\xbb\xbfHello, world!'
    >>> s.decode('utf-8-sig')
    u'Hello, world!'

It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
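The session above uses Python 2 syntax. As a minimal sketch of the same idea in Python 3, assuming the goal is to rewrite a file without its BOM, something like the following could work. The helper names `decode_utf8` and `remove_bom_from_file` and the throwaway `.css` demo file are illustrative, not part of the original answer.

```python
import codecs
import os
import tempfile


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes; 'utf-8-sig' drops a leading BOM if one is present."""
    return data.decode("utf-8-sig")


def remove_bom_from_file(path: str) -> None:
    """Rewrite a UTF-8 text file in place without a BOM (hypothetical helper).

    newline="" keeps the file's original line endings untouched.
    """
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Demo on a throwaway file that starts with the UTF-8 BOM.
fd, demo_path = tempfile.mkstemp(suffix=".css")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"body { margin: 0; }\r\n")

remove_bom_from_file(demo_path)

with open(demo_path, "rb") as f:
    cleaned = f.read()  # raw bytes after the rewrite
os.remove(demo_path)
```

Reading with 'utf-8-sig' and writing with plain 'utf-8' is what removes the BOM; writing with 'utf-8-sig' would add it back.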

Removing BOM characters using Java [duplicate]

Java does not handle the BOM properly. In fact, Java treats a BOM like any other character. I found this: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

    public static final String UTF8_BOM = "\uFEFF";

    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }

Alternatively, I might use Apache Commons IO instead: http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html

VBA Output to file using UTF-16

Your point about UTF-8 not being able to store all characters you need is invalid. UTF-8 is able to store every character defined in the Unicode standard. The only difference is that, for text in certain languages, UTF-8 can take more space to store its codepoints than, say, UTF-16. The opposite is also true: for …
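To make the size trade-off concrete, here is a small Python sketch that compares encoded lengths. The sample strings and the helper name `encoded_sizes` are chosen purely for illustration, and 'utf-16-le' is used so that no BOM is counted in the UTF-16 figure.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


english = "Hello, world!"    # 13 ASCII characters
japanese = "こんにちは世界"    # 7 characters from the Basic Multilingual Plane

english_sizes = encoded_sizes(english)    # 1 byte each in UTF-8, 2 in UTF-16
japanese_sizes = encoded_sizes(japanese)  # 3 bytes each in UTF-8, 2 in UTF-16

# Both encodings cover the whole Unicode range, including the last code point.
highest = chr(0x10FFFF)
highest_sizes = encoded_sizes(highest)
```

ASCII-heavy text comes out smaller in UTF-8, while text made mostly of characters such as kana or CJK ideographs comes out smaller in UTF-16; neither encoding is unable to represent a character the other can.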

Create Text File Without BOM

Well, it writes the BOM because you are instructing it to, in the line

    Encoding utf8WithoutBom = new UTF8Encoding(true);

true means that the BOM should be emitted. Using

    Encoding utf8WithoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

writes no BOM.

"My objective is create a file using UTF-8 as Encoding and 8859-1 as CharSet"

Sadly, this is not …

Adding UTF-8 BOM to string/Blob

Prepend \ufeff to the string. See http://msdn.microsoft.com/en-us/library/ie/2yfce773(v=vs.94).aspx

See the discussion between @jeff-fischer and @casey for details on UTF-8 and UTF-16 and the BOM. What actually makes the above work is that the character U+FEFF (written \ufeff in a string literal) always represents the BOM, regardless of whether UTF-8 or UTF-16 is used; the encoder turns it into the matching byte sequence. See p.36 in The Unicode Standard 5.0, Chapter …
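The question concerns JavaScript, but the underlying point is language-independent. As a rough illustration in Python, the sketch below shows the single character U+FEFF turning into different byte sequences depending on the encoding. The helper name `with_bom` is hypothetical.

```python
BOM = "\ufeff"  # one character; its byte form depends on the encoding used

utf8_bytes = BOM.encode("utf-8")          # EF BB BF
utf16_le_bytes = BOM.encode("utf-16-le")  # FF FE
utf16_be_bytes = BOM.encode("utf-16-be")  # FE FF


def with_bom(text: str) -> str:
    """Prepend U+FEFF unless the text already starts with it (hypothetical helper)."""
    return text if text.startswith(BOM) else BOM + text


payload = with_bom("id,name\n1,Zoë\n").encode("utf-8")
```

Because the BOM is added at the string level, the same `with_bom` call works whichever encoding is applied afterwards.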

How to avoid tripping over UTF-8 BOM when reading files

With Ruby 1.9.2 you can use the mode r:bom|utf-8:

    text_without_bom = nil # define the variable outside the block to keep the data
    File.open('file.txt', "r:bom|utf-8") { |file|
      text_without_bom = file.read
    }

or

    text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

    text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn’t matter whether the BOM is present in the file or not. You may …

R’s read.csv prepending 1st column name with junk text [duplicate]

You’ve got a Unicode UTF-8 BOM at the start of the file: http://en.wikipedia.org/wiki/Byte_order_mark

A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this. R is giving you the ï and then converting the other two into dots, as they are non-alphanumeric characters.

Here: http://r.789695.n4.nabble.com/Writing-Unicode-Text-into-Text-File-from-R-in-Windows-td4684693.html Duncan …
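The question is about R, but the byte-level effect can be reproduced in any language. As a small Python sketch, with made-up CSV content, decoding the three BOM bytes with the wrong codec produces exactly the junk described above:

```python
import codecs

# Hypothetical CSV content saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + b"id,name\n1,a\n"

# Misreading the file as CP1252 turns the BOM bytes into three visible characters.
bom_as_cp1252 = codecs.BOM_UTF8.decode("cp1252")
first_column_wrong = raw.decode("cp1252").splitlines()[0].split(",")[0]

# Decoding as UTF-8 with BOM handling leaves the header clean.
first_column_right = raw.decode("utf-8-sig").splitlines()[0].split(",")[0]
```

Here `first_column_wrong` holds the junk-prefixed name, while `first_column_right` is the plain column name.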