SQL Server – defining an XML type column with UTF-8 encoding

Is there a way to define a SQL Server column/field as having UTF-8 encoding?

No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT (deprecated as of SQL Server 2005 so don’t use this in new development; besides, it sucks compared to NVARCHAR(MAX) anyway), and XML datatypes are handled. You do not get a choice of Unicode encodings like some other RDBMS’s allow.

You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:

  1. The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
  2. The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
  3. The byte sequence needs to be the actual UTF-8 bytes.

For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (and we can get the UTF-8 byte sequence for that Supplementary Character by following that link):

SET NOCOUNT ON;
DECLARE @XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
                    + CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
                    + '</test></root>';

SELECT @XML;
PRINT CONVERT(NVARCHAR(MAX), @XML);

Returns (in both “Results” and “Messages” tabs):

<root><test>😱</test></root>

You mentioned in a comment on @Shnugo’s answer:

I’ve had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column. Would there be a hidden problem?

No, you didn’t store UTF-8 encoded anything in an NVARCHAR column (besides, there is no 2013 version of SQL Server, but that is probably just a typo). NVARCHAR is only ever UTF-16 Little Endian. Most likely your UTF-8 stream got converted into UTF-16 LE by the database driver during transit into SQL Server. This is the same encoding that an XML column would use, but the XML column would have tried to convert the stream from UTF-8 into UTF-16 but failed due to it already being UTF-16. This also means that on the way out of SQL Server, the XML document stored in the NVARCHAR column would still have the XML declaration stating that the encoding is UTF-8, but it’s definitely not UTF-8.

If you absolutely need the data to be UTF-8 on the way out because you don’t want to convert the UTF-16 LE coming out of SQL Server XML or NVARCHAR into UTF-8, then you have no choice but to store the data as VARBINARY(MAX).

Leave a Comment