Classic ASP – How to convert a UTF-8 string to UCS-2?

My suspicion is you are falling foul of the classic form post character encoding mismatch problem.

It goes like this:-

  • You have a form which is presented to the client using the UTF-8 encoding.
  • As a result the browser posts text values entered into the form using UTF-8 encoding.
  • The action page receiving the post has its Response.CodePage set to a typical ANSI codepage such as 1252.
  • Each byte of the posted UTF-8 string is treated by the server as an individual character, rather than each sequence of UTF-8 bytes being decoded to the correct Unicode character. For example, é arrives as the two bytes C3 A9 and is read as the two characters Ã and ©.
  • The string is stored in the DB with the now corrupted characters.
  • A page wishes to present to the client the content of a DB field containing the corrupted characters.
  • The page sets its CharSet to UTF-8 but its Response.CodePage remains at the ANSI codepage such as 1252.
  • Response.Write is used to send the field content to the client. The Unicode characters are transformed back to the same byte-for-byte sequence that was received in the earlier post.
  • The client thinks it's getting UTF-8, so it decodes the bytes received from the server as UTF-8, just as they were originally encoded, and they appear on screen correctly.
  • Everything proceeds fine as if all is ok whilst these characters are simply being bounced back and forth through ASP. A bug in one page has a matching bug in the other (could be the same page) which makes everything look fine.

If you examine the field contents directly with SQL Server tools you will likely see the corrupted strings there. The bug only surfaces now because you want to use the string with another component, which expects a straightforward Unicode string.

The solution is to always ensure that all your pages not only send CharSet = "UTF-8" in the response but also set Response.CodePage = 65001 before using Response.Write and before attempting to read any Request.Form values. Use the CodePage attribute in the <%@ %> page header directive.
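As a rough sketch, the top of a page set up this way might look like the following. The form field name "comment" is made up for illustration, and this is one plausible arrangement rather than the only valid one.

```vbscript
<%@ Language="VBScript" CodePage="65001" %>
<%
' Tell the browser the response is UTF-8 ...
Response.CharSet = "UTF-8"
' ... and tell ASP to use UTF-8 both when decoding posted form data
' and when encoding strings passed to Response.Write.
Response.CodePage = 65001

' "comment" is a hypothetical form field used only for this example.
Dim sComment
sComment = Request.Form("comment")

' The posted bytes are now decoded as UTF-8, so sComment holds real
' Unicode characters rather than one character per byte.
Response.Write Server.HTMLEncode(sComment)
%>
```

If the .asp file itself contains literal non-ASCII text, it most likely needs to be saved as UTF-8 too, so that it matches the CodePage declared in the directive.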

Now you are left with repairing the corrupt strings already in your DB.

Use an ADODB.Stream:-

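Function ConvertFromUTF8(sIn)

    Dim oIn: Set oIn = CreateObject("ADODB.Stream")

    oIn.Open
    ' Write the corrupted string out as Windows-1252 to recover the original bytes
    oIn.CharSet = "Windows-1252"
    oIn.WriteText sIn
    ' Rewind, then read the same bytes back as UTF-8
    oIn.Position = 0
    oIn.CharSet = "UTF-8"
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close

End Function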

This function (which, by the way, is the answer to your actual question) takes a corrupted string (one that holds the byte-for-byte representation) and converts it to the string it should have been; for example, it turns Ã© back into é. You need to apply this transform to every field in the DB that has fallen victim to the bug.
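One way to apply the transform across a table is a simple ADO recordset loop. This is only a sketch: the connection string, the table name Comments and the column names ID and Comment are placeholders, the column is assumed to be a Unicode type such as NVARCHAR, and ConvertFromUTF8 is assumed to be defined in the same script.

```vbscript
' ADO constants, declared here because VBScript does not know them by default.
Const adOpenKeyset = 1
Const adLockOptimistic = 3
Const adCmdText = 1

' Placeholder connection string; replace with your own.
Dim sConnStr
sConnStr = "Provider=SQLOLEDB;Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=SSPI;"

Dim oConn: Set oConn = CreateObject("ADODB.Connection")
oConn.Open sConnStr

' Hypothetical table and columns; an updatable cursor generally needs
' the table to have a primary key (ID here).
Dim oRS: Set oRS = CreateObject("ADODB.Recordset")
oRS.Open "SELECT ID, Comment FROM Comments", oConn, adOpenKeyset, adLockOptimistic, adCmdText

Do Until oRS.EOF
    If Not IsNull(oRS("Comment").Value) Then
        ' Replace the byte-for-byte string with the properly decoded one.
        oRS("Comment").Value = ConvertFromUTF8(oRS("Comment").Value)
        oRS.Update
    End If
    oRS.MoveNext
Loop

oRS.Close
oConn.Close
```

Run something like this once only, against a backup first, and restrict it to rows written by the buggy pages. Passing a string that was stored correctly (or that has already been repaired) through ConvertFromUTF8 can damage any non-ASCII characters in it.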
