Properly print UTF-8 characters in the Windows console

By default, the wide print functions on Windows do not handle characters outside the ASCII range.

There are a few ways to get Unicode data to the Windows console.

1. Use the console API directly, via WriteConsoleW. You’ll have to ensure you’re actually writing to a console and use other means when the output goes somewhere else.
2. Set the mode of the standard output file descriptor to one of the ‘Unicode’ modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they’re used on file descriptors that don’t represent a console, the output stream of bytes is UTF-16 or UTF-8 respectively. N.B. after setting these modes, the non-wide character functions on the corresponding stream are unusable and result in a crash. You must use only the wide character functions.
3. Set the console output codepage to CP_UTF8 and print UTF-8 text directly, which works only if you use the right functions. Most of the higher level functions such as basic_ostream<char>::operator<<(char*) don’t work this way, but you can either use lower level functions or implement your own ostream that works around the problem the standard functions have.
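The first two options might look roughly like the sketch below. The function names write_to_console and enable_wide_stdout are made up for illustration, and the non-Windows branches exist only so the code compiles elsewhere. Using GetConsoleMode to detect a console is one common approach, not the only one.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <fcntl.h>  // _O_U16TEXT
#include <io.h>     // _setmode, _fileno
#endif

// Option 1 (hypothetical helper): write UTF-16 text straight to the console.
// Returns false when stdout is not a console (redirected to a file or pipe,
// or not running on Windows), so the caller can fall back to other output.
bool write_to_console(const std::wstring& text) {
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (out == INVALID_HANDLE_VALUE || !GetConsoleMode(out, &mode)) {
        return false;  // GetConsoleMode fails for non-console handles
    }
    DWORD written = 0;
    return WriteConsoleW(out, text.data(), static_cast<DWORD>(text.size()),
                         &written, nullptr) != 0;
#else
    (void)text;
    return false;
#endif
}

// Option 2 (hypothetical helper): switch stdout to UTF-16 text mode.
// After this succeeds, use only wide functions (wprintf, std::wcout) on stdout.
bool enable_wide_stdout() {
#ifdef _WIN32
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    return false;
#endif
}
```

A caller would typically try write_to_console first and, when it returns false, write encoded bytes to the redirected stream instead.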

The problem with the third method is this:

putchar('\302'); putchar('\260'); // doesn't work with CP_UTF8

puts("\302\260"); // correctly writes UTF-8 data to Windows console with CP_UTF8

Unlike on most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It’s a special device created and owned by the program and accessed via its own Win32 API. When the console is written to, the API sees only the data passed in that one call, and the conversion from narrow characters to wide characters happens without considering that the data may be incomplete. When a multibyte character is split across more than one call to the console API, each piece is seen on its own as an illegal encoding and is treated as such.
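To see why the split bytes are rejected, it helps to check each piece as if it were the whole input. The sketch below is illustrative only: utf8_sequence_length and is_whole_utf8 are hypothetical helpers that check UTF-8 structure (lead byte plus continuation bytes), not every rule of the encoding, and they are not what the console itself runs.

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence that `lead` starts, or 0 if `lead` cannot
// start a sequence (a continuation byte 0x80..0xBF or an invalid lead byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // two-byte sequence
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // three-byte sequence
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // four-byte sequence
    return 0;
}

// True if `chunk`, taken entirely on its own, is made of whole sequences.
bool is_whole_utf8(const std::string& chunk) {
    std::size_t i = 0;
    while (i < chunk.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(chunk[i]));
        if (len == 0 || i + len > chunk.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(chunk[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
        }
        i += len;
    }
    return true;
}
```

With these helpers, is_whole_utf8("\302\260") holds, while "\302" alone is a lead byte missing its continuation and "\260" alone is a continuation byte with no lead. That matches the one-call-versus-two-calls behaviour described above.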

It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem whereas whatever team works on the console probably doesn’t care.

You might solve it by implementing your own streambuf subclass that does the conversion to wchar_t correctly, i.e. one that accounts for the fact that the bytes of a multibyte character may arrive separately and maintains conversion state between writes (e.g., with std::mbstate_t).
