Unicode Identifiers and Source Code in C++11?

Is the new standard more open w.r.t. Unicode?

With respect to allowing universal character names (UCNs) in identifiers, the answer is no: UCNs were already allowed in identifiers in C99 and C++98. However, compilers did not implement that particular requirement until recently. Clang 3.3, I think, introduces support for this, and GCC has had an experimental feature for it for some time. Herb Sutter also mentioned during his Build 2013 talk “The Future of C++” that this feature would be coming to VC++ at some point. (Although, IIRC, Herb refers to it as a C++11 feature, it is in fact a C++98 feature.)

It’s not expected that identifiers will be written using UCNs. Instead, the expected practice is to write the desired character directly in the source encoding. For example, source will look like:

long pörk;

not:

long p\u00F6rk;

However, UCNs are also useful for another purpose. Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme in which at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII-compatible encoding).

UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful, for example, when writing the string literal “°” in source code that will be compiled both as CP1252 and as UTF-8:

char const *degree_sign = "\u00b0";

This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.
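As a rough illustration of this, the sketch below puts the narrow literal next to its char16_t and char32_t counterparts. The helper names are invented for this example. What the narrow literal contains depends on the execution character set your compiler uses, so treat the byte values in the comments as typical cases rather than guarantees.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helpers, named for this example only.

// Narrow literal: its bytes depend on the compiler's execution character set,
// e.g. one byte (0xB0) under CP1252 or two bytes (0xC2 0xB0) under UTF-8.
inline std::size_t degree_sign_narrow_length() {
    char const *degree_sign = "\u00b0";
    return std::strlen(degree_sign);
}

// char16_t and char32_t literals (C++11) are UTF-16 and UTF-32 respectively,
// so the first code unit is U+00B0 regardless of the source encoding.
inline char16_t degree_sign_utf16() { return u"\u00b0"[0]; }
inline char32_t degree_sign_utf32() { return U"\u00b0"[0]; }
```

Because the UCN is written with basic characters only, the same file can be fed to compilers that disagree about the source encoding.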

Can (portable) source code be in any Unicode encoding, like UTF-8, UTF-16 or any (however-defined) codepage?

It’s not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals). GCC allows the source encoding to be specified (via -finput-charset) and includes support for UTF-8. VC++ will guess at the encoding and can be made to guess UTF-8.

(Update: VS2015 now provides an option, /utf-8, to force the source and execution character sets to be UTF-8.)

Can I write an identifier with \u1234 in it, e.g. myfu\u1234ntion (for whatever purpose)?

Yes, the specification mandates this, although, as I said, not all compilers implement this requirement yet.

Or can I use the “character names” that Unicode defines, like in ICU, i.e.

const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;

No, you cannot use Unicode long names.
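For comparison, a version of that literal that should compile under C++11 spells the character as a UCN and uses the U prefix for a char32_t string (the u32 suffix in the question is not C++ syntax). The function name below is an assumption made for the example.

```cpp
#include <string>

// Hypothetical helper: returns the sentence from the question as UTF-32.
// \u00E4 is LATIN SMALL LETTER A WITH DIAERESIS.
inline std::u32string bear_sentence() {
    return U"German Braunb\u00E4r.";
}
```

Each element of the resulting string is one code point, so the character at index 13 is U+00E4.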

or even in an identifier in the source itself? That would be a treat… cough…

If the compiler supports a source code encoding that contains the extended character you want, then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that implements this requirement of the C++ spec, you may write any character in its source character set directly in the source without bothering to write UCNs.
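A small sketch of that equivalence, assuming the file is saved as UTF-8 and the compiler accepts extended characters in identifiers (Clang 3.3 or later, or a recent GCC, should do). The variable is the one from the earlier example; the function name is made up here.

```cpp
// Assumes this file is saved as UTF-8 and the compiler accepts extended
// characters in identifiers.

// Declared with a UCN...
long p\u00F6rk = 42;

// ...and referenced with the literal character. Both spellings name the
// same identifier, so this compiles and returns the value stored above.
inline long read_pork() { return pörk; }
```

If the compiler rejects the second spelling, it is most likely reading the file in a different source encoding than the one it was saved in.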
