Are the PHP preg_functions multibyte safe?

PCRE supports UTF-8 out of the box; see the documentation for the 'u' modifier. Illustration (\xC3\xA4 is the UTF-8 encoding of the German letter "ä"):
    echo preg_replace('~\w~', '@', "a\xC3\xA4b");
This echoes "@@¤@" because "\xC3" and "\xA4" were treated as distinct symbols.
    echo preg_replace('~\w~u', '@', "a\xC3\xA4b");
With the 'u' modifier this prints "@@@" because "\xC3\xA4" was treated as a …

Ruby 1.9: how can I properly upcase & downcase multibyte strings?

For anybody coming from Google by searching "ruby upcase utf8":
    > "your problem chars here çöğıü Iñtërnâtiônàlizætiøn".mb_chars.upcase.to_s
    => "YOUR PROBLEM CHARS HERE ÇÖĞIÜ IÑTËRNÂTIÔNÀLIZÆTIØN"
The solution is to use mb_chars, which ActiveSupport adds to String. Documentation:
https://www.rubydoc.info/gems/activesupport/String#mb_chars-instance_method
https://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html

str_replace() on multibyte strings dangerous?

No, you're right: using a single-byte string function on a multibyte string can cause an unexpected result. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:
    $string = mb_ereg_replace('"', '\\"', $string);
    $string = implode('\\"', mb_split('"', $string));
Edit: Here's an mb_replace implementation using the split-join variant:
    function mb_replace($search, $replace, $subject, &$count=0) { if (!is_array($search) && …

Multibyte trim in PHP?

The standard trim function trims a handful of space and space-like characters. These are defined as ASCII characters, which means specific bytes in the range 0000 0000 to 0111 1111. Proper UTF-8 input will never contain multibyte characters that are made up of bytes of the form 0xxx xxxx. All the bytes in proper UTF-8 multibyte characters start with 1xxx …
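The byte-range argument above can be sketched in C, chosen here only because it exposes the raw bytes; this is an illustration, not PHP's actual trim implementation. The names trim_ascii and is_ascii_space are hypothetical. The sketch strips the ASCII whitespace bytes byte by byte and relies on the fact that every byte of a UTF-8 multibyte character has its high bit set, so none of them can be mistaken for a space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII whitespace bytes, roughly the set PHP's trim() strips by default.
 * NUL is left out because it terminates a C string. */
static int is_ascii_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\v';
}

/* Trims buf in place and returns a pointer to the first kept byte.
 * Works on raw bytes: bytes >= 0x80 (all bytes of UTF-8 multibyte
 * characters) never match is_ascii_space, so no character is cut apart. */
char *trim_ascii(char *buf) {
    assert(buf != NULL);
    size_t len = strlen(buf);
    while (len > 0 && is_ascii_space((unsigned char)buf[len - 1])) {
        len--;
    }
    buf[len] = '\0';
    while (is_ascii_space((unsigned char)*buf)) {
        buf++;
    }
    return buf;
}
```

Called on the bytes " \t\xC3\xA4 b\xC3\xA4\n", trim_ascii returns "\xC3\xA4 b\xC3\xA4" ("ä bä"), with both two-byte characters intact.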

Printing UTF-8 strings with printf – wide vs. multibyte string literals

    printf("ο Δικαιοπολις εν αγρω εστιν\n");
prints the string literal (const char*; special characters are represented as multibyte characters). Although you might see the correct output, there are other problems you might run into when working with non-ASCII characters like these. For example:
    char str[] = "αγρω";
    printf("%zu %zu\n", sizeof(str), strlen(str));
outputs 9 8, since …
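To make the gap between bytes and characters concrete, here is a minimal sketch that counts characters (code points) rather than bytes. It assumes the input is valid UTF-8, and the function name utf8_length is hypothetical. It counts every byte that is not a continuation byte of the form 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a NUL-terminated string, assuming valid UTF-8.
 * Continuation bytes look like 10xxxxxx; every other byte starts a
 * new character, so only those are counted. */
size_t utf8_length(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For "αγρω" encoded as UTF-8 (the bytes CE B1 CE B3 CF 81 CF 89), strlen reports 8 while utf8_length reports 4.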