multibyte - w3toppers.com

Multi-byte safe wordwrap() function for UTF-8

I haven’t found any code that worked for me, so here is what I’ve written. It works for me, though it is probably not the fastest.
    function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false) {
        $lines = explode($break, $str);
        foreach ($lines as &$line) {
            $line = rtrim($line);
            if (mb_strlen($line) <= $width) continue;
            $words …

Are the PHP preg_functions multibyte safe?

PCRE supports UTF-8 out of the box; see the documentation for the ‘u’ modifier. Illustration (\xC3\xA4 is the UTF-8 encoding of the German letter “ä”):
    echo preg_replace('~\w~', '@', "a\xC3\xA4b");
This echoes “@@¤@” because “\xC3” and “\xA4” were treated as distinct symbols.
    echo preg_replace('~\w~u', '@', "a\xC3\xA4b");
With the ‘u’ modifier, this prints “@@@” because “\xC3\xA4” were treated as a …
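The same bytes-versus-characters distinction can be sketched outside PHP. The following is a rough Python analogue, not the PHP behaviour itself: Python's byte patterns treat `\w` as ASCII-only, so the two bytes of "ä" are left untouched rather than one of them being replaced, but the underlying point is the same. The helper names are hypothetical and chosen only for illustration.

```python
import re

RAW = b"a\xc3\xa4b"          # the UTF-8 bytes for "aäb", as in the excerpt
TEXT = RAW.decode("utf-8")   # the same data viewed as three characters


def count_symbols(value):
    """Count how many 'symbols' a regex dot sees in bytes or in text."""
    pattern = rb"." if isinstance(value, bytes) else r"."
    return len(re.findall(pattern, value, re.DOTALL))


def mask_word_chars(value):
    """Replace every \\w match with '@', in byte mode or in text mode."""
    if isinstance(value, bytes):
        return re.sub(rb"\w", b"@", value)
    return re.sub(r"\w", "@", value)
```

In byte mode the regex engine sees four symbols and does not recognise "ä" as a word character; in text mode it sees three symbols and masks all of them, which mirrors the effect of adding the ‘u’ modifier in PHP.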

Ruby 1.9: how can I properly upcase & downcase multibyte strings?

For anybody arriving from Google after searching for “ruby upcase utf8”:
    > "your problem chars here çöğıü Iñtërnâtiônàlizætiøn".mb_chars.upcase.to_s
    => "YOUR PROBLEM CHARS HERE ÇÖĞIÜ IÑTËRNÂTIÔNÀLIZÆTIØN"
The solution is to use mb_chars. Documentation:
https://www.rubydoc.info/gems/activesupport/String#mb_chars-instance_method
https://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html

str_replace() on multibyte strings dangerous?

No, you’re right: using a single-byte string function on a multibyte string can produce unexpected results. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:
    $string = mb_ereg_replace('"', '\\"', $string);
    $string = implode('\\"', mb_split('"', $string));
Edit: here’s an mb_replace implementation using the split-join variant:
    function mb_replace($search, $replace, $subject, &$count=0) {
        if (!is_array($search) && …

Multibyte trim in PHP?

The standard trim function trims a handful of space and space-like characters. These are defined as ASCII characters, which means certain specific bytes in the range 0 to 0100 0000. Proper UTF-8 input will never contain multi-byte characters that are made up of bytes of the form 0xxx xxxx. All the bytes in proper UTF-8 multibyte characters start with 1xxx …
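The claim can be checked with a short sketch. It is written in Python rather than PHP, and it assumes the default strip set documented for PHP's trim() (space, tab, newline, carriage return, NUL, vertical tab). The function names are hypothetical.

```python
# Bytes that PHP's trim() removes by default (assumed from the PHP manual).
TRIM_BYTES = b" \t\n\r\0\x0b"


def byte_trim(data: bytes) -> bytes:
    """Strip the trim bytes from both ends, working purely on bytes."""
    return data.strip(TRIM_BYTES)


def multibyte_bytes_are_high(text: str) -> bool:
    """Return True if every byte of every multi-byte character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True
```

Because no byte of a multi-byte sequence falls in the ASCII range, a byte-level trim cannot cut into the middle of a character, so the trimmed result still decodes as valid UTF-8.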

How does UTF-8 “variable-width encoding” work?

Each byte starts with a few bits that tell you whether it’s a single-byte code point, a multi-byte code point, or a continuation of a multi-byte code point. Like this:
    0xxx xxxx    A single-byte US-ASCII code (from the first 128 characters)
The multi-byte code points each start with a few bits that essentially say “hey, you …
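As a minimal sketch of that idea, the function below classifies a byte by its leading bits. The bit patterns for lead bytes (110x xxxx, 1110 xxxx, 1111 0xxx) and continuation bytes (10xx xxxx) come from the UTF-8 standard, not from the truncated excerpt, and the function names and labels are hypothetical. The example is in Python.

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte of UTF-8 data by inspecting its leading bits."""
    if byte & 0b1000_0000 == 0b0000_0000:
        return "single"          # 0xxx xxxx: one-byte US-ASCII code point
    if byte & 0b1100_0000 == 0b1000_0000:
        return "continuation"    # 10xx xxxx: continues a multi-byte code point
    if byte & 0b1110_0000 == 0b1100_0000:
        return "lead-2"          # 110x xxxx: starts a two-byte code point
    if byte & 0b1111_0000 == 0b1110_0000:
        return "lead-3"          # 1110 xxxx: starts a three-byte code point
    if byte & 0b1111_1000 == 0b1111_0000:
        return "lead-4"          # 1111 0xxx: starts a four-byte code point
    return "invalid"             # 1111 1xxx never appears in valid UTF-8


def describe(text: str) -> list:
    """Encode text as UTF-8 and classify every resulting byte."""
    return [classify_utf8_byte(b) for b in text.encode("utf-8")]
```

For instance, `describe("aä")` yields one "single" byte for "a", then a "lead-2" byte and a "continuation" byte for "ä".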

Printing UTF-8 strings with printf – wide vs. multibyte string literals

    printf("ο Δικαιοπολις εν αγρω εστιν\n");
prints the string literal (a const char*; the special characters are represented as multibyte characters). Although you might see the correct output, there are other problems you may run into when working with non-ASCII characters like these. For example:
    char str[] = "αγρω";
    printf("%zu %zu\n", sizeof(str), strlen(str));
outputs 9 8, since …
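The byte counts in that output can be reproduced with a small sketch, assuming the C source file is saved as UTF-8 so that each of the four Greek letters occupies two bytes. It is written in Python rather than C, and the function name and dictionary keys are hypothetical labels.

```python
def c_string_sizes(text: str) -> dict:
    """Relate the character count to what C's strlen and sizeof would report
    for a UTF-8 encoded char array initialised from this text."""
    encoded = text.encode("utf-8")
    return {
        "characters": len(text),      # code points the reader sees
        "strlen": len(encoded),       # bytes before the terminating NUL
        "sizeof": len(encoded) + 1,   # bytes including the terminating NUL
    }
```

For "αγρω" this gives 4 characters, a strlen of 8, and a sizeof of 9, matching the "9 8" printed by the C snippet.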

Truncate a multibyte String to n chars

Just found out that PHP already has a multibyte truncate function: mb_strimwidth (“Get truncated string with specified width”). It doesn’t obey word boundaries, but it is handy nonetheless!
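To illustrate what character-based truncation means, here is a rough sketch in Python. It is not a port of mb_strimwidth: it counts code points rather than display width, and the function name and the `marker` parameter are hypothetical. Like the PHP function, it ignores word boundaries.

```python
def truncate_chars(text: str, limit: int, marker: str = "") -> str:
    """Cut text to at most `limit` characters, counting code points and not
    bytes. If a marker is given, it counts toward the limit."""
    if len(text) <= limit:
        return text
    keep = max(limit - len(marker), 0)
    # Final slice guards against a marker that is longer than the limit.
    return (text[:keep] + marker)[:limit]
```

Slicing the decoded string never splits a multi-byte character, whereas slicing the raw UTF-8 bytes at the same position could leave a dangling lead byte.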