preg_match and UTF-8 in PHP

Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo mb_strlen(substr($str, 0, $a_matches[0][1]));

Leave a Comment