MySQL Illegal mix of collations

It is helpful to understand the following definitions:

  • A character encoding details how each symbol is represented in binary (and therefore stored in the computer). For example, the symbol é (U+00E9, latin small letter E with acute) is encoded as 0xc3a9 in UTF-8 (which MySQL calls utf8) and 0xe9 in Windows-1252 (which MySQL calls latin1).

  • A character set is the alphabet of symbols that can be represented using a given character encoding. Confusingly, the term is also used to mean the same as character encoding.

  • A collation is an ordering on a character set, so that strings can be compared. For example: MySQL’s latin1_swedish_ci collation treats most accented variations of a character as equivalent to the base character, whereas its latin1_general_ci collation orders them between the base character and the next one but does not treat them as equivalent (there are other, more significant, differences too, such as the order of characters like å, ä, ö and ß). A short sketch illustrating the encoding and collation definitions follows this list.
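
The sketch below assumes a connection whose character set is utf8 (as is the case here), so that the literal 'é' can be represented. It shows that the same symbol maps to different bytes under different encodings, and that whether two strings compare as equal depends on the collation in force:

    SELECT HEX(CONVERT('é' USING utf8))   AS utf8_bytes,    -- 'C3A9'
           HEX(CONVERT('é' USING latin1)) AS latin1_bytes;  -- 'E9'

    SELECT 'A' = 'a' COLLATE utf8_general_ci AS ci_equal,   -- 1: case-insensitive collation
           'A' = 'a' COLLATE utf8_bin        AS bin_equal;  -- 0: binary collation compares code points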

MySQL will decide which collation should be applied to a given expression as documented under Collation of Expressions: in particular, the collation of a column takes precedence over that of a string literal.
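
One quick way to see those precedence rules at work is MySQL’s COERCIBILITY() function (lower values take precedence). The table and column names below are the ones from your query, and a utf8 connection is assumed:

    SELECT COERCIBILITY(username)             AS column_value,    -- 2: implicit (column) collation
           COERCIBILITY('some literal')       AS string_literal,  -- 4: coercible literal
           COERCIBILITY('x' COLLATE utf8_bin) AS explicit_clause  -- 0: explicit COLLATE clause
    FROM fos_user LIMIT 1;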

The WHERE clause of your query compares the following strings:

  1. a value in fos_user.username, encoded in the column’s character set (Windows-1252) and expressing a preference for its collation latin1_swedish_ci (with a coercibility value of 2); with

  2. the string literal 'Nrv⧧Kasi', encoded in the connection’s character set (UTF-8, as configured by Doctrine) and expressing a preference for the connection’s collation utf8_general_ci (with a coercibility value of 4).

Since the first of these strings has a lower coercibility value than the second, MySQL attempts to perform the comparison using that string’s collation: latin1_swedish_ci. To do so, it must convert the second string to latin1, but since that string contains a character which does not exist in the latin1 (Windows-1252) character set, the conversion fails: hence the “Illegal mix of collations” error.
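
To confirm which character sets and collations are in play on each side of the comparison, one can run something along these lines (the table and column names are taken from your query):

    SHOW VARIABLES LIKE 'character_set%';
    SHOW VARIABLES LIKE 'collation%';

    SELECT character_set_name, collation_name
    FROM information_schema.columns
    WHERE table_name = 'fos_user' AND column_name = 'username';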


Warning

One should pause for a moment to consider how the column is currently encoded: you are attempting to filter for records where fos_user.username is equal to a string that contains a character which cannot exist in that column!

If you believe that the column does contain such characters, then you probably wrote to the column whilst the connection character encoding was set to something (e.g. latin1) that caused MySQL to interpret the received byte sequence as characters which are all in the Windows-1252 character set.

If this is the case, before continuing any further you should fix your data!

  1. convert such columns to the character encoding that was used on data insertion, if different to the incumbent encoding:

    ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET foo;
    
  2. drop the encoding information associated with such columns by converting them to the binary character set:

    ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET binary;
    
  3. associate with such columns the encoding in which data was actually transmitted by converting them to the relevant character set:

    ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET bar;
    

Note that, if converting from a multi-byte encoding, you may need to increase the size of the column (or even change its type) in order to accommodate the maximum possible length of the converted string.
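
For instance, in the scenario described above (UTF-8 bytes received over a latin1 connection and stored, unchanged, in a latin1 column), step 1 does not apply and steps 2 and 3 combine into the sketch below. The scenario and the target character set are assumptions that you must verify against your own data first:

    -- Hypothetical repair for UTF-8 bytes mislabelled as latin1: first discard
    -- the incorrect encoding label, then re-label the stored bytes with the
    -- encoding in which they were actually transmitted.
    ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET binary;
    ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET utf8;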


Once one is certain that the columns are correctly encoded, one could force the comparison to be conducted using a Unicode collation by either—

  • explicitly converting the value fos_user.username to a Unicode character set:

    WHERE CONVERT(fos_user.username USING utf8) = ?
    
  • forcing the string literal to have a lower coercibility value than the column (which will cause the column’s value to be implicitly converted to UTF-8):

    WHERE fos_user.username = ? COLLATE utf8_general_ci
    

Or one could, as you say, permanently convert the column(s) to a Unicode encoding and set their collation appropriately.

Can I manually change the collation to utf8_general_ci for all my tables without any complications/precautions?

The principal consideration is that Unicode encodings take up more space than single-byte character sets, so:

  • more storage may be required;

  • comparisons may be slower; and

  • index prefix lengths may need to be adjusted (note that the maximum is in bytes, so may represent fewer characters than previously; see the sketch after this list).
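
For example (assuming InnoDB’s traditional 767-byte limit on index key prefixes and MySQL’s three-byte utf8 encoding), a prefix that could span 767 characters under latin1 can span at most 255 characters under utf8. The table below is purely hypothetical, to illustrate the arithmetic:

    -- Hypothetical table: 255 characters × 3 bytes = 765 bytes, which still
    -- fits within a 767-byte index key prefix; 767 utf8 characters would not.
    CREATE TABLE prefix_example (body TEXT CHARACTER SET utf8, KEY (body(255)));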

Also, be aware that, as documented under ALTER TABLE Syntax:

To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:

    ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;

For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length will not fit in a TEXT column’s length bytes, so MySQL will convert the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.

To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns.
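
Following that advice, a per-column conversion of the column in question might look like the sketch below; the length and the choice of utf8_general_ci are assumptions to be adapted to your schema:

    ALTER TABLE fos_users
      MODIFY username VARCHAR(123)
      CHARACTER SET utf8 COLLATE utf8_general_ci;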
