How do I sort unicode strings alphabetically in Python?

IBM’s ICU library does that (and a lot more). It has Python bindings: PyICU. Update: The core difference in sorting between ICU and locale.strcoll is that ICU uses the full Unicode Collation Algorithm while strcoll uses ISO 14651. The differences between those two algorithms are briefly summarized here: http://unicode.org/faq/collation.html#13. These are rather exotic special cases, …
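
As a rough illustration of the two approaches mentioned above, here is a minimal Python sketch. It assumes PyICU may or may not be installed: if it is, it sorts with an ICU collator; otherwise it falls back to locale.strcoll from the standard library. The function name sort_unicode and the default locale "de_DE" are made up for this example.

```python
import locale
from functools import cmp_to_key


def sort_unicode(words, locale_name="de_DE"):
    """Sort strings alphabetically, preferring ICU collation when available.

    `sort_unicode` and the default `locale_name` are illustrative choices,
    not part of either library.
    """
    try:
        import icu  # provided by the PyICU package
    except ImportError:
        # Fallback: locale.strcoll compares using the current LC_COLLATE
        # setting. Unless locale.setlocale() was called earlier, that is
        # the "C" locale, which orders by code point.
        return sorted(words, key=cmp_to_key(locale.strcoll))

    # ICU path: build a collator for the requested locale and sort by the
    # binary sort keys it produces.
    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    return sorted(words, key=collator.getSortKey)


words = ["zebra", "Äpfel", "apfel"]
result = sort_unicode(words)
print(result)
```

The resulting order depends on which path runs and on the locale in effect, so no expected output is shown here.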

How do I see what character set a MySQL database / table / column is?

Here’s how I’d do it. For schemas (or databases; they are synonyms): SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = 'schemaname'; For tables: SELECT CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = 'schemaname' AND T.table_name = 'tablename'; For columns: SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = 'schemaname' AND table_name …

Efficiently replace all accented characters in a string?

Here is a more complete version based on the Unicode standard. var Latinise={};Latinise.latin_map={"Á":"A", "Ă":"A", "Ắ":"A", "Ặ":"A", "Ằ":"A", "Ẳ":"A", "Ẵ":"A", "Ǎ":"A", "Â":"A", "Ấ":"A", "Ậ":"A", "Ầ":"A", "Ẩ":"A", "Ẫ":"A", "Ä":"A", "Ǟ":"A", "Ȧ":"A", "Ǡ":"A", "Ạ":"A", "Ȁ":"A", "À":"A", "Ả":"A", "Ȃ":"A", "Ā":"A", "Ą":"A", "Å":"A", "Ǻ":"A", "Ḁ":"A", "Ⱥ":"A", "Ã":"A", "Ꜳ":"AA", "Æ":"AE", "Ǽ":"AE", "Ǣ":"AE", "Ꜵ":"AO", "Ꜷ":"AU", "Ꜹ":"AV", "Ꜻ":"AV", "Ꜽ":"AY", "Ḃ":"B", "Ḅ":"B", "Ɓ":"B", "Ḇ":"B", …
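
For comparison, here is a sketch of a different technique in Python, not the lookup-table method from the answer above: decompose each character with Unicode normalization (NFD) and drop the combining marks. Letters with no decomposition, such as Æ or Ⱥ, still need a small table; the EXTRA_MAP below is a hypothetical subset borrowed from the mapping above, and latinise is an illustrative name.

```python
import unicodedata

# Hypothetical subset: letters that NFD does not break into base + accent.
EXTRA_MAP = str.maketrans({
    "Æ": "AE",
    "Ⱥ": "A",
    "Ɓ": "B",
    "Ꜳ": "AA",
})


def latinise(text):
    """Replace accented Latin letters with unaccented equivalents."""
    # Step 1: split precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop the combining marks (characters with a nonzero
    # canonical combining class).
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Step 3: map the remaining special letters through the small table.
    mapped = stripped.translate(EXTRA_MAP)
    # Step 4: recompose anything that was decomposed but not accented.
    return unicodedata.normalize("NFC", mapped)


print(latinise("Ắ Ǟ Ą Å Ǽ Ⱥ Ḃ"))  # prints "A A A A AE A B"
```

This keeps the table short, at the cost of relying on Unicode decomposition data rather than an explicit list, so the two approaches can differ on characters that the full map handles specially.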

What’s the difference between utf8_general_ci and utf8_unicode_ci?

For those people still arriving at this question in 2020 or later, there are newer options that may be better than both of these. For example, utf8mb4_0900_ai_ci. All these collations are for the UTF-8 character encoding. The differences are in how text is sorted and compared. _unicode_ci and _general_ci are two different sets of rules …