Parsing through Arabic / RTL text from left to right

As your string currently stands, the word لطيفة is stored before the word اليوم. The fact that اليوم is displayed “first” (that is, further to the left) is simply the (correct) result of the Unicode Bidirectional Algorithm laying out the text for display.

That is: the string you start with (“Test:لطيفة;اليوم;a;b”) is the result of the user entering “Test:”, then لطيفة, then “;”, then اليوم, and then “;a;b”. Thus, the way C# splits it does in fact mirror the order in which the string was created. That order is simply not reflected in the display of the string, because the two consecutive Arabic words, together with the semicolon between them, are treated as a single right-to-left run when they are displayed.
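To see the storage order without any display reordering getting in the way, one option is to print the code points of each split part. The following is only a minimal sketch: the class name LogicalOrderDemo and the helper CodeUnits are made up for illustration, and the Arabic words are written as \u escapes so that the order in the source file cannot be misread.

```csharp
using System;

static class LogicalOrderDemo
{
    // Hypothetical helper: formats each UTF-16 code unit of a string as U+XXXX.
    public static string CodeUnits(string s)
    {
        var parts = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            parts[i] = "U+" + ((int)s[i]).ToString("X4");
        return string.Join(" ", parts);
    }

    static void Main()
    {
        // Same content as the string in the question, in storage order:
        // "Test:", then U+0644 U+0637 U+064A U+0641 U+0629, then ";",
        // then U+0627 U+0644 U+064A U+0648 U+0645, then ";a;b".
        string s = "Test:\u0644\u0637\u064A\u0641\u0629;\u0627\u0644\u064A\u0648\u0645;a;b";

        string[] spl = s.Split(';');
        for (int i = 0; i < spl.Length; i++)
            Console.WriteLine($"spl[{i}]: {CodeUnits(spl[i])}");
    }
}
```

The second line of output lists U+0627 U+0644 U+064A U+0648 U+0645, which shows that the word drawn furthest to the left is in fact the second field in memory.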

If you’d like a string to display Arabic words in left-to-right order with semicolons in between, while also storing the words in that same order, then you should put a Left-to-Right mark (U+200E) after the semicolon. This will effectively section off each Arabic word as its own unit, and the Bidirectional Algorithm will then treat each word separately.
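If you build such strings in code rather than typing them, the mark can be added while joining. This is a rough sketch, not a definitive approach: the class LrmJoin and the helper JoinForDisplay are hypothetical names, and it assumes the text is shown in a left-to-right paragraph. It also places a mark after every semicolon, including those before Latin fields, where it has no visible effect.

```csharp
using System;

static class LrmJoin
{
    const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK

    // Hypothetical helper: joins fields with ';' and puts an LRM after each
    // separator so that neighbouring Arabic fields are laid out as separate runs.
    public static string JoinForDisplay(params string[] fields)
    {
        return string.Join(";" + Lrm, fields);
    }

    static void Main()
    {
        // Fields in the order they should both be stored and be displayed.
        string s = JoinForDisplay(
            "Test:\u0627\u0644\u064A\u0648\u0645",  // "Test:" followed by the first Arabic word
            "\u0644\u0637\u064A\u0641\u0629",       // the second Arabic word
            "a",
            "b");

        Console.WriteLine(s.Length);
    }
}
```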

For instance, the following code begins with a string that displays identically to the one you use (with the addition of a single Left-to-Right mark), yet it splits the way you expect: spl[0] = “Test:اليوم”, and spl[1] = “لطيفة” (preceded by the invisible mark, which stays attached to it):

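static void Main(string[] args) {
    // \u200E is the Left-to-Right mark, placed right after the first semicolon.
    string s = "Test:اليوم;\u200Eلطيفة;a;b";
    string[] spl = s.Split(';');
    // spl[0] holds "Test:" plus the first Arabic word; spl[1] holds the mark plus the second word.
}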
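Because Split only removes the semicolons, the Left-to-Right mark remains at the start of the second field. If the fields are later compared or stored, it may be worth trimming it off. The sketch below shows one way to do that; the class SplitAndClean and the helper SplitFields are illustrative names, not part of any library.

```csharp
using System;
using System.Linq;

static class SplitAndClean
{
    const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK

    // Hypothetical helper: splits on ';' and strips leading and trailing
    // Left-to-Right marks from each field.
    public static string[] SplitFields(string s)
    {
        return s.Split(';').Select(p => p.Trim(Lrm)).ToArray();
    }

    static void Main()
    {
        // Same string as in the answer's example, written with escapes.
        string s = "Test:\u0627\u0644\u064A\u0648\u0645;\u200E\u0644\u0637\u064A\u0641\u0629;a;b";

        string[] raw = s.Split(';');
        string[] clean = SplitFields(s);

        Console.WriteLine(raw[1].Length);   // 6: the mark plus five letters
        Console.WriteLine(clean[1].Length); // 5: the letters only
    }
}
```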
