This content has been automatically translated from Ukrainian.
Sorting text seems like a simple task until you come across the Ukrainian alphabet in Ruby.
["ґ", "г", "є", "е", "і", "ї", "й", "и"].sort # => ["г", "е", "и", "й", "є", "і", "ї", "ґ"]
As you can see, "ґ" ended up at the very end. But in the Ukrainian alphabet it belongs near the beginning: "ґ" comes right after "г".
Unicode ≠ Alphabet
By default, Ruby sorts strings by Unicode code points, that is, according to the technical order of the characters in the Unicode table, and not according to the alphabetical order of the letters in the Ukrainian language.
# The 33 letters of the Ukrainian alphabet, in alphabetical order
uk_alphabet = "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"

# Print each letter with its decimal and hexadecimal code point
puts "Uppercase letters:"
uk_alphabet.each_char do |char|
  dec = char.ord
  hex = dec.to_s(16).upcase.rjust(4, '0')
  puts "#{char} -> dec: #{dec}, hex: U+#{hex}"
end

puts "\nLowercase letters:"
uk_alphabet.downcase.each_char do |char|
  dec = char.ord
  hex = dec.to_s(16).upcase.rjust(4, '0')
  puts "#{char} -> dec: #{dec}, hex: U+#{hex}"
end
Uppercase letters:
А -> dec: 1040, hex: U+0410
Б -> dec: 1041, hex: U+0411
В -> dec: 1042, hex: U+0412
Г -> dec: 1043, hex: U+0413
Ґ -> dec: 1168, hex: U+0490
Д -> dec: 1044, hex: U+0414
Е -> dec: 1045, hex: U+0415
Є -> dec: 1028, hex: U+0404
Ж -> dec: 1046, hex: U+0416
З -> dec: 1047, hex: U+0417
И -> dec: 1048, hex: U+0418
І -> dec: 1030, hex: U+0406
Ї -> dec: 1031, hex: U+0407
Й -> dec: 1049, hex: U+0419
К -> dec: 1050, hex: U+041A
Л -> dec: 1051, hex: U+041B
М -> dec: 1052, hex: U+041C
Н -> dec: 1053, hex: U+041D
О -> dec: 1054, hex: U+041E
П -> dec: 1055, hex: U+041F
Р -> dec: 1056, hex: U+0420
С -> dec: 1057, hex: U+0421
Т -> dec: 1058, hex: U+0422
У -> dec: 1059, hex: U+0423
Ф -> dec: 1060, hex: U+0424
Х -> dec: 1061, hex: U+0425
Ц -> dec: 1062, hex: U+0426
Ч -> dec: 1063, hex: U+0427
Ш -> dec: 1064, hex: U+0428
Щ -> dec: 1065, hex: U+0429
Ь -> dec: 1068, hex: U+042C
Ю -> dec: 1070, hex: U+042E
Я -> dec: 1071, hex: U+042F

Lowercase letters:
а -> dec: 1072, hex: U+0430
б -> dec: 1073, hex: U+0431
в -> dec: 1074, hex: U+0432
г -> dec: 1075, hex: U+0433
ґ -> dec: 1169, hex: U+0491
д -> dec: 1076, hex: U+0434
е -> dec: 1077, hex: U+0435
є -> dec: 1108, hex: U+0454
ж -> dec: 1078, hex: U+0436
з -> dec: 1079, hex: U+0437
и -> dec: 1080, hex: U+0438
і -> dec: 1110, hex: U+0456
ї -> dec: 1111, hex: U+0457
й -> dec: 1081, hex: U+0439
к -> dec: 1082, hex: U+043A
л -> dec: 1083, hex: U+043B
м -> dec: 1084, hex: U+043C
н -> dec: 1085, hex: U+043D
о -> dec: 1086, hex: U+043E
п -> dec: 1087, hex: U+043F
р -> dec: 1088, hex: U+0440
с -> dec: 1089, hex: U+0441
т -> dec: 1090, hex: U+0442
у -> dec: 1091, hex: U+0443
ф -> dec: 1092, hex: U+0444
х -> dec: 1093, hex: U+0445
ц -> dec: 1094, hex: U+0446
ч -> dec: 1095, hex: U+0447
ш -> dec: 1096, hex: U+0448
щ -> dec: 1097, hex: U+0449
ь -> dec: 1100, hex: U+044C
ю -> dec: 1102, hex: U+044E
я -> dec: 1103, hex: U+044F
Consequences of Unicode order
Because the Unicode code points of the Ukrainian letters are not in alphabetical order, the standard Ruby sort gives the wrong order for Ukrainian words, especially those containing the letters "ґ", "є", "і", "ї", whose code points lie outside the main А–Я block.
My solution: ukrainian_sort, a Ruby gem for correct sorting of Ukrainian words
To avoid this problem in my side projects, I created the ukrainian_sort library, which implements sorting according to the official Ukrainian alphabet.
It compares words letter by letter using its own alphabet order, in which, for example, "ґ" follows "г", "є" follows "е", and "і", "ї", "й" follow "и".
require 'ukrainian_sort'

# crow, pear, raccoon, apple
words = ["ґава", "груша", "єнот", "яблуко"]
sorted_words = UkrainianSort.sort(words) # => ["груша", "ґава", "єнот", "яблуко"]
puts sorted_words

And this is how Ruby sorts the same words "out of the box":
words.sort # => ["груша", "яблуко", "єнот", "ґава"]
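The letter-by-letter comparison with a custom order can be sketched roughly as follows. This is only an illustration of the idea, not the actual implementation of ukrainian_sort: the names UK_ALPHABET, UK_ORDER, uk_sort_key and uk_sort are hypothetical, and the handling of characters outside the alphabet (such as the apostrophe) is an assumption made for this example.

```ruby
# Hypothetical sketch; the real gem may work differently.
UK_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя".freeze

# Map each letter to its position in the Ukrainian alphabet: {"а" => 0, "б" => 1, ...}
UK_ORDER = UK_ALPHABET.each_char.with_index.to_h.freeze

# Build a sort key: an array of alphabet positions, one per character.
# Assumption: characters outside the alphabet go after all letters,
# ordered among themselves by code point.
def uk_sort_key(word)
  word.downcase.each_char.map do |char|
    UK_ORDER.fetch(char) { UK_ALPHABET.length + char.ord }
  end
end

# Arrays of integers are compared element by element,
# which gives the letter-by-letter comparison.
def uk_sort(words)
  words.sort_by { |word| uk_sort_key(word) }
end

words = ["ґава", "груша", "єнот", "яблуко"]
p uk_sort(words) # => ["груша", "ґава", "єнот", "яблуко"]
```

Using sort_by computes each key once per word, which is usually cheaper than looking up positions inside a comparison block.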
Why does everything work "normally" in English?
The English alphabet is a simple set of Latin letters A to Z that occupy contiguous Unicode ranges (U+0041–U+005A for uppercase and U+0061–U+007A for lowercase letters). This means that the character order in Unicode matches the alphabetical order of the English language.
Therefore, the standard Ruby sort (Array#sort or String#<=>), which compares characters by their Unicode code points, works correctly for English words, at least when they all use the same letter case.
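A small sketch of what this looks like in practice. It also shows one caveat: because the whole uppercase range comes before the lowercase range, mixed-case words need a case-folded sort key. The sample words are made up for the illustration.

```ruby
words = ["pear", "apple", "Banana"]

# Default sort compares code points, so every uppercase letter
# (U+0041..U+005A) comes before every lowercase one (U+0061..U+007A).
default_sorted = words.sort
p default_sorted # => ["Banana", "apple", "pear"]

# Folding case in the sort key gives the expected dictionary order.
folded_sorted = words.sort_by(&:downcase)
p folded_sorted # => ["apple", "Banana", "pear"]
```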
Which other languages have this sorting problem?
The short answer: many of them. Plenty of languages have more complex alphabets where the order of letters in Unicode does not correspond to the linguistic order, for example:
- German: ä, ö, ü, ß have a special order;
- Swedish: adds the letters å, ä, ö after z;
- Czech, Slovak, Polish: have letters with diacritics whose Unicode positions do not follow the alphabet;
- French, Spanish, Portuguese: have their own sorting rules for apostrophes, tildes, and accents.
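As a quick illustration with Swedish, where the alphabet ends with å, ä, ö: in Unicode, ä (U+00E4) comes before å (U+00E5), so the default sort swaps them. The sample words are chosen only for this example.

```ruby
# Swedish alphabet order ends with: ... x, y, z, å, ä, ö
words = ["öl", "zebra", "äpple", "ål"]

# Default sort follows code points: z (U+007A), ä (U+00E4), å (U+00E5), ö (U+00F6)
default_sorted = words.sort
p default_sorted # => ["zebra", "äpple", "ål", "öl"]

# The order a Swedish dictionary would use, for comparison
expected_swedish = ["zebra", "ål", "äpple", "öl"]
p default_sorted == expected_swedish # => false
```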
Complexity of the problem
Unicode code points give every character a universal number, but they do not define a linguistic sorting order. Sorting a language correctly often requires special collation rules, which take into account:
- Letter order,
- Special symbols,
- Accents,
- Whether uppercase and lowercase letters are treated as the same,
- Ligatures and other language features.
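To make a couple of these points concrete, here is a minimal sketch of a sort key that ignores case and accents, using Ruby's built-in String#unicode_normalize. The helper name folded_key is hypothetical, and this is far from a full collation algorithm.

```ruby
# Build a simplified sort key: fold case, then decompose characters (NFD)
# and drop the combining marks, so "é" is compared as "e".
def folded_key(word)
  word.downcase.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

words = ["résumé", "Zebra", "apple", "Résister"]

default_sorted = words.sort
p default_sorted # => ["Résister", "Zebra", "apple", "résumé"]

folded_sorted = words.sort_by { |word| folded_key(word) }
p folded_sorted # => ["apple", "Résister", "résumé", "Zebra"]
```

Such rules are language-specific. The same folding would be wrong for Ukrainian, because "й" and "ї" decompose into "и" and "і" plus a combining mark, although they are separate letters of the alphabet.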
How is it usually solved?
- Locale-aware collators (for example, ICU, International Components for Unicode), although I could not quickly find a ready-made solution that suited me.
- Custom sorting libraries (gems, packages).
- Defining the letter order manually for a specific language.
For English, the Unicode order of characters matches the alphabetical order, so sort works without problems. But for most other languages, linguistic sorting rules must be taken into account separately, because the Unicode table is only a code table, not a sorting algorithm.
ukrainian_sort (GitHub / RubyGems) is one of the first gems I have published publicly; I usually keep my solutions in private repositories. So feel free to open issues and pull requests. Most likely, there is still something to fix.