Why Ruby incorrectly sorts Ukrainian letters — and how to fix it

24 Jul 21:14



This content has been automatically translated from Ukrainian.


Sorting text seems like a simple task until you come across the Ukrainian alphabet in Ruby.

["ґ", "г", "є", "е", "і", "ї", "й", "и"].sort

=> ["г", "е", "и", "й", "є", "і", "ї", "ґ"]

As you can see, "ґ" ended up at the very end. But in the Ukrainian alphabet it should come right after "г".

Unicode ≠ Alphabet

By default, Ruby sorts strings by Unicode code points, that is, by the technical order of the characters in the Unicode table, not by the alphabetical order of the letters in the Ukrainian language.

# The 33 letters of the Ukrainian alphabet in their alphabetical order
uk_alphabet = "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"
puts "Uppercase letters:"
uk_alphabet.each_char do |char|
  dec = char.ord
  hex = dec.to_s(16).upcase.rjust(4, '0')
  puts "#{char} -> dec: #{dec}, hex: U+#{hex}"
end

puts "\nLowercase letters:"
uk_alphabet.downcase.each_char do |char|
  dec = char.ord
  hex = dec.to_s(16).upcase.rjust(4, '0')
  puts "#{char} -> dec: #{dec}, hex: U+#{hex}"
end

Uppercase letters:
А -> dec: 1040, hex: U+0410
Б -> dec: 1041, hex: U+0411
В -> dec: 1042, hex: U+0412
Г -> dec: 1043, hex: U+0413
Ґ -> dec: 1168, hex: U+0490
Д -> dec: 1044, hex: U+0414
Е -> dec: 1045, hex: U+0415
Є -> dec: 1028, hex: U+0404
Ж -> dec: 1046, hex: U+0416
З -> dec: 1047, hex: U+0417
И -> dec: 1048, hex: U+0418
І -> dec: 1030, hex: U+0406
Ї -> dec: 1031, hex: U+0407
Й -> dec: 1049, hex: U+0419
К -> dec: 1050, hex: U+041A
Л -> dec: 1051, hex: U+041B
М -> dec: 1052, hex: U+041C
Н -> dec: 1053, hex: U+041D
О -> dec: 1054, hex: U+041E
П -> dec: 1055, hex: U+041F
Р -> dec: 1056, hex: U+0420
С -> dec: 1057, hex: U+0421
Т -> dec: 1058, hex: U+0422
У -> dec: 1059, hex: U+0423
Ф -> dec: 1060, hex: U+0424
Х -> dec: 1061, hex: U+0425
Ц -> dec: 1062, hex: U+0426
Ч -> dec: 1063, hex: U+0427
Ш -> dec: 1064, hex: U+0428
Щ -> dec: 1065, hex: U+0429
Ь -> dec: 1068, hex: U+042C
Ю -> dec: 1070, hex: U+042E
Я -> dec: 1071, hex: U+042F

Lowercase letters:
а -> dec: 1072, hex: U+0430
б -> dec: 1073, hex: U+0431
в -> dec: 1074, hex: U+0432
г -> dec: 1075, hex: U+0433
ґ -> dec: 1169, hex: U+0491
д -> dec: 1076, hex: U+0434
е -> dec: 1077, hex: U+0435
є -> dec: 1108, hex: U+0454
ж -> dec: 1078, hex: U+0436
з -> dec: 1079, hex: U+0437
и -> dec: 1080, hex: U+0438
і -> dec: 1110, hex: U+0456
ї -> dec: 1111, hex: U+0457
й -> dec: 1081, hex: U+0439
к -> dec: 1082, hex: U+043A
л -> dec: 1083, hex: U+043B
м -> dec: 1084, hex: U+043C
н -> dec: 1085, hex: U+043D
о -> dec: 1086, hex: U+043E
п -> dec: 1087, hex: U+043F
р -> dec: 1088, hex: U+0440
с -> dec: 1089, hex: U+0441
т -> dec: 1090, hex: U+0442
у -> dec: 1091, hex: U+0443
ф -> dec: 1092, hex: U+0444
х -> dec: 1093, hex: U+0445
ц -> dec: 1094, hex: U+0446
ч -> dec: 1095, hex: U+0447
ш -> dec: 1096, hex: U+0448
щ -> dec: 1097, hex: U+0449
ь -> dec: 1100, hex: U+044C
ю -> dec: 1102, hex: U+044E
я -> dec: 1103, hex: U+044F

Consequences of Unicode order

Because the Unicode code points are not in alphabetical order, the standard Ruby sort gives the wrong order for Ukrainian words, especially for the letters "ґ", "є", "і", "ї", whose code points lie outside the main Cyrillic block.
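A quick way to see which letters are affected is to check their code points against the contiguous а..я block (U+0430..U+044F). This is only an illustrative sketch; the variable names `uk_lower`, `outside_block` and `by_codepoint` are made up for this example.

```ruby
# Lowercase Ukrainian alphabet in its alphabetical order (33 letters).
uk_lower = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"

# Letters whose code points fall outside the contiguous а..я block.
outside_block = uk_lower.each_char.reject { |ch| ch.ord.between?(0x0430, 0x044F) }
p outside_block
# => ["ґ", "є", "і", "ї"]

# Sorting the alphabet by code point pushes exactly those letters to the end.
by_codepoint = uk_lower.chars.sort.join
puts by_codepoint
# => абвгдежзийклмнопрстуфхцчшщьюяєіїґ
```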

My solution: ukrainian_sort, a Ruby gem for correct sorting of Ukrainian words

To avoid this problem, I created the ukrainian_sort library, which implements sorting according to the official Ukrainian alphabet. I made it for use in my own side projects.

It compares words letter by letter using its own order:

["г", "ґ", "е", "є", "и", "і", "ї", "й"]

require 'ukrainian_sort'

words = ["ґава", "груша", "єнот", "яблуко"]
sorted_words = UkrainianSort.sort(words)
p sorted_words
# => ["груша", "ґава", "єнот", "яблуко"]

And this is how Ruby sorts "out of the box":

words.sort
=> ["груша", "яблуко", "єнот", "ґава"]
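The idea of comparing words letter by letter against a custom alphabet can be sketched in plain Ruby. This is not the gem's actual implementation, only a minimal illustration under simplifying assumptions: the names `UK_ALPHABET`, `UK_RANK`, `uk_sort_key` and `uk_sort` are hypothetical, case is ignored, and any character outside the alphabet (apostrophes, digits, Latin letters) is simply placed after all Ukrainian letters.

```ruby
# Hypothetical sketch, not the ukrainian_sort gem's code.
UK_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя".freeze

# Map each letter to its position in the alphabet: {"а" => 0, "б" => 1, ...}
UK_RANK = UK_ALPHABET.each_char.with_index.to_h.freeze

# Turn a word into an array of alphabet positions.
# Unknown characters get a rank past the end of the alphabet.
def uk_sort_key(word)
  word.downcase.each_char.map do |ch|
    UK_RANK.fetch(ch) { UK_ALPHABET.size + ch.ord }
  end
end

# Arrays of integers compare element by element, so sort_by does the rest.
def uk_sort(words)
  words.sort_by { |word| uk_sort_key(word) }
end

words = ["ґава", "груша", "єнот", "яблуко"]
p uk_sort(words)
# => ["груша", "ґава", "єнот", "яблуко"]
```

Building the key once per word with `sort_by` avoids recomputing letter positions on every comparison, which matters on longer lists.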

Why does everything work "normally" in English?

The English alphabet is a simple set of Latin letters A to Z that occupy contiguous Unicode ranges (U+0041–U+005A for uppercase and U+0061–U+007A for lowercase letters). This means that, within one case, the character order in Unicode matches the alphabetical order of the English language.

Therefore, the standard Ruby sorting (Array#sort or String#<=>), which compares characters by their Unicode codes, works correctly for English words.
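A small illustration of this, with made-up word lists. One caveat worth keeping in mind: all uppercase Latin letters come before all lowercase ones in Unicode, so mixed-case input still needs a case-insensitive key.

```ruby
# Same-case English words: code point order equals alphabetical order.
words = %w[pear apple banana]
p words.sort
# => ["apple", "banana", "pear"]

# Mixed case: "Z" (U+005A) sorts before "a" (U+0061).
mixed = %w[pear Zebra apple]
p mixed.sort
# => ["Zebra", "apple", "pear"]

# Comparing on a downcased key restores the expected order.
p mixed.sort_by(&:downcase)
# => ["apple", "pear", "Zebra"]
```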

In what other languages is there a sorting problem?

The short answer: in many of them. Plenty of languages have alphabets where the order of letters in Unicode does not match the linguistic order, for example:

German: ä, ö, ü, ß have a special order;
Swedish: adds the letters å, ä, ö after z;
Czech, Slovak, Polish: have letters with diacritics that do not follow the alphabet sequentially in Unicode;
French, Spanish, Portuguese: different sorting rules for apostrophes, tildes, accents.
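The German and Swedish cases from the list above are easy to reproduce with Ruby's default sort. The sample words are arbitrary, and the accented letters are written as escapes so that the precomposed code points are unambiguous.

```ruby
# German: "Äpfel" lands after "Zebra", because Ä is U+00C4 and Z is U+005A,
# while German dictionaries sort Ä together with A.
german = ["Zebra", "Apfel", "\u00C4pfel"] # "\u00C4" is Ä
p german.sort
# => ["Apfel", "Zebra", "Äpfel"]

# Swedish: the letters do end up after z, but as ä, å, ö (U+00E4, U+00E5, U+00F6),
# whereas the Swedish alphabet ends with å, ä, ö.
swedish = ["\u00F6", "\u00E4", "\u00E5", "z"] # ö, ä, å, z
p swedish.sort
# => ["z", "ä", "å", "ö"]
```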

Difficulty of the question

Unicode assigns each character a universal number, but code point order by itself does not define a linguistic sorting order. Correct sorting for a given language usually requires special collation rules, which take into account:

Letter order,
Special symbols,
Accents,
Whether to treat upper and lower case letters as the same,
Ligatures and other language features.
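Two of these rules, letter order and case handling, can be combined into a two-level sort key. The sketch below is only an illustration under stated assumptions: `COLLATION_ALPHABET`, `COLLATION_RANK` and `collation_key` are hypothetical names, lowercase is arbitrarily placed before uppercase on ties, and accents, special symbols and ligatures are not handled at all.

```ruby
# Hypothetical two-level collation key: letters first, case as a tie-breaker.
COLLATION_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя".freeze
COLLATION_RANK = COLLATION_ALPHABET.each_char.with_index.to_h.freeze

def collation_key(word)
  chars = word.chars
  # Level 1: alphabet position of each letter, ignoring case.
  # Characters outside the alphabet all share one rank after "я".
  primary = chars.map { |ch| COLLATION_RANK.fetch(ch.downcase, COLLATION_ALPHABET.size) }
  # Level 2: 0 for lowercase, 1 for uppercase; only matters when level 1 is equal.
  case_level = chars.map { |ch| ch == ch.downcase ? 0 : 1 }
  [primary, case_level]
end

words = ["Яна", "ґанок", "Ґанок", "єнот"]

p words.sort_by { |w| collation_key(w) }
# => ["ґанок", "Ґанок", "єнот", "Яна"]

p words.sort
# => ["Яна", "єнот", "Ґанок", "ґанок"]
```

Real collation implementations use more levels and per-language tailoring; the nested-array key only shows why a single code point comparison is not enough.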

How is it usually solved?

Localized collators (for example, ICU, International Components for Unicode), although I could not quickly find a ready-made solution that suited me.
Custom sorting libraries (gems, packages).
Manually defining the order for a specific language.

For English, the Unicode sequence of characters matches the alphabetical order, so sort works without problems. But for most other languages, linguistic sorting rules must be taken into account separately, because the Unicode table is only a code table, not a sorting algorithm.

ukrainian_sort (GitHub / RubyGems) is one of the first gems I have published publicly. I usually keep my solutions in private repositories. So feel free to open issues and pull requests. Most likely, there is still something to fix.
