Sorting text seems like a simple task — until you encounter the Ukrainian alphabet in Ruby.
["ґ", "г", "є", "е", "і", "и", "ї", "й"].sort

=> ["г", "е", "и", "й", "є", "і", "ї", "ґ"]
As we can see, "ґ" ended up at the very end. But in the Ukrainian alphabet, "ґ" comes immediately after "г".

Unicode ≠ Alphabet

By default, Ruby sorts strings by Unicode code points, that is, by the technical order of characters in the Unicode table, not by the alphabetical order of letters in the Ukrainian language. The following script prints the code point of every Ukrainian letter:
uk_alphabet = "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"
puts "Uppercase letters:"
uk_alphabet.each_char do |char|
  dec = char.ord
  hex = dec.to_s(16).upcase.rjust(4, '0')
  puts "#{char} -> dec: #{dec}, hex: U+#{hex}"
end

puts "\nLowercase letters:"
uk_alphabet.downcase.each_char do |char|
  dec = char.ord
  hex = dec.to_s(16).upcase.rjust(4, '0')
  puts "#{char} -> dec: #{dec}, hex: U+#{hex}"
end
Uppercase letters:
А -> dec: 1040, hex: U+0410
Б -> dec: 1041, hex: U+0411
В -> dec: 1042, hex: U+0412
Г -> dec: 1043, hex: U+0413
Ґ -> dec: 1168, hex: U+0490
Д -> dec: 1044, hex: U+0414
Е -> dec: 1045, hex: U+0415
Є -> dec: 1028, hex: U+0404
Ж -> dec: 1046, hex: U+0416
З -> dec: 1047, hex: U+0417
И -> dec: 1048, hex: U+0418
І -> dec: 1030, hex: U+0406
Ї -> dec: 1031, hex: U+0407
Й -> dec: 1049, hex: U+0419
К -> dec: 1050, hex: U+041A
Л -> dec: 1051, hex: U+041B
М -> dec: 1052, hex: U+041C
Н -> dec: 1053, hex: U+041D
О -> dec: 1054, hex: U+041E
П -> dec: 1055, hex: U+041F
Р -> dec: 1056, hex: U+0420
С -> dec: 1057, hex: U+0421
Т -> dec: 1058, hex: U+0422
У -> dec: 1059, hex: U+0423
Ф -> dec: 1060, hex: U+0424
Х -> dec: 1061, hex: U+0425
Ц -> dec: 1062, hex: U+0426
Ч -> dec: 1063, hex: U+0427
Ш -> dec: 1064, hex: U+0428
Щ -> dec: 1065, hex: U+0429
Ь -> dec: 1068, hex: U+042C
Ю -> dec: 1070, hex: U+042E
Я -> dec: 1071, hex: U+042F

Lowercase letters:
а -> dec: 1072, hex: U+0430
б -> dec: 1073, hex: U+0431
в -> dec: 1074, hex: U+0432
г -> dec: 1075, hex: U+0433
ґ -> dec: 1169, hex: U+0491
д -> dec: 1076, hex: U+0434
е -> dec: 1077, hex: U+0435
є -> dec: 1108, hex: U+0454
ж -> dec: 1078, hex: U+0436
з -> dec: 1079, hex: U+0437
и -> dec: 1080, hex: U+0438
і -> dec: 1110, hex: U+0456
ї -> dec: 1111, hex: U+0457
й -> dec: 1081, hex: U+0439
к -> dec: 1082, hex: U+043A
л -> dec: 1083, hex: U+043B
м -> dec: 1084, hex: U+043C
н -> dec: 1085, hex: U+043D
о -> dec: 1086, hex: U+043E
п -> dec: 1087, hex: U+043F
р -> dec: 1088, hex: U+0440
с -> dec: 1089, hex: U+0441
т -> dec: 1090, hex: U+0442
у -> dec: 1091, hex: U+0443
ф -> dec: 1092, hex: U+0444
х -> dec: 1093, hex: U+0445
ц -> dec: 1094, hex: U+0446
ч -> dec: 1095, hex: U+0447
ш -> dec: 1096, hex: U+0448
щ -> dec: 1097, hex: U+0449
ь -> dec: 1100, hex: U+044C
ю -> dec: 1102, hex: U+044E
я -> dec: 1103, hex: U+044F

Consequences of Unicode Order

Because Unicode code points are not arranged in Ukrainian alphabetical order, Ruby's default sort misplaces Ukrainian words, particularly those containing the letters "ґ", "є", "і", or "ї".
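A quick check based on the code points listed above suggests that the misplacement also depends on letter case: uppercase Є, І, Ї sit before А in the table, while their lowercase forms sit after я. The sample words below are chosen purely for illustration.

```ruby
# Capitalized words: Є (U+0404) and І (U+0406) precede А (U+0410),
# so they jump to the front of the list.
capitalized = ["Яблуко", "Єнот", "Іній", "Абрикос"].sort
p capitalized
# => ["Єнот", "Іній", "Абрикос", "Яблуко"]

# Lowercase words: є (U+0454) and і (U+0456) follow я (U+044F),
# so they fall to the end instead.
lowercase = ["яблуко", "єнот", "іній", "абрикос"].sort
p lowercase
# => ["абрикос", "яблуко", "єнот", "іній"]
```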

My solution: ukrainian_sort - a Ruby gem for correct sorting of Ukrainian words

To avoid this problem in my side projects, I created the ukrainian_sort library, which sorts words according to the official Ukrainian alphabet.
It compares words letter by letter, using its own letter order:
["г", "ґ", "е", "є", "и", "і", "ї", "й"]
require 'ukrainian_sort'

words = ["ґава", "груша", "єнот", "яблуко"]
sorted_words = UkrainianSort.sort(words)
p sorted_words
# => ["груша", "ґава", "єнот", "яблуко"]
And this is how Ruby sorts "out of the box":
words.sort
=> ["груша", "яблуко", "єнот", "ґава"]
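The letter-by-letter comparison described above can be sketched roughly as follows. This is not the source of ukrainian_sort; the names UK_ALPHABET, UK_RANK, uk_sort_key and uk_sort are hypothetical, and the handling of characters outside the alphabet is just one possible assumption.

```ruby
# Hypothetical sketch, not the ukrainian_sort source: rank every letter
# by its position in the Ukrainian alphabet and sort by those ranks.
UK_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"

# Map each letter to its index: "а" => 0, "б" => 1, ...
UK_RANK = UK_ALPHABET.each_char.with_index.to_h.freeze

# Build a sort key: an array of ranks, one per character.
# Assumption: characters outside the alphabet (digits, apostrophes,
# Latin letters) go after all Ukrainian letters, in code point order.
def uk_sort_key(word)
  word.downcase.each_char.map do |char|
    UK_RANK.fetch(char) { UK_ALPHABET.length + char.ord }
  end
end

# Arrays of ranks compare element by element, like words letter by letter.
def uk_sort(words)
  words.sort_by { |word| uk_sort_key(word) }
end

p uk_sort(["ґава", "груша", "єнот", "яблуко"])
# => ["груша", "ґава", "єнот", "яблуко"]
```

Because the key is built from downcased letters, words that differ only in case receive identical keys, so their relative order is not guaranteed in this sketch.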

Why does everything work "normally" in English?

The English alphabet is a simple set of Latin letters from A to Z that occupy continuous Unicode ranges (U+0041–U+005A for uppercase and U+0061–U+007A for lowercase). This means that, within a single letter case, the order of characters in Unicode matches the alphabetical order of the English language.
Therefore, Ruby's standard sorting (Array#sort or String#<=>), which compares characters by their Unicode code points, works correctly for English words written in a consistent case.
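One caveat is worth a small illustration: even for English, code point order puts every uppercase letter before every lowercase one, so mixed-case lists can still surprise. The word list is only an example.

```ruby
words = ["banana", "apple", "Cherry"]

# Plain sort compares code points, so uppercase "C" (U+0043)
# comes before lowercase "a" (U+0061).
p words.sort
# => ["Cherry", "apple", "banana"]

# Comparing case-folded copies gives the order a reader expects.
p words.sort_by(&:downcase)
# => ["apple", "banana", "Cherry"]
```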

In which other languages does the sorting problem exist?

The answer is simple: in many. A lot of languages have more complex alphabets where the order of letters in Unicode does not correspond to the linguistic order, for example:
  • German: ä, ö, ü, ß have a special order;
  • Swedish: adds the letters å, ä, ö after z;
  • Czech, Slovak, Polish: have diacritical letters that do not follow sequentially in Unicode;
  • French, Spanish, Portuguese: have their own sorting rules for apostrophes, tildes, and accents.

The Complexity of the Issue

Unicode assigns every character a universal number, but the numeric order of those code points is not a linguistic sort order. Sorting a language correctly usually requires special collation rules that take into account:
  • The order of letters,
  • Special characters,
  • Accents,
  • Whether to treat uppercase and lowercase letters as the same,
  • Ligatures and other linguistic features.
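To make the idea of collation rules a bit more concrete, here is a deliberately simplified sketch of a two-level sort key that ignores accents and case first and only then falls back to the original spelling. It roughly mimics how German dictionary ordering treats umlauts; the method name simple_collation_key is hypothetical, and real collation libraries do far more than this.

```ruby
# Illustrative only: a simplified two-level key, not a real collation
# implementation.
# Level 1: base letters (accents removed, case folded).
# Level 2: the original string, to break ties deterministically.
def simple_collation_key(word)
  decomposed = word.unicode_normalize(:nfd)        # "ä" becomes "a" + U+0308
  base = decomposed.gsub(/\p{Mn}/, "").downcase    # drop combining marks, fold case
  [base, word]
end

german = ["Zebra", "Äpfel", "Apfel", "Österreich", "Ofen"]

# Code point order pushes the umlauts behind "Z".
p german.sort
# => ["Apfel", "Ofen", "Zebra", "Äpfel", "Österreich"]

# The two-level key keeps "Ä" next to "A" and "Ö" next to "O".
p german.sort_by { |word| simple_collation_key(word) }
# => ["Apfel", "Äpfel", "Ofen", "Österreich", "Zebra"]
```

The same shortcut would be wrong for Ukrainian: canonical decomposition splits "й" into "и" plus a combining breve and "ї" into "і" plus a diaeresis, so stripping marks would merge distinct letters. That is exactly why collation rules have to be defined per language.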

How is this usually solved?

  • Through localized sorters (for example, ICU, International Components for Unicode). However, I couldn't quickly find a ready-made solution that suited me.
  • Special libraries for sorting (gems, packages).
  • Manual definition of the order for a specific language.
For English, the Unicode sequence of characters matches the alphabetical order, so sort works without issues. For most other languages, additional linguistic sorting rules are needed, because code point order only reflects how the code table is laid out; it is not a sorting algorithm.
ukrainian_sort (GitHub / RubyGems) is one of the first gems I have released publicly; I usually keep my solutions in private repositories. Feel free to open issues and pull requests: most likely, there is still something to improve.
