Sorting – Language-Agnostic Specification for String Natural Sorting Order

Tags: comparison, language-agnostic, sorting, strings

While searching for a good natural sorting algorithm written in JavaScript, I came across many different implementations, as well as interesting blog posts and answers on Stack Overflow.

Each implementation has its own technical tricks, but the more I looked into it, the clearer one question became: is there actually a language-agnostic specification for the natural sorting ORDER of strings?

I mean, if not, then how could one expect to write a piece of code that is actually "correct for everyone" or "agreed on by the community"? I would have expected a spec stating the result of the compromises/decisions made, at least for English, as it is simple (no accents/diacritics) …

Note that I wrote "language-agnostic" because I would expect this spec to be used to implement solutions in different programming languages, not only in JavaScript, C# or Java.

Best Answer

The algorithms that determine which of two strings comes first are called collation algorithms, and the sort order they produce is called the collation order.
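In code, a collation algorithm shows up as a comparator: a function that takes two strings and returns a negative number, zero, or a positive number. The following is only a minimal sketch of one possible "natural" comparator, not a specified or agreed-upon order. The names `chunks` and `naturalCompare` are made up for illustration, and every decision in it (digit runs compared by value, leading zeros ignored, everything else compared by code unit) is an assumption that another implementation could reasonably make differently.

```javascript
// Hypothetical helper: split a string into alternating digit and non-digit runs.
function chunks(s) {
  return s.match(/\d+|\D+/g) || [];
}

// One possible "natural" comparator; returns <0, 0, or >0 like any comparator.
function naturalCompare(a, b) {
  const ca = chunks(a);
  const cb = chunks(b);
  const n = Math.min(ca.length, cb.length);
  for (let i = 0; i < n; i++) {
    const x = ca[i];
    const y = cb[i];
    if (/^\d/.test(x) && /^\d/.test(y)) {
      // Both runs are digits: compare by numeric value.
      // BigInt avoids precision loss on very long digit runs.
      const dx = BigInt(x);
      const dy = BigInt(y);
      if (dx !== dy) return dx < dy ? -1 : 1;
    } else if (x !== y) {
      // Plain code unit comparison: an arbitrary choice, not a linguistic one.
      return x < y ? -1 : 1;
    }
  }
  // All shared runs are equal: the string with fewer runs sorts first.
  return ca.length - cb.length;
}

const names = ["file10", "file2", "File1", "file1"];
const naturallySorted = [...names].sort(naturalCompare);
console.log(naturallySorted); // [ 'File1', 'file1', 'file2', 'file10' ]
```

Note that this sketch treats "file01" and "file1" as equal and puts uppercase before lowercase. Those are exactly the kinds of decisions a specification would have to settle.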

Unfortunately, there is no globally agreed-upon collation order. To make matters worse, the correct sort order is not only language dependent, but can also differ between contexts within the same language.
One example of a language difference: in German, accented characters are ordered next to their unaccented counterparts (ö sorts immediately after o), whereas in Swedish they are separate letters at the end of the alphabet (ö comes after z). As for usage differences, phone books and dictionaries can use different sort orders; German phone book ordering, for instance, treats ö as oe, while German dictionary ordering treats it as o.
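The German/Swedish difference can be observed directly in JavaScript with the built-in `Intl.Collator`. This is a small illustration rather than a statement about any particular engine: it assumes the runtime ships collation data for both locales (recent Node.js releases and current browsers generally do), and the word list is made up for the example.

```javascript
const letters = ["z", "ö", "o"];

// German: ö is treated as a variant of o, so it sorts next to it.
const german = [...letters].sort(new Intl.Collator("de").compare);

// Swedish: ö is a separate letter at the end of the alphabet, after z.
const swedish = [...letters].sort(new Intl.Collator("sv").compare);

console.log(german);  // with full locale data: [ 'o', 'ö', 'z' ]
console.log(swedish); // with full locale data: [ 'o', 'z', 'ö' ]
```

The same input produces two different, equally "correct" results, which is why a single universal order cannot exist.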

Although there is no global collation order, some collation orders give a reasonable result regardless of the natural language the words are written in. There are also collation algorithms that can be tailored, either to give a generally reasonable sort order or to give exactly the correct order for a particular culture and usage.

One such algorithm is the Unicode Collation Algorithm (UCA), which is specified at http://www.unicode.org/reports/tr10/. It can be tailored for a wide range of collation orders and comes with a default configuration that gives a reasonable ordering for all Unicode code points. The algorithm does not depend on any particular programming language.
The introduction section of the standard gives a good overview of the difficulties in collating text correctly.
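For the JavaScript case from the question, the closest built-in to a tailorable collation algorithm is `Intl.Collator` from ECMA-402, which common engines implement on top of ICU's Unicode collation support. Its `numeric` option makes runs of digits compare by value, which is most of what people mean by "natural sorting". The sketch below assumes an English locale and case-insensitive comparison (`sensitivity: "base"`); both are choices made for the example, not part of any specification, and the file names are invented.

```javascript
const fileNames = ["file10.txt", "file2.txt", "File1.txt"];

// Default sort compares UTF-16 code units: uppercase first, "10" before "2".
const byCodeUnit = [...fileNames].sort();

// Locale-aware collation with numeric ordering of digit runs.
const collator = new Intl.Collator("en", { numeric: true, sensitivity: "base" });
const byCollation = [...fileNames].sort(collator.compare);

console.log(byCodeUnit);  // [ 'File1.txt', 'file10.txt', 'file2.txt' ]
console.log(byCollation); // [ 'File1.txt', 'file2.txt', 'file10.txt' ]
```

Swapping the locale argument changes the tailoring without touching the rest of the code, which is the practical benefit of building on a standardized collation algorithm instead of a hand-written comparator.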

Another algorithm is described in the international standard ISO/IEC 14651.

Besides the various national collation orders, there is also a standardized collation order for the European languages, called the European Ordering Rules (EOR).
