Searching and sorting text with diacritical marks in JavaScript

Paul Springett
29 August, 2018

We’ve been busy rebuilding Thread’s frontend codebase using React, and lately that’s included rebuilding our browse filters.

Searchable browse filters on Thread

These filters allow customers to find products more easily by selecting which brands, colours or prices they want to see.

Because some of our filters contain a lot of options (we have over 500 brands available on Thread) we need them to be easily searchable, and sorted correctly.

Sorting

Take the following list of brands:

const brands = [
 { label: "Côte et Ciel", value: 1532 },
 { label: "Études", value: 17 },
 { label: "AllSaints", value: 2501 },
 { label: "Samsøe & Samsøe", value: 1571 },
 { label: "Ben Sherman", value: 2124 },
 { label: "Drôle de Monsieur", value: 137 },
 { label: "!Solid", value: 668 },
];
view raw brands.js hosted with ❤ by GitHub

Let’s see how we can best sort them intuitively for our users.

Sorting these in JavaScript initially seemed straightforward: just call sort() on the brands array, passing a function to compare the two labels, eg:

brands.sort((a, b) => {
return a.label > b.label;
});
=> [
{ label: "!Solid", value: 668 },
{ label: "AllSaints", value: 2501 },
{ label: "Ben Sherman", value: 2124 },
{ label: "Côte et Ciel", value: 1532 },
{ label: "Drôle de Monsieur", value: 137 },
{ label: "Samsøe & Samsøe", value: 1571 },
{ label: "Études", value: 17 },
]
view raw brands-sort-fail.js hosted with ❤ by GitHub

This appears to work, with the exception of Études, which gets incorrectly placed at the bottom of the list. JavaScript compares strings lexicographically, but with limitations. It actually uses the character’s position in the Unicode table to determine its ordering.

If we look the characters z and é we can see the following:

Character Encoding hex dec (bytes) dec binary
z UTF-8 7A 122 122 01111010
é UTF-8 C3 A9 195 169 50089 11000011 10101001
view raw utf-8-chars.md hosted with ❤ by GitHub

Notice the decimal values for the characters; é has a much higher value than z meaning it will be sorted after z.

Fortunately, JavaScript has String.prototype.localeCompare which meaning we can simply rewrite the above function to be:

brands.sort((a, b) => {
return a.label.localeCompare(b.label);
});
=> [
{ label: "!Solid", value: 668 },
{ label: "AllSaints", value: 2501 },
{ label: "Ben Sherman", value: 2124 },
{ label: "Côte et Ciel", value: 1532 },
{ label: "Drôle de Monsieur", value: 137 },
{ label: "Études", value: 17 },
{ label: "Samsøe & Samsøe", value: 1571 },
]

Have a read of the docs to see what other arguments the localeCompare function accepts.

Searching

Sorting was quite straightforward to solve, but what about searching?

Our browse filters provide a handy search input that filters large lists as you type, however searching for Cote was not yielding results for Côte et Ciel.

const brands = [
{ label: "Côte et Ciel", value: 1532 },
{ label: "Études", value: 17 },
{ label: "AllSaints", value: 2501 },
{ label: "Samsøe & Samsøe", value: 1571 },
{ label: "Ben Sherman", value: 2124 },
{ label: "Drôle de Monsieur", value: 137 },
{ label: "!Solid", value: 668 },
];
brands.filter(brand => {
return brand.label.indexOf("Cote") > -1;
})
=> []
brands.filter(brand => {
return brand.label.indexOf("Côte") > -1;
})
=> [
{ label: "Côte et Ciel", value: 1532 },
]

We didn’t think it was reasonable to expect our customers to know we’re using the brand spelling with diacritics, or to be able to easily input the right characters.

So in order to allow customers to find these brands, we need to remove the diacritical marks from the characters before comparing them with the search query.

We can do this in two parts.

First, we need to “decompose” the string so that any characters with diacritical marks are represented by their two-byte surrogate pair. This is a technique used in UTF-8 to represent a single character as a pair of two bytes.

We can do this using the normalize function in JavaScript, and passing the "NFD" form.

Looking at the docs, we can see that “NFD” stands for “Normalization Form Canonical Decomposition”–the important word being “Decomposition”.

We can see this working by comparing the string length before and after decomposition:

"Côte et Ciel".length
=> 12
"Côte et Ciel".normalize('NFD').length
=> 13

This shows that the ô character is now being represented by two bytes rather than one.

Second, now we have separate bytes for the character and the diacritical mark, we can remove the unwanted characters using the String.prototype.replace function:

const brands = [
{ label: "Côte et Ciel", value: 1532 },
{ label: "Études", value: 17 },
{ label: "AllSaints", value: 2501 },
{ label: "Samsøe & Samsøe", value: 1571 },
{ label: "Ben Sherman", value: 2124 },
{ label: "Drôle de Monsieur", value: 137 },
{ label: "!Solid", value: 668 },
];
brands.filter(brand => {
const decomposedLabel =
brand.label.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
return decomposedLabel.indexOf("Cote") > -1;
})
=> [
{ label: "Côte et Ciel", value: 1532 },
]

The replace function takes a range of Unicode points from U+0300 to U+036F, covering any diacritic bytes we might have in the string.

Et voilà! This is a great technique to have in mind if you’re allowing your users to search lists of data that might include diacritics.

Searching and sorting text with diacritical marks in JavaScript

Paul Springett
29 August, 2018