Character | Encoding | hex | dec (bytes) | dec | binary |
---|---|---|---|---|---|
z | UTF-8 | 7A | 122 | 122 | 01111010 |
é | UTF-8 | C3 A9 | 195 169 | 50089 | 11000011 10101001 |
We’ve been busy rebuilding Thread’s frontend codebase using React, and lately that’s included rebuilding our browse filters.
These filters allow customers to find products more easily by selecting which brands, colours or prices they want to see.
Because some of our filters contain a lot of options (we have over 500 brands available on Thread) we need them to be easily searchable, and sorted correctly.
Take the following list of brands:
const brands = [ | |
{ label: "Côte et Ciel", value: 1532 }, | |
{ label: "Études", value: 17 }, | |
{ label: "AllSaints", value: 2501 }, | |
{ label: "Samsøe & Samsøe", value: 1571 }, | |
{ label: "Ben Sherman", value: 2124 }, | |
{ label: "Drôle de Monsieur", value: 137 }, | |
{ label: "!Solid", value: 668 }, | |
]; |
Let’s see how we can best sort them intuitively for our users.
Sorting these in JavaScript initially seemed straightforward: just call sort()
on the brands array, passing a function to compare the two labels, eg:
brands.sort((a, b) => { | |
return a.label > b.label; | |
}); | |
=> [ | |
{ label: "!Solid", value: 668 }, | |
{ label: "AllSaints", value: 2501 }, | |
{ label: "Ben Sherman", value: 2124 }, | |
{ label: "Côte et Ciel", value: 1532 }, | |
{ label: "Drôle de Monsieur", value: 137 }, | |
{ label: "Samsøe & Samsøe", value: 1571 }, | |
{ label: "Études", value: 17 }, | |
] |
This appears to work, with the exception of Études
, which gets incorrectly
placed at the bottom of the list. JavaScript compares strings lexicographically,
but with limitations. It actually uses the character’s position in the Unicode
table to determine its ordering.
If we look the characters z and é we can see the following:
Character | Encoding | hex | dec (bytes) | dec | binary |
---|---|---|---|---|---|
z | UTF-8 | 7A | 122 | 122 | 01111010 |
é | UTF-8 | C3 A9 | 195 169 | 50089 | 11000011 10101001 |
Notice the decimal values for the characters; é
has a much higher value than
z
meaning it will be sorted after z
.
Fortunately, JavaScript has String.prototype.localeCompare
which meaning we
can simply rewrite the above function to be:
brands.sort((a, b) => { | |
return a.label.localeCompare(b.label); | |
}); | |
=> [ | |
{ label: "!Solid", value: 668 }, | |
{ label: "AllSaints", value: 2501 }, | |
{ label: "Ben Sherman", value: 2124 }, | |
{ label: "Côte et Ciel", value: 1532 }, | |
{ label: "Drôle de Monsieur", value: 137 }, | |
{ label: "Études", value: 17 }, | |
{ label: "Samsøe & Samsøe", value: 1571 }, | |
] |
Have a read of the docs to see what other arguments the localeCompare
function accepts.
Sorting was quite straightforward to solve, but what about searching?
Our browse filters provide a handy search input that filters large lists as you
type, however searching for Cote
was not yielding results for Côte et Ciel
.
const brands = [ | |
{ label: "Côte et Ciel", value: 1532 }, | |
{ label: "Études", value: 17 }, | |
{ label: "AllSaints", value: 2501 }, | |
{ label: "Samsøe & Samsøe", value: 1571 }, | |
{ label: "Ben Sherman", value: 2124 }, | |
{ label: "Drôle de Monsieur", value: 137 }, | |
{ label: "!Solid", value: 668 }, | |
]; | |
brands.filter(brand => { | |
return brand.label.indexOf("Cote") > -1; | |
}) | |
=> [] | |
brands.filter(brand => { | |
return brand.label.indexOf("Côte") > -1; | |
}) | |
=> [ | |
{ label: "Côte et Ciel", value: 1532 }, | |
] |
We didn’t think it was reasonable to expect our customers to know we’re using the brand spelling with diacritics, or to be able to easily input the right characters.
So in order to allow customers to find these brands, we need to remove the diacritical marks from the characters before comparing them with the search query.
We can do this in two parts.
First, we need to “decompose” the string so that any characters with diacritical marks are represented by their two-byte surrogate pair. This is a technique used in UTF-8 to represent a single character as a pair of two bytes.
We can do this using the normalize
function in JavaScript, and passing
the “NFD” form.
Looking at the docs, we can see that “NFD” stands for “Normalization Form Canonical Decomposition”–the important word being “Decomposition”.
We can see this working by comparing the string length before and after decomposition:
"Côte et Ciel".length | |
=> 12 | |
"Côte et Ciel".normalize('NFD').length | |
=> 13 |
This shows that the ô
character is now being represented by two bytes rather
than one.
Second, now we have separate bytes for the character and the diacritical mark,
we can remove the unwanted characters using the String.prototype.replace
function:
const brands = [ | |
{ label: "Côte et Ciel", value: 1532 }, | |
{ label: "Études", value: 17 }, | |
{ label: "AllSaints", value: 2501 }, | |
{ label: "Samsøe & Samsøe", value: 1571 }, | |
{ label: "Ben Sherman", value: 2124 }, | |
{ label: "Drôle de Monsieur", value: 137 }, | |
{ label: "!Solid", value: 668 }, | |
]; | |
brands.filter(brand => { | |
const decomposedLabel = | |
brand.label.normalize('NFD').replace(/[\u0300-\u036f]/g, ''); | |
return decomposedLabel.indexOf("Cote") > -1; | |
}) | |
=> [ | |
{ label: "Côte et Ciel", value: 1532 }, | |
] |
The replace
function takes a range of Unicode points from U+0300
to
U+036F
, covering any diacritic bytes we might have in the string.
Et voilà! This is a great technique to have in mind if you’re allowing your users to search lists of data that might include diacritics.