Internationalized Domain Names (IDNs) are attractive. They allow people to express themselves in the multitude of languages this planet has to offer. However, they also allow scammers and phishers to trick you into believing a particular domain name is trustworthy when it is in fact a scam. The idea is that criminals can use homoglyphs, characters that look alike but are encoded differently, to make you believe you are visiting a trusted domain. For example, if I write the first two letters of a pretty well-known domain name with Cyrillic letters, like this: аоl.com, you won’t notice (unless your system lacks the glyphs needed to show Cyrillic). However, a name such as xn--l-7sb6b.com (its raw Punycode form) would raise some suspicion.
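To make the difference visible, here is a minimal sketch using Python's built-in `idna` codec (which implements the older IDNA 2003 rules). The variable names are mine, and the Cyrillic letters are written as escape sequences so they cannot be confused with their Latin look-alikes.

```python
# Cyrillic "а" (U+0430) and "о" (U+043E), followed by a Latin "l".
spoofed = "\u0430\u043El.com"
genuine = "aol.com"

# The built-in "idna" codec converts each non-ASCII label to Punycode.
encoded = spoofed.encode("idna").decode("ascii")

print(spoofed == genuine)  # False: the code points differ
print(encoded)             # xn--l-7sb6b.com

# Decoding the Punycode form gives back the original Unicode name.
print(encoded.encode("ascii").decode("idna") == spoofed)  # True
```

The two names render almost identically, yet they compare as different strings and only the spoofed one gets an `xn--` label.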
In my opinion the threat is real, but the countermeasures are not well thought out. The point of IDNs is to show domains in their native non-Latin character sets (similar systems for TLDs are in the works), so why would anyone not show them at all? Browsers such as Firefox and Internet Explorer 7 (on Vista) refuse to show the non-Latin form of an IDN if it contains certain characters.
Let’s take the domain name I recently registered, which reads сніжок.net. If you hover over the link in FF or IE, you will see this instead: http://xn--f1aihfm1k.net. The reason is simple: this IDN contains homoglyphs. So let’s dig into it a bit more.
For the curious ones: Сніжок is my nickname among Russian and Ukrainian friends. It is the Ukrainian form of Снежок, meaning roughly “snowball”. It stems from a certain dairy product I learned to love in Crimea during my first two weeks in Ukraine back in 2000. Or to put it more clearly: Сніжок это не только фруктовый кефир, Сніжок это стиль жизни (“Snizhok is not just a fruit kefir, Snizhok is a way of life”) 😉
The Polish transliteration would get closest with “Sniżok”. The “ż” (Cyrillic “ж”) is pronounced like “g in genre, s in pleasure, or zh (voiced retroflex fricative)”; its IPA form is /ʐ/. The rest just doesn’t have any English accent when pronounced. Properly transliterated, it’s written “Snizhok”.
Anyway, so this domain name contains homoglyphs. These are:
- С – Cyrillic S, visually identical to Latin C
- Н – Cyrillic N, visually identical to Latin H
- І – Cyrillic I (used in Ukrainian and Belarusian; in Russian replaced by И since 1918), visually identical to Latin I
- О – Cyrillic O, visually identical to Latin O
- К – Cyrillic K, visually similar to Latin K
The letter Ж clearly is not a homoglyph of any Latin letter. And that’s why I consider the approach of FF and IE silly. All parts of the domain name, except for the top-level domain (TLD), .net in this case, are Cyrillic and from the same Unicode range. There is no good reason whatsoever to change this name to its raw Punycode form, since it couldn’t be used for phishing, except with exceptionally dumb victims who “mistake” Ж for an X or so.
The problem is that those two browsers don’t seem to consider the context. They have a list of possible homoglyphs (check about:config in FF), and if your IDN contains any of them, bad luck. However, this completely defeats the purpose of IDNs.
Don’t get me wrong. This anti-phishing initiative is all nice and perhaps even useful to most people. However, without context, a great many completely legit IDNs will be subject to this mistreatment for no good reason. How many sentences can you fill without the letter “S” – or let’s take “N”, or “O” as a vowel, or “A” (Cyrillic: “А”, visually identical to Latin “A”)? Russian is a rich language, and when it comes to vowels it gets really hard to avoid them, just like in so many other languages. So why this mistreatment?
I hope this is a temporary measure that will be refined in the future. Let’s reconsider the аоl.com example from the beginning. In this case not all of the glyphs used are from the same Unicode range. Two are Cyrillic, while the L is clearly Latin. This is what I mean by context. That fact alone is a good indication of a phishing attempt. However, in general I’d prefer what we have in our justice system: the benefit of the doubt.
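The kind of context check I have in mind could be sketched roughly like this. Python's standard library has no direct "script" property, so this sketch assumes the first word of each character's Unicode name ("CYRILLIC", "LATIN", ...) is a good enough stand-in; the function names are made up for illustration.

```python
import unicodedata

def scripts_in_label(label: str) -> set:
    """Return the scripts used by the letters of one domain label.

    Heuristic (an assumption of this sketch): the first word of a
    character's Unicode name stands in for its script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()  # ignore digits and hyphens
    }

def is_mixed_script(label: str) -> bool:
    """True if the letters of the label come from more than one script."""
    return len(scripts_in_label(label)) > 1

# Cyrillic "а" + Cyrillic "о" + Latin "l"
print(is_mixed_script("\u0430\u043El"))                        # True
# "сніжок", written with escapes: every letter is Cyrillic
print(is_mixed_script("\u0441\u043d\u0456\u0436\u043e\u043a"))  # False
```

Only the label is checked here, not the TLD, which is why a Cyrillic name under .net does not count as mixed.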
To secure those parts which contain homoglyphs, checks could be done to see whether the name:
- consists only of homoglyphs (suspicious, even if all are from the same range)
- mixes homoglyphs from different ranges (very suspicious, but legit examples are imaginable)
- mixes non-homoglyphs and homoglyphs from the same range (not suspicious, because they can be told apart)
In the first instance, the respective domain registries should check these points and manually inspect the suspicious ones, even if it means a delay for the registrant. After that, browsers should only use automated, context-sensitive tests against a similar set of rules to tell phishing attempts apart from legit IDNs.
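The three rules above could be turned into an automated test along these lines. This is only a sketch under loud assumptions: `HOMOGLYPHS` is a hypothetical, deliberately tiny table of lowercase Cyrillic letters that resemble Latin ones (a real list would be much longer), the script is guessed from the first word of the Unicode character name, and any label that combines look-alikes with letters from another script is treated as "mixing ranges".

```python
import unicodedata

# Hypothetical mini-table: Cyrillic а, е, о, р, с, у, х, і.
HOMOGLYPHS = set("\u0430\u0435\u043e\u0440\u0441\u0443\u0445\u0456")

def script_of(ch: str) -> str:
    # Assumption: first word of the Unicode name approximates the script.
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def classify(label: str) -> str:
    """Classify one domain label according to the three rules."""
    letters = [ch for ch in label if ch.isalpha()]
    scripts = {script_of(ch) for ch in letters}
    lookalikes = [ch for ch in letters if ch in HOMOGLYPHS]

    if not lookalikes:
        return "not suspicious"   # nothing that could be confused
    if len(scripts) > 1:
        return "very suspicious"  # look-alikes mixed with another range
    if len(lookalikes) == len(letters):
        return "suspicious"       # nothing but look-alikes
    return "not suspicious"       # look-alikes plus distinct letters, one range

print(classify("\u0430\u043El"))                        # very suspicious
print(classify("\u0441\u043e\u0440\u0435"))             # suspicious
print(classify("\u0441\u043d\u0456\u0436\u043e\u043a"))  # not suspicious
```

The second example is the Cyrillic string "соре", which renders like the Latin word "cope"; the third is сніжок, which passes because letters such as ж give it away as Cyrillic.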
Just my two cents,
// Oliver
Update: link removed.