Homoglyph
In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar but may have differing meaning. The designation is also applied to sequences of characters sharing these properties.
In 2008, the Unicode Consortium published its Technical Report #36[1] on a range of issues deriving from the visual similarity of characters, both within a single script and between different scripts.
Examples of homoglyphic symbols are (a) the
Related terms
The term homograph is sometimes misused synonymously with homoglyph, but in the usual linguistic sense, homographs are words that are spelled the same but have different meanings; this is a property of words, not characters.
Allographs are typeface design variants that look different but mean the same thing – for example ⟨g⟩ and ⟨g⟩, or a dollar sign with one or two strokes. The term synoglyph has a similar but slightly more abstract meaning – for example the symbol ⟨£⟩ and the letter ⟨L⟩ (in Lsd) both mean the pound sterling,[2] but only in that context. Allographs and synoglyphs are also known informally as display variants.
Umlaut and diaeresis
In the days of early mechanical typewriters these were typed with the same key (using the "backspace and over-type" technique), which was also used for a double inverted comma. However the umlaut originated specifically as a pair of short vertical lines (not two dots) (see
0 and O; 1, l and I
Two common and important sets of homoglyphs in use today are the digit zero and the capital letter O (i.e. 0 and O); and the digit one, the lowercase letter L and the uppercase i (i.e. 1, l and I). In the early days of mechanical typewriters there was very little or no visual difference between these glyphs, and typists treated them interchangeably as keyboarding shortcuts. In fact, most keyboards did not even have a key for the digit "1", requiring users to type the letter "l" instead, and some also omitted 0. As these same typists transitioned in the 1970s and 1980s to being computer keyboard operators, their old keyboarding habits continued with them and were an occasional source of confusion.
Most current type designs carefully distinguish between these homoglyphs, usually by drawing the digit zero narrower and drawing the digit one with prominent
Some type designs conform to the
An example of confusion due to near-homoglyphs arose from the use of a ⟨y⟩ to represent a ⟨þ⟩ (thorn). Early English typesetters imported Dutch typesets that did not contain the latter character, so they used the letter ⟨y⟩ instead because (in Blackletter typeface) the two look sufficiently similar.[6] This has led in modern times to such phenomena as Ye olde shoppe, implying incorrectly that the word the was formerly written ye /jiː/ rather than þe. The spelling of the name Menzies (pronounced Mengis and originally spelled Menʒies) arose for the same reason: the letter ⟨z⟩ was substituted for ⟨ʒ⟩ (yogh).
Multi-letter homoglyphs
Some other combinations of letters look similar, for instance rn looks similar to m, cl looks similar to d, and vv looks similar to w.
In certain narrow-spaced fonts (such as Tahoma), placing the letter c next to a letter such as j, l or i will create a homoglyph, such as cj cl ci (g d a).
When some characters are placed next to each other, seen together at a glance they give the visual impression of another, unrelated character. A more precise way of saying this is that some
Unicode homoglyphs
Efforts by
Relevant documentation can be found both on the developers' websites and on an IDN Forum[9] provided by ICANN.
In Cyrillic, Cyrillic
Canonicalization
Homoglyphs of all kinds can be detected through a process called 'dual canonicalization'.[4] The first step in this process is to identify homoglyph sets, namely characters appearing the same to a given observer. From here, a single token is specified to represent the homoglyph set. This token is called a canon. The next step is to convert each character in the text to the corresponding canon in a process called canonicalization. If the canons of two runs of text are the same but the original text is different, then a homoglyph exists in the text.
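As a minimal illustration of the steps above, the following Python sketch builds a lookup from a few homoglyph sets, canonicalizes two runs of text, and compares the results. The sets themselves, the choice of canon for each set, and the names `HOMOGLYPH_SETS`, `canonicalize` and `is_homoglyph_variant` are assumptions made for this example; a practical implementation would draw its sets from a maintained source such as the Unicode confusables data.

```python
# Hypothetical homoglyph sets: each key is the canon chosen to represent
# the characters listed as its members. Real sets would be far larger.
HOMOGLYPH_SETS = {
    "o": ["o", "O", "0", "\u043e", "\u03bf"],  # Latin o/O, zero, Cyrillic o, Greek omicron
    "l": ["l", "I", "1"],                      # lowercase L, uppercase i, digit one
    "a": ["a", "\u0430"],                      # Latin a, Cyrillic a
    "e": ["e", "\u0435"],                      # Latin e, Cyrillic ie
    "p": ["p", "\u0440"],                      # Latin p, Cyrillic er
    "c": ["c", "\u0441"],                      # Latin c, Cyrillic es
}

# Invert the sets into a character -> canon lookup table.
CANON = {ch: canon for canon, members in HOMOGLYPH_SETS.items() for ch in members}


def canonicalize(text):
    """Replace every character with its canon; unknown characters map to themselves."""
    return "".join(CANON.get(ch, ch) for ch in text)


def is_homoglyph_variant(a, b):
    """True when the original texts differ but their canonical forms are equal."""
    return a != b and canonicalize(a) == canonicalize(b)


if __name__ == "__main__":
    genuine = "paypal"
    spoofed = "p\u0430ypal"  # second letter is Cyrillic a
    print(is_homoglyph_variant(genuine, spoofed))
```

Under these assumed sets, `paypal` and the spoofed string share the same canonical form while differing in their original code points, so the comparison reports a homoglyph. Because the sets are defined relative to "a given observer", a different font or reader may call for different sets.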
Homoglyph prevention
Homoglyph attacks can be mitigated through a combination of user awareness and technical measures. Users should be taught about the risks of homoglyph attacks and encouraged to inspect URLs carefully before clicking.[10] Security tools that scan for homoglyph variations in domain names can automate the detection of potential threats. Domain name monitoring and stricter registration policies can also help identify and neutralize homoglyph-related risks promptly.
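One simple automated check on domain names, sketched below in Python, flags labels whose letters come from more than one script. The Python standard library does not expose the Unicode Script property, so this sketch approximates a character's script by the first word of its Unicode name (for example `LATIN` or `CYRILLIC`). That heuristic, and the function names `scripts_in_label` and `is_mixed_script`, are assumptions for illustration only and not a complete defence.

```python
import unicodedata


def scripts_in_label(label):
    """Return the approximate set of scripts used by the letters in one label.

    The script is guessed from the first word of the Unicode character name,
    which is a rough stand-in for the real Unicode Script property.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():  # ignore digits, hyphens and other non-letters
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(domain):
    """True if any dot-separated label mixes letters from different scripts."""
    return any(len(scripts_in_label(label)) > 1 for label in domain.split("."))


if __name__ == "__main__":
    print(is_mixed_script("example.com"))       # all Latin letters
    print(is_mixed_script("ex\u0430mple.com"))  # contains a Cyrillic a
```

A check of this kind does not catch a label written entirely in one look-alike script, nor same-script confusions such as 0 and O, so it complements rather than replaces a comparison against known homoglyph sets.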
See also
- IDN homograph attack – Visually similar letters in domain names
- Duplicate characters in Unicode – Unicode 2.0
- Vehicle registration plates of Bosnia and Herzegovina use only numbers and letters that look the same in the Latin and Cyrillic alphabets.
- Yaminjeongeum, South Korean language game of intentionally substituting Hangul characters for homoglyphs.
References
- ^ a b "UTR #36: Unicode Security Considerations". www.unicode.org.
- ^ Walton, Chas (October 7, 2020). "A writer's guide to diacritics and special characters". Text Wizard.
- ^ Describing these as homoglyphs is questionable as there are probably no languages in which the glyph can fulfil both these roles. It would be just as valid to describe, say, a grave accent as a homoglyph because it fulfils different roles in different languages.
- ^ ISBN 978-1-4673-2543-1.
- ^ Nigel Tao, Chuck Bigelow, and Rob Pike. "Go fonts: DIN Legibility Standard". 2016.
- ISBN 9780367581565.
The types used by Caxton and his contemporaries originated in Holland and Belgium, and did not provide for the continuing use of elements of the Old English alphabet such as thorn <þ>, eth <ð>, and yogh <ʒ>. The substitution of visually similar typographic forms has led to some anomalies which persist to this day in the reprinting of archaic texts and the spelling of regional words. The widely misunderstood 'ye' occurs through a habit of printer's usage that originates in Caxton's time, when printers would substitute the <y> (often accompanied by a superscript <e>) in place of the thorn <þ> or the eth <ð>, both of which were used to denote both the voiced and non-voiced sounds, /ð/ and /θ/ (Anderson, D. (1969) The Art of Written Forms. New York: Holt, Rinehart and Winston, p 169)
- ^ "UTR #36: Unicode Security Considerations". unicode.org.
- ^ "Register a .CA in French!". Archived from the original on 2013-03-28. Retrieved 2013-03-29.
- ^ "ICANN Email Archives: [idn-guidelines]". forum.icann.org.
- ^ https://governance.dev/phishing-domain-check, accessed on February 12, 2024
External links
- https://www.unicode.org/Public/security/latest/confusables.txt - recommended confusable mapping for IDN.