Article provided by Wikipedia


Main article: Japanese language and computers

In Japanese, the phenomenon is, as mentioned, called mojibake (文字化け). It is a particular problem in Japan because of the many different encodings that exist for Japanese text. Alongside Unicode encodings such as UTF-8 and UTF-16, there are other standard encodings, such as Shift-JIS (Windows machines) and EUC-JP (UNIX systems). Mojibake is encountered not only by Japanese users but also, often, by non-Japanese users attempting to run software written for the Japanese market.

Chinese

In Chinese, the same phenomenon is called luàn mǎ (pinyin; Simplified Chinese: 乱码; Traditional Chinese: 亂碼; meaning "chaotic code"), and can occur when computerised text is encoded in one Chinese character encoding but is displayed using the wrong encoding. When this occurs, it is often possible to fix the issue by switching the character encoding, without loss of data. The situation is complicated by the several Chinese character encoding systems in use, the most common being Unicode, Big5, and Guobiao (with several backward-compatible versions), and by the possibility of Chinese characters being encoded using a Japanese encoding.

It's easy to identify the original encoding when luanma occurs in Guobiao encodings:

Original encoding | Example (garbled display) | Original text | Note
Big5 | 瓣в眏 | 三國志11威力加強版 | Lots of blank or undisplayable characters with occasional Chinese characters.
Shift-JIS | 暥帤壔偗僥僗僩 | 文字化けテスト | Kana are displayed as characters with the radical 亻, while kanji become other characters. Most of them are extremely uncommon and not in practical use in modern Chinese.
EUC-KR | 叼力捞钙胶 抛农聪墨 | 디제이맥스 테크니카 | Random common Simplified Chinese characters that in most cases make no sense. Easily identifiable because of the spaces between every few characters.
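
The Shift-JIS and EUC-KR cases above can be reproduced, and undone, in a few lines. What follows is a minimal sketch, assuming Python's built-in `shift_jis`, `euc_kr` and `gbk` codecs behave like the software that produced the examples; the helper names `view_as` and `undo` are hypothetical and exist only for this illustration.

```python
def view_as(text, written_as, read_as="gbk"):
    """Encode text in one encoding, then decode the bytes as another."""
    return text.encode(written_as).decode(read_as)


def undo(garbled, written_as, read_as="gbk"):
    """Reverse view_as: recover the bytes, then decode them correctly."""
    return garbled.encode(read_as).decode(written_as)


japanese = "文字化けテスト"
korean = "디제이맥스 테크니카"

# Bytes written as Shift-JIS or EUC-KR, but displayed as a Guobiao encoding.
garbled_ja = view_as(japanese, "shift_jis")
garbled_ko = view_as(korean, "euc_kr")
print(garbled_ja)
print(garbled_ko)

# Switching back to the right encoding restores the text without loss.
print(undo(garbled_ja, "shift_jis"))
print(undo(garbled_ko, "euc_kr"))
```

The recovery step works here because every byte pair in these two strings happens to map to a GBK character, so no information is discarded along the way. When the wrong decoder drops or replaces bytes it cannot map, the original text can no longer be restored this way.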

An additional problem is caused when encodings are missing characters, which is common with rare or antiquated characters that are still used in personal or place names. Examples missing from Big5 are the "煊" in the name of Taiwanese politician Wang Chien-shien (Chinese: 王建煊; pinyin: Wáng Jiànxuān), the "堃" in that of Yu Shyi-kun (simplified Chinese: 游锡堃; traditional Chinese: 游錫堃; pinyin: Yóu Xíkūn), and the "喆" in that of singer David Tao (Chinese: 陶喆; pinyin: Táo Zhé). The "镕" in the name of ex-PRC Premier Zhu Rongji (Chinese: 朱镕基; pinyin: Zhū Róngjī) is missing from GB2312, and the copyright symbol "©" is missing from GBK.[8]
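
Whether a given encoding can represent a character can be checked by attempting to encode it. The sketch below assumes Python's bundled `big5`, `gb2312` and `gbk` codecs; their tables may include or omit vendor extensions, so the output for a particular character reflects the installed codec rather than the historical standards discussed above. The helper `can_encode` is a hypothetical name.

```python
def can_encode(char, codec):
    """Return True if the codec has a byte sequence for this character."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Characters named in the paragraph above, paired with the encoding in question.
checks = [
    ("煊", "big5"),
    ("堃", "big5"),
    ("喆", "big5"),
    ("镕", "gb2312"),
    ("镕", "gbk"),
    ("©", "gbk"),
    ("©", "utf-8"),
]

for char, codec in checks:
    status = "representable" if can_encode(char, codec) else "missing"
    print(f"{char} in {codec}: {status}")
```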

Newspapers have dealt with this problem in various ways, including using software to combine two existing, similar characters; using a picture of the personality; or simply substituting a homophone for the rare character in the hope that the reader would be able to make the correct inference.

Indic text

A similar effect can occur in Brahmic or Indic scripts of South Asia, used in such Indo-Aryan or Indic languages as Hindustani (Hindi-Urdu), Bengali, Punjabi, Marathi, and others, even if the character set employed is properly recognized by the application. This is because, in many Indic scripts, the rules by which individual letter symbols combine to create symbols for syllables may not be properly understood by a computer missing the appropriate software, even if the glyphs for the individual letter forms are available.

A particularly notable example of this is the old Wikipedia logo, which attempts to show the character analogous to "wi" (the first syllable of "Wikipedia") on each of many puzzle pieces. The puzzle piece meant to bear the Devanagari character for "wi" instead displayed the "wa" character followed by an unpaired "i" modifier vowel, easily recognizable as mojibake generated by a computer not configured to display Indic text.[9] The redesigned logo, introduced in May 2010, fixed these errors.
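
The stored form of such a syllable helps explain the failure. In Unicode, the Devanagari "wi" is held as two code points in logical order, the consonant first and the dependent vowel sign second, and it is up to the rendering software to draw the vowel sign in its proper place beside the consonant. A short sketch using Python's standard `unicodedata` module (the variable name `wi` is only illustrative):

```python
import unicodedata

# The "wi" syllable: consonant followed by the dependent vowel sign.
wi = "\u0935\u093f"

for ch in wi:
    # Print the code point, its Unicode name, and its general category.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

The second code point is a combining mark rather than a letter, which is why a system without Indic shaping support shows it as a separate, unpaired sign.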

Plain text relies on the operating system to provide a font that displays Unicode code points. For Sinhala, this font differs from one operating system to another, and on all of them it produces orthographically incorrect glyphs for some letters (syllables). For instance, the 'reph', the short form of 'r', is a diacritic that normally sits on top of a plain letter. It is wrong for it to sit on top of some letters, such as 'ya' or 'la', yet this happens in all operating systems. This appears to be a fault in the internal programming of the fonts. On Macintosh and iPhone, the combination of muurdhaja l (dark l) with 'u', and its long form, both yield wrong shapes.

Some Indic and Indic-derived scripts, most notably Lao, were not officially supported by Windows XP and gained official support only with the release of Vista.[10] However, various sites have made fonts available as free downloads.

African languages

In certain writing systems of Africa, unencoded text is unreadable. Texts that may produce mojibake include those from the Horn of Africa such as the Ge'ez script in Ethiopia and Eritrea, used for Amharic, Tigre, and other languages, and the Somali language, which employs the Osmanya alphabet. In Southern Africa, the Mwangwego alphabet is used to write languages of Malawi and the Mandombe alphabet was created for the Democratic Republic of the Congo, but these are not generally supported. Various other writing systems native to West Africa present similar problems, such as the N'Ko alphabet, used for Manding languages in Guinea, and the Vai syllabary, used in Liberia.

Arabic

Another affected language is Arabic (see below). The text becomes unreadable when the encodings do not match.

Examples

Arabic example: Universal Declaration of Human Rights
Browser rendering: الإعلان العالمى لحقوق الإنسان

File encoding | Setting in browser | Result
UTF-8 | Windows-1252 | Ø§Ù„Ø¥Ø¹Ù„Ø§Ù† Ø§Ù„Ø¹Ø§Ù„Ù…Ù‰ Ù„ØÙ‚ÙˆÙ‚ Ø§Ù„Ø¥Ù†Ø³Ø§Ù†
UTF-8 | KOI8-R | О╩©ь╖ы└ь╔ь╧ы└ь╖ы├ ь╖ы└ь╧ь╖ы└ы┘ы┴ ы└ь╜ы┌ы┬ы┌ ь╖ы└ь╔ы├ьЁь╖ы├
UTF-8 | ISO 8859-5 | яЛПиЇй�иЅиЙй�иЇй� иЇй�иЙиЇй�й�й� й�ий�й�й� иЇй�иЅй�иГиЇй�
UTF-8 | CP 866 | я╗┐╪з┘Д╪е╪╣┘Д╪з┘Ж ╪з┘Д╪╣╪з┘Д┘Е┘Й ┘Д╪н┘В┘И┘В ╪з┘Д╪е┘Ж╪│╪з┘Ж
UTF-8 | ISO 8859-6 | ُ؛؟ظ�ع�ظ�ظ�ع�ظ�ع� ظ�ع�ظ�ظ�ع�ع�ع� ع�ظع�ع�ع� ظ�ع�ظ�ع�ظ�ظ�ع�
UTF-8 | ISO 8859-2 | Ř§Ů�ŘĽŘšŮ�Ř§Ů� Ř§Ů�ŘšŘ§Ů�Ů�Ů� Ů�ŘŮ�Ů�Ů� Ř§Ů�ŘĽŮ�ŘłŘ§Ů�
Windows-1256 | Windows-1252 | ÇáÅÚáÇä ÇáÚÇáãì áÍÞæÞ ÇáÅäÓÇä

Swedish example: Smörgås (open sandwich)

File encoding | Setting in browser | Result
MS-DOS 437 | ISO 8859-1 | Sm”rg†s
ISO 8859-1 | Mac Roman | SmˆrgÂs
UTF-8 | ISO 8859-1 | SmÃ¶rgÃ¥s
UTF-8 | Mac Roman | Sm√∂rg√•s

Russian example: Кракозябры (krakozyabry, garbage characters)

File encoding | Setting in browser | Result
MS-DOS 855 | ISO 8859-1 | Æá ÆÖóÞ¢áñ
KOI8-R | ISO 8859-1 | ëÒÁËÏÚÑÂÒÙ
UTF-8 | KOI8-R | п я─п╟п╨п╬п╥я▐п╠я─я▀
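
Mismatches of the kind listed in these examples can be reproduced by encoding a string with the file's encoding and decoding the resulting bytes with the browser's setting. This is a minimal sketch, assuming Python's `utf-8`, `latin-1` (ISO 8859-1), `mac_roman` and `koi8_r` codecs correspond to the encodings named above; `misread` is a hypothetical helper name.

```python
def misread(text, file_encoding, browser_setting):
    """Encode with the file's encoding, decode with the browser's setting."""
    return text.encode(file_encoding).decode(browser_setting)


swedish = "Smörgås"
russian = "Кракозябры"

# UTF-8 bytes read as single-byte encodings: each accented letter
# turns into two unrelated characters.
print(misread(swedish, "utf-8", "latin-1"))
print(misread(swedish, "utf-8", "mac_roman"))

# One single-byte encoding read as another: the length stays the same,
# but the non-ASCII letters change.
print(misread(swedish, "latin-1", "mac_roman"))
print(misread(russian, "koi8_r", "latin-1"))
```

Because ASCII letters occupy the same byte values in all of these encodings, only the non-ASCII characters are affected, which is why the Swedish word stays partly legible while the Russian one does not.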

See also

HTML escaping of special characters is a related case. While failure to apply this transformation is a vulnerability (see cross-site scripting), applying it too many times results in garbling of these characters. For example, the quotation mark " becomes &quot;, &amp;quot;, &amp;amp;quot; and so on.
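
The effect of escaping too many times can be seen with Python's standard `html` module. This is a small sketch; `escape_times` is a hypothetical helper that simply applies `html.escape` repeatedly.

```python
import html


def escape_times(text, times):
    """Apply HTML escaping to text the given number of times."""
    for _ in range(times):
        text = html.escape(text)
    return text


# Each extra pass escapes the ampersand introduced by the previous pass.
for times in range(4):
    print(times, escape_times('"', times))

# Unescaping once undoes only one layer of escaping.
print(html.unescape(escape_times('"', 2)))
```

Escaping exactly once at output time, and never on data that is already escaped, avoids both the vulnerability and the garbling.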

References

  1. ^ a b "Will Unicode soon be the universal code?" IEEE Spectrum, vol. 49, issue 7, p. 60 (July 2012). The advantage of Unicode is that if everyone adopted it, it would eradicate the problem of mojibake, Japanese for "character transformation." Mojibake is the jumble that results when characters are encoded in one system but decoded in another.
  2. ^ "Guidelines for extended attributes". 2013-05-17. Retrieved 2015-02-15. 
  3. ^ "Unicode mailinglist on the Eudora email client". 2001-05-13. Retrieved 2014-11-01. 
  4. ^ p. 141, Control + Alt + Delete: A Dictionary of Cyberslang, Jonathon Keats, Globe Pequot, 2007, ISBN 1-59921-039-8.
  5. ^ "Usage of Windows-1251 for websites". 
  6. ^ "Declaring character encodings in HTML". 
  7. ^ "sms-scam". June 18, 2014. Retrieved June 19, 2014. 
  8. ^ "PRC GBK (XGB)". Archived from the original on 2002-10-01. Conversion map between Code page 936 and Unicode. GB18030 or GBK must be selected manually in the browser to view it correctly.
  9. ^ Cohen, Noam (June 25, 2007). "Some Errors Defy Fixes: A Typo in Wikipedia's Logo Fractures the Sanskrit". The New York Times. Retrieved July 17, 2009. 
  10. ^ "Content Moved (Windows)". Msdn.microsoft.com. Retrieved 2014-02-05. 
