How do you troubleshoot character encoding problems?

If all you see is the ugly no-char boxes, what tools or strategies do you use to figure out what went wrong?

(The specific scenario I'm facing is no-char boxes within a <select> when it should be showing Japanese chars.)

Answer1:

Firstly, "ugly no-char boxes" might not be an encoding problem at all. They might just be a sign that you don't have a font installed that can display the glyphs in the page.

Most character encoding problems happen when strings are being passed from one system to another. For webapps, this is usually between the browser and the application, between the application and the filesystem, and between the application and the database.

So you need to check where the mis-encoded data is coming from, what character encoding it has at the source, and what encoding it is being received as. The best way is to send through characters you know the system is having problems with, and examine them at each level of the app. What do they look like inside the app? In the database? When you get them back from the database? When they're displayed in the browser?
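One way to "examine them at each level" is to log an ASCII-only fingerprint of a known test string at every hop, so the log output itself cannot be mangled. This is a minimal sketch; the `fingerprint` helper and the probe string are illustrative assumptions, not something from the question.

```python
def fingerprint(value):
    """Return an ASCII-only description of a str (code points) or bytes (hex)."""
    if isinstance(value, bytes):
        return "bytes: " + value.hex(" ")
    return "str: " + " ".join(f"U+{ord(ch):04X}" for ch in value)


probe = "日本"  # hypothetical test string known to cause trouble

# Log this inside the app, after the database round trip, and before output.
print(fingerprint(probe))
# What the same text should look like on the wire if it is encoded as UTF-8.
print(fingerprint(probe.encode("utf-8")))
```

The first layer where the fingerprint stops matching is where the conversion goes wrong.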

Sorry to be so general, but the question doesn't give much more to work with.

Answer2:

If the data you send to the browser becomes mangled (mojibake), you will get trash characters. Also, if you specify the wrong character set in your META headers, your browser will render the page incorrectly, causing mojibake again, sometimes in random places on the page.

When handling CJK character sets, you must be sure to use UTF8 character encoding throughout the lifetime of your program (data storage, retrieval, data manipulation in your code, displaying in the browser, etc.).

<strong>What is UTF8?</strong> UTF8 is a variable-length encoding of Unicode: each character is stored as one to four bytes. ASCII characters take a single byte, whereas most Japanese characters take three. If those bytes are read back using a different character set, or a multibyte sequence is cut in the middle, the result is what the Japanese call "mojibake".
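To make the variable length concrete, here is a small sketch that prints how many bytes a few sample characters occupy in UTF-8, then reproduces mojibake by decoding UTF-8 bytes as Latin-1. The helper name and sample characters are just illustrative choices.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in ["A", "é", "日", "😀"]:
    print(ch, utf8_length(ch), ch.encode("utf-8").hex(" "))

# Mojibake: correct UTF-8 bytes interpreted with the wrong character set.
utf8_bytes = "日本語".encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# Latin-1 maps every byte to a character, so here the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Not every wrong decoding can be undone like this; once characters have been replaced or dropped, the original bytes are gone.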

As a coder, from database to codebase to browser, you should try to use UTF8 throughout. For email you can use UTF8, but you will probably find most mail servers and clients are still old and use a mishmash of different character sets (e.g. ISO-2022-JP).

<strong>Database Settings</strong> If you are a MySQL user, make sure all connections to the DB use UTF8, and that all tables/fields use UTF8. By default, older versions of MySQL use the Latin-1 character set with a Swedish collation. Those kooky Swedes love their sense of humour!!

<strong>Checking your Codebase</strong> In my experience editors like Notepad++, Notepad2, UltraEdit, e, etc... all have UTF8 support problems. They mostly work, but since their developers don't use CJK languages themselves, they are not perfected. Issues like turning off BOM (Byte Order Mark), mangled tabs, poor character set conversion, etc ... all present problems.

I highly recommend using a proven UTF8 editor like Maruo. This is made by a Japanese company, but there is an English version (and a trial version) at http://www.hidemaru.interlink.or.jp/software/

Lastly, you may need to convert your source files to UTF8, especially if the codebase itself contains CJK language strings.
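If the original encoding of a file is known, the conversion itself is short. This is a rough sketch that assumes you already know the source encoding (Shift_JIS is only an example) and that rewriting the file in place is acceptable; `convert_to_utf8` is a hypothetical helper name.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding):
    """Rewrite a text file in place as UTF-8 (without a BOM)."""
    path = Path(path)
    # Many wrong guesses raise UnicodeDecodeError here, but single-byte
    # encodings such as Latin-1 accept any input, so check the result by eye.
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


# Example call (hypothetical file name):
# convert_to_utf8("legacy_strings.php", "shift_jis")
```

Keep a backup or use version control before converting, since a wrong guess at the source encoding can silently produce garbage.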

<strong>Manipulating Strings</strong> Every string function you use needs to be multibyte-safe. Notice I didn't say double-byte. UTF8 is not a double-byte encoding but a multibyte one: the number of bytes depends on the character being represented. In PHP you need to call the MB string functions specifically. Ruby and other languages have more transparent support, but you need to check the docs for your flavour of application server!
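The difference between byte-oriented and character-oriented operations is easy to see in a small experiment. The sketch below uses Python rather than PHP, where `bytes` plays roughly the role of PHP's byte-based `strlen`/`substr` and `str` the role of `mb_strlen`/`mb_substr`; the sample text is arbitrary.

```python
text = "日本語"
data = text.encode("utf-8")

print(len(text))  # counts characters
print(len(data))  # counts bytes

# Slicing characters is safe.
print(text[:2])

# Slicing bytes can cut a three-byte character in the middle; the
# incomplete tail is shown as U+FFFD (the replacement character).
truncated = data[:4].decode("utf-8", errors="replace")
print(repr(truncated))
```

Any code that truncates, pads, or splits strings by byte count is a likely source of stray replacement characters.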

<strong>META Tags</strong> Check out google.co.jp or yahoo.co.jp for their META headers. These are sites that know how to do it properly. Basically, include the following META tag in the document <HEAD>

<meta http-equiv="content-type" content="text/html; charset=utf-8">

It is usually safe to mix English HTML document language attributes with the above character set too. So adding the META tag above seems to work in an HTML document that has:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
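When checking a live page, it helps to confirm which charset the markup actually declares. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are made up for illustration, and it only looks at META tags, not at the HTTP `Content-Type` header, which should be checked separately.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 style: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older style: content="text/html; charset=utf-8"
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def declared_charset(html_text):
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


page = '<head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head>'
print(declared_charset(page))
```

If the declared charset does not match the encoding the bytes were really written in, the browser will decode the page incorrectly.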

<strong>Email</strong> This is a wholly different can of worms. UTF8 works much of the time, but many older Japanese clients still use ISO-2022-JP. This is not worth covering here.

<strong>Debugging UTF8 Issues</strong> Once you have a reliable UTF8 editor like Maruo, you can create static pages and resolve your issues.

Hope that helps

Answer3:

Redirect the data to disk and use a hex editor. Most text editors and viewers do their own conversions behind the scenes, so it is difficult to be sure you are seeing the data in its true form.
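If no hex editor is at hand, a few lines of script give a similar view. This is a rough sketch of a hex dump helper; the function name, layout, and the file name in the usage comment are arbitrary choices.

```python
def hexdump(data, width=16):
    """Format bytes as lines of: offset, hex values, printable ASCII."""
    pad = width * 3 - 1  # room for `width` two-digit values plus separators
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = chunk.hex(" ")
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {ascii_part}")
    return lines


# Typical use on data saved to disk (hypothetical file name):
# with open("response.bin", "rb") as f:
#     print("\n".join(hexdump(f.read())))

# A UTF-8 byte order mark (ef bb bf) followed by the letter A.
print("\n".join(hexdump(b"\xef\xbb\xbfA")))
```

Opening the file in binary mode (`"rb"`) matters here, because text mode would apply exactly the kind of hidden conversion you are trying to avoid.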
