6. Using Unicode with MySQL++

6.1. A Short History of Unicode

...with a focus on relevance to MySQL++

In the old days, computer operating systems only dealt with 8-bit character sets. This only gives you 256 possible characters, but the modern Western languages alone have more characters than that between them. Add in all the other languages of the world, plus the various symbols people use, and you have a real mess! Since no standards body held sway over things like international character encoding in the early days of computing, many different character sets were invented. These character sets weren't even standardized between operating systems, so heaven help you if you needed to move localized Greek text on a Windows machine to a Russian Macintosh! The only way we got any international communication done at all was to build standards on the common 7-bit ASCII subset. Either people used approximations like a plain "c" instead of the French "ç", or they invented things like HTML entities ("&ccedil;" in this case) to encode these additional characters using only 7-bit ASCII.

Unicode solves this problem. It encodes every character in the world, using up to 4 bytes per character. The subset covering the most economically valuable cases takes two bytes per character, so most Unicode-aware programs deal in 2-byte characters, for efficiency.

Unfortunately, Unicode came about two decades too late for Unix and C. Converting the Unix system call interface to use multi-byte Unicode characters would break all existing programs. The ISO lashed a wide character sidecar onto C in 1995, but in common practice C is still tied to 8-bit characters.

As Unicode began to take off in the early 1990s, it became clear that some sort of accommodation with Unicode was needed in legacy systems like Unix and C. During the development of the Plan 9 operating system (a kind of successor to Unix) Ken Thompson invented the UTF-8 encoding. UTF-8 is a superset of 7-bit ASCII and is compatible with C strings, since no character other than NUL itself is encoded using a zero byte, unlike in the 2- and 4-byte Unicode encodings. As a result, many programs that deal in text will cope with UTF-8 data even though they have no explicit support for UTF-8. (Follow the last link above to see how the design of UTF-8 allows this.)
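To make the C string point concrete, here is a minimal sketch. The helper survives_c_string() and the two sample strings are made up for this illustration; they are not part of MySQL++. It checks whether a run of bytes comes through a null-terminated C string intact.

```cpp
#include <string>

// Returns true if the bytes pass through a null-terminated C string
// unchanged, i.e. no embedded zero byte cuts the text short.
// (Hypothetical helper for illustration only.)
bool survives_c_string(const std::string& bytes)
{
    return std::string(bytes.c_str()) == bytes;
}

// The two characters "Nü", written out byte by byte in two encodings.
// UTF-8: 'N' is 0x4E, same as ASCII; U+00FC becomes 0xC3 0xBC.
const std::string utf8_text("N\xC3\xBC", 3);
// UTF-16 (little-endian): every code unit is two bytes, so 'N' is 0x4E 0x00.
const std::string utf16le_text("N\x00\xFC\x00", 4);
```

With these definitions, survives_c_string(utf8_text) is true, while survives_c_string(utf16le_text) is false: the zero byte in the second position looks like a string terminator to any code that treats the data as a C string.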

The MySQL database server comes out of the Unix/C tradition, so it only supports 8-bit characters natively. All versions of MySQL could store UTF-8 data, but sometimes the server actually needs to understand the data; when sorting, for instance. To support this, explicit UTF-8 support was added to MySQL in version 4.1.

Because MySQL++ does not need to understand the text flowing through it, it neither has nor needs explicit UTF-8 support. C++'s std::string stores UTF-8 data just fine. But your program probably does care about the text it gets from the database via MySQL++. The remainder of this chapter covers the choices you have for dealing with UTF-8 encoded Unicode data in your program.
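One place where a program starts to care: std::string counts bytes, not characters. The following sketch (the function name utf8_code_points() is invented for this example, and it assumes the input is valid UTF-8) counts Unicode code points by skipping UTF-8 continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string. Continuation bytes always
// have the bit pattern 10xxxxxx, so every other byte starts a new code
// point. Assumes valid UTF-8; no error checking is done.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 string "Nürnberger Brats", size() reports 17 bytes while utf8_code_points() reports 16, because "ü" takes two bytes. Anything that pads, truncates, or aligns text by length needs to keep that difference in mind.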

6.2. Unicode and Unix

Modern Unices support UTF-8 natively. Red Hat Linux, for instance, has had system-wide UTF-8 support since version 8. This continues in the Enterprise and Fedora forks of Red Hat Linux, of course.

On such a Unix, the terminal I/O code understands UTF-8 encoded data, so your program doesn't require any special code to correctly display a UTF-8 string. If you aren't sure whether your system supports UTF-8 natively, just run the simple1 example: if the first item has two high-ASCII characters in place of the "ü" in "Nürnberger Brats", you know it's not handling UTF-8.

If your Unix doesn't support UTF-8 natively, it likely doesn't support any form of Unicode at all, for the historical reasons I gave above. Therefore, you will have to convert the UTF-8 data to the local 8-bit character set. The standard Unix function iconv() can help here. If your system doesn't have the iconv() facility, there is a free implementation available from the GNU Project. Another library you might check out is IBM's ICU. This is rather heavy-weight, so if you just need basic conversions, iconv() should suffice.
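As a rough sketch of the iconv() approach, the function below converts a UTF-8 std::string to a single-byte character set. The name convert_from_utf8() is made up for this example, and the code assumes the POSIX declaration of iconv() that takes char** for the input buffer; some older systems declare it with const char**, and some need you to link with -liconv.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Converts UTF-8 text to the named single-byte character set using POSIX
// iconv(). Throws if the conversion cannot be set up or if it fails, for
// example on malformed UTF-8 input.
// (Hypothetical helper; assumes an 8-bit target such as ISO-8859-1.)
std::string convert_from_utf8(const std::string& utf8, const char* to_charset)
{
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }

    std::string in(utf8);                // iconv() wants a non-const buffer
    std::string out(utf8.size(), '\0');  // 8-bit output never needs more
    char* in_ptr = &in[0];
    char* out_ptr = &out[0];
    size_t in_left = in.size();
    size_t out_left = out.size();

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        throw std::runtime_error("iconv failed");
    }

    out.resize(out.size() - out_left);   // trim the unused tail
    return out;
}
```

Calling convert_from_utf8("N\xC3\xBCrnberger", "ISO-8859-1") yields a 10-byte string whose second byte is 0xFC, the Latin-1 code for "ü". Which character set names are accepted depends on the iconv implementation on your system.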

6.3. Unicode and Win32

Each Win32 API function that takes a string actually has two versions. One version supports only 1-byte "ANSI" characters (a superset of ASCII); the names of these functions end in 'A'. Win32 also supports the 2-byte subset of Unicode called UCS-2. Some call these "wide" characters, so the names of the other set of functions end in 'W'. The name without a suffix is a macro, not a real function. Take MessageBox(), for instance: if you define the UNICODE macro when building your program, it evaluates to MessageBoxW(); otherwise, to MessageBoxA().

Since MySQL uses UTF-8 and Win32 uses UCS-2, you must convert data going between the Win32 API and MySQL++. There's no point in trying for portability here, since no other OS I'm aware of uses UCS-2, so you might as well use native Win32 functions for doing this translation. The following code is distilled from utf8_to_win32_ansi() in examples/util.cpp:

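void utf8_to_win32_ansi(const char* utf8_str, char* ansi_str, int ansi_len)
{
    wchar_t ucs2_buf[100];
    static const int ub_chars = sizeof(ucs2_buf) / sizeof(ucs2_buf[0]);

    // Step 1: UTF-8 to UCS-2. A length of -1 means utf8_str is
    // null-terminated, and the terminator is converted too.
    MultiByteToWideChar(CP_UTF8, 0, utf8_str, -1, ucs2_buf, ub_chars);

    // Step 2: UCS-2 to the system's OEM code page, which is what the
    // console uses by default.
    CPINFOEX cpi;
    GetCPInfoEx(CP_OEMCP, 0, &cpi);
    WideCharToMultiByte(cpi.CodePage, 0, ucs2_buf, -1,
            ansi_str, ansi_len, 0, 0);
}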

The examples use this function automatically on Windows systems. To see it in action, run simple1 in a console window (a.k.a. "DOS box"). The first item should be "Nürnberger Brats". If not, see the last paragraph in this section.

utf8_to_win32_ansi() converts utf8_str from UTF-8 to UCS-2, and from there to the local code page. "Waitaminnit," you shout! "I thought we were trying to get away from the problem of local code pages!" The console is one of the few Win32 facilities that doesn't support UCS-2 by default. It can be put into UCS-2 mode, but that seems like more work than we'd like to go to in a portable example program. Since the default code page in most versions of Windows includes the "ü" character used in the sample database, this conversion works out fine for our purposes.

If your program is using the GUI to display text, you don't need the second conversion. Prove this to yourself by adding the following to utf8_to_win32_ansi() after the MultiByteToWideChar() call:

MessageBoxW(0, ucs2_buf, L"UCS-2 version of Item", MB_OK);

All of this assumes you're using Windows NT or one of its direct descendants: Windows 2000, Windows XP, Windows 2003 Server, and someday Windows Vista. Windows 95/98/ME do not support UCS-2. They still have the 'W' APIs for compatibility, but they just smash the data down to 8-bit and call the 'A' version for you. (Windows CE is the opposite case: its API is built on Unicode, so it is the 'A' versions that are generally missing there.)

6.4. For More Information

The Unicode FAQs page has copious information on this complex topic.

When it comes to Unix and UTF-8 specific items, the UTF-8 and Unicode FAQ for Unix/Linux is a quicker way to find basic information.