Sunday, September 7, 2008

(Almost) Everything I Know About Character Sets

Here is another issue which recently came up at work. Since I seem to get asked these questions about every year or so, I think I will write them up once and for all. To make this fit into my computer's memory, I'm going to be very imprecise, and use the term "character set" in places where I really should use "code point" or "encoding" or "code page". And I am never going to use the term "glyph". If you already understand the above, you probably do not need to read the rest.

But first, a little fable. Mr. A is about to insert a row of text into an Oracle database. Exactly what that text consists of is not important, other than that the 42nd character of the text is the euro symbol.

And I'm not going to try to write this symbol on this page, as the chances are good that the reader would not be able to see it. Just keep in mind that Mr. A has attempted to insert a euro character into an Oracle table.


Now, Mr. A is doing his work running SQL*Plus on Linux. He is using the ISO8859-15 character set. This is a modification of the original ISO8859-1 character set; it contains 256 "characters" and is a superset of the old 7-bit US ASCII character set. It is sometimes referred to (incorrectly) as "8 bit ASCII". If we were to do a memory probe of his session, we would find that the 42nd character of the text is coded as A4 hex (164 decimal).

The database into which this text gets inserted is using the UTF8 character set. In this character set, every character is coded using from 1 to 6 bytes. The euro is Unicode code point 20AC, which UTF8 encodes as the three bytes "E2 82 AC". A single A4 byte does not mean "euro" here at all; code point A4 is the universal default currency symbol.

No sooner is the row committed than Ms. B decides to fetch it. Her Windows system uses the Win-1252 character set, where the euro is represented by hex 80 (128 decimal). Yet she still manages to view the euro symbol correctly.
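
If you want to check those byte values for yourself, here is a quick sketch (I am using Python purely as a byte-inspection tool; nothing in it is Oracle-specific):

    # The same euro character, encoded under the three character sets in the fable.
    euro = "\u20ac"                           # the euro symbol, written as an escape

    print(euro.encode("iso8859_15").hex())    # a4      -- Mr. A's Linux session
    print(euro.encode("utf_8").hex())         # e282ac  -- the database's UTF8
    print(euro.encode("cp1252").hex())        # 80      -- Ms. B's Windows box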

Clearly, there is a whole lot of conversion going on.

The rest of this discussion is best delivered in a Q & A format, beginning with Question # 1:

Q1: So, how does the database manage all those conversions?

A: The database does not manage any conversions; they are all performed by the clients. Mr. A's and Ms. B's client software (SQL*Plus, Toad, SQL Developer, some home-grown application) knows what character set its own system is using and what character set the database is using, and it converts accordingly.
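
If you are curious what "both sides" looks like, here is a hypothetical check in Python. The cx_Oracle driver and the scott/tiger@localhost/orcl connection details are placeholders; substitute whatever you actually use:

    # Compare what the client was told (NLS_LANG) with what the database itself uses.
    import os
    import cx_Oracle

    client_setting = os.environ.get(
        "NLS_LANG", "<not set: the client will default to AMERICAN_AMERICA.US7ASCII>")

    connection = cx_Oracle.connect("scott", "tiger", "localhost/orcl")
    cursor = connection.cursor()
    cursor.execute(
        "select value from nls_database_parameters "
        "where parameter = 'NLS_CHARACTERSET'")
    db_charset, = cursor.fetchone()

    print("client NLS_LANG :", client_setting)
    print("database charset:", db_charset)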


Q2: The Oracle client software must be pretty smart. After reading your fable, I've been inspired to try to figure out what character set I am using, and I realize that I don't have the foggiest notion of where to start. Yet, Oracle is able to do this on the fly. Any hints for me?

A: Don't feel so bad; Oracle doesn't know what you are actually using. It just knows what you told it you were using.

Q3: You've outdone yourself in obscurity. Can you please expand on that last point?

A: The simple truth is, there is no way to know what characters a string of bytes represents unless you also know the character set that string has been encoded with. In order for Oracle to understand what the text means, you must give it this information. This is done via the NLS_LANG environment variable: you set NLS_LANG, and that tells the client software what character set you are using. If you fail to set it, the client will default to 7-bit US ASCII (more precisely, AMERICAN_AMERICA.US7ASCII). The only way to avoid character conversion problems is to (a) know what character set your system is actually using and (b) set an appropriate NLS_LANG value.
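
To see why the client cannot simply figure this out on its own, take the single byte from the fable and decode it under different assumptions (Python again, nothing Oracle-specific about it):

    # One byte, three different interpretations.
    raw = b"\xa4"

    print(raw.decode("iso8859_15"))   # the euro sign -- what Mr. A meant
    print(raw.decode("iso8859_1"))    # the generic currency sign
    print(raw.decode("cp1252"))       # also the generic currency sign on Windows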

Q4: Why can't I just ask my DBA what character set the database is using, and use that for my NLS_LANG?

A: Because your DBA might just play a trick on you and give you an honest answer. Then you are in all sorts of trouble, because you didn't follow the answer to Q3.

Actually, there is a more complete answer to this question. As you should have deduced by now, the Oracle client has to do a lot of work when it performs character set conversions. So Oracle introduced a little optimization: if the client sees that the database character set is the same as the one in its NLS_LANG, no conversion takes place. Now, before you run off and think that all your issues have been resolved, consider the implications of a character set mismatch.

Same situation as my fable, except that Mr. A has set NLS_LANG to the same value as the database. Now when he inserts his text, the 42nd character does not get converted to the UTF8 encoding of the euro ("E2 82 AC"); the raw byte "A4" is stored as-is. And if he comes back tomorrow and selects it, it will come back as a euro symbol, because on his personal system "A4" codes the euro symbol.

But what happens to poor Ms. B? Her client is going to fetch the same "A4" from the database, run it through the conversion routine, conclude that according to the database's character set it is not a euro but (at best) a generalized currency symbol, convert that to the corresponding code for Ms. B, and present the wrong character to her. (Unfortunately, that code also happens to be A4 on her system; I wasn't able to get every possible twist and turn into this fable.)
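
Here is a rough simulation of that round trip, with Python standing in for the two clients. The strict UTF8 decode at the end shows why the stored byte cannot come back out of the conversion as a euro:

    # Mr. A's NLS_LANG matches the database, so his euro byte is stored unconverted.
    stored = "\u20ac".encode("iso8859_15")            # the single byte A4 goes into the "UTF8" table as-is

    # Mr. A selects it tomorrow: again no conversion, so his terminal shows a euro.
    print(stored.decode("iso8859_15"))                # a euro once more

    # Ms. B's client does convert, starting from the database's declared UTF8.
    # A lone A4 is not the UTF8 euro (that would be E2 82 AC), so the best she can
    # hope for is a currency sign or, as here, the replacement character.
    print(stored.decode("utf_8", errors="replace"))   # not a euro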

Q5: If Windows can figure out what character set I am using just by looking at my file, why can't Oracle?

A: Because Bush hid the truth?

Seriously.

There is a well-documented bug (do a search on "Bush hid the truth") in Windows 2000/XP which exposes what happens when Windows tries to do exactly that. You can see this for yourself. Create a .txt document in Notepad containing just the one line "Bush hid the truth" (there are other lines which also work, but this one is the most famous). Save it and reopen it. You will find yourself looking at 9 Chinese characters. And why this happens is instructive.

As I said before, the plain fact is, given what is supposed to be a plain text file, one has no idea how to interpret it unless one is told what character set that file is using. And there is no place to store this metadata in a Windows .txt file (or in a *nix text file, for that matter). So when Windows tries to open that file, it must guess at its character set. In the case of a text file containing just the Windows-1252 characters for "Bush hid the truth", it guesses wrong: it decides the bytes are Unicode (UTF-16), and out come the Chinese characters.
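
You can reproduce the bad guess in a couple of lines; UTF-16 (little endian) stands in here for Notepad's "this looks like Unicode" decision:

    # Eighteen ordinary Windows-1252 bytes, reinterpreted as UTF-16, become nine
    # CJK ideographs -- which is just what Notepad shows.
    raw = "Bush hid the truth".encode("cp1252")   # 18 single-byte characters

    print(raw.decode("utf_16_le"))                # 9 CJK ideographs, no English in sight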

Q6: I've got a web server, an application server, a database server, and users in the USA, Canada, and Germany. What am I to do?

A: And you thought Jack Bauer had a tough life.

O.K., here's the way things are supposed to work, with some comments on where things can go wrong.

First of all, nobody just enters data into a web page; one really enters data into a web form. To get that web form displayed in your web browser, you must request it from a web server. And that web server, if it is configured correctly, is going to untangle this web I have woven by doing two things:

a) It is going to send a Content-Type header containing the character set name to your browser before it sends your page.

b) That page is going to contain a similar header as a meta tag right at the beginning, just in case the web server was not configured correctly.

Armed with this bit of Inside Information, your browser knows how to interpret the text on the web page. And it knows what character set to use when it returns the form after you have filled it out.
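
Here is a minimal sketch of points (a) and (b) together. It is a toy Python server, not anything you would deploy, and the page, the form, and the choice of ISO-8859-15 are assumptions made up for the example:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    CHARSET = "ISO-8859-15"

    PAGE = """<html><head>
    <!-- (b) the page repeats the character set in a meta tag, in case the header is missing -->
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-15">
    </head><body>
    <form method="post" accept-charset="ISO-8859-15">
      Amount: <input type="text" name="amount">
    </form>
    </body></html>"""

    class FormHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = PAGE.encode("iso8859_15")   # encode the page consistently with what we claim
            self.send_response(200)
            # (a) the Content-Type header tells the browser which character set to use
            self.send_header("Content-Type", "text/html; charset=" + CHARSET)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), FormHandler).serve_forever()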

Now, there are two things which can go wrong here. First, the webmaster may have failed to implement points (a) and (b) above, in which case your browser has no idea what character set it should use. In this case, the results are browser dependent. With Firefox, you set a default character set, which is used if the web page does not specify one. Internet Explorer tries to deduce which set to use. Usually it gets it right (except when Bush hides the truth).

The second thing which can go wrong is that the user can enter text using a different character set than the web page is expecting. Suppose the web page headers specify 8859-1, but the user is cutting and pasting from some other European character set, say 8859-15. What gets sent back to the web server are the byte strings for those characters, which have different meanings in the 8859-1 character set. So the user thinks he has entered a euro, and the web server thinks he has entered a currency sign.
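
In byte terms (assuming the pasted text really is in 8859-15), the mix-up looks like this:

    # The euro travels as the single byte A4, but the server trusts the page's declared 8859-1.
    pasted = "\u20ac".encode("iso8859_15")   # what actually arrives in the form post
    print(pasted.decode("iso8859_1"))        # the generic currency sign, not a euro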

Meanwhile, back at the data center, the application is trying to insert this data into an Oracle database. It should be clear by now that the application should have set NLS_LANG to be consistent with whatever character set the web server is trying to use. If it has, then the data should be entered into the database consistently. Or, at least, if there are any character conversion errors, they can be blamed on the user and the webmaster.

Once the data gets entered correctly, viewing it is a simple matter of getting NLS_LANG set up correctly on your client software. And in this case, the client is the web server (which was set up correctly in the last paragraph). If you are having trouble seeing in Toad what your users are seeing, check whether you are using the same NLS_LANG that they are.

For further reading, I recommend starting with these three sources:

Character sets in general: the Joel on Software article on character sets

Character sets and the web: this Sitepoint article

How Oracle handles them: The Oracle Globalization manual for your database version.