Databases, SQL and Foreign Character Sets: It'll be O.K. in the N
by Stefan Hudson
One of the challenges faced when creating SimplyLingo
was the storage, updating and retrieval of foreign character sets in a SQLServer database
using SQL. The compilation of 6,000 words and simple phrases in each language, although
time consuming and laborious, was technically easily achieved using an Excel spreadsheet.
It soon became apparent that transferring to and retrieving from a database was not so easy.
This short article is intended to assist inexperienced web developers or those not familiar
with combining databases and foreign words. Some elementary knowledge of SQL and database
development is assumed. If you don't have that knowledge and would like to learn you can
take a free on-line course here.
Standard character sets are not appropriate for most languages (esp. Greek) so it is
necessary to use UTF-8 (Unicode) character encoding to provide the representations required
for the full language sets, including accents. So when choosing a field-type in SQLServer
for a foreign word the normal "varchar" types are inadequate. What is needed is the "nvarchar"
types which take up twice as much space as the equivalent "varchar" types thus providing
more space for the additional combinations of the foreign sets. This then is the first
decision to make.
When you store foreign words use nvarchar types not varchar
Most website hosts provide tools for maintaining the database and many of these provide
a tool for bulk adding to a table from a comma delimited file (csv). If your web hosts don't
provide this then either change to another who does or learn how to add entries programmatically.
(Some ideas about this below).
So assuming you have a tool like this there is still one or two things needed to make it work.
First, you need to save your spreadsheet file in a text format using Unicode text. Usually
this is achieved by scrolling down to this type on the Save page in Excel. This probably won't
produce a csv file (comma separated variables) but a text file with fields separated by tab
characters. This means that as you import the file you will have to specify the file as TAB
delimited not comma. This is normally an option in the import tool.
You can inspect the import results using the query tool from the web hosts with a statement such as:
SELECT * FROM tablename;
Well and good if you can report all the rows you imported. If not you will need to find
the reason the import failed. Some of the reasons I came across at SimplyLingo:
- Null fields not specified as such.
- Unicode fields not using nvarchar types
- Wrong number of columns
- Null fields at end of row when last cell left blank resulting in fewer than expected columns
- Numeric fields trying to update character fields and vice versa
So, now you're ready to use the data and you write some code similar to this:
SELECT field1, field2….. FROM tablename WHERE field3 = ‘value';
Shock, horror – either the query does not work at all or it retrieves rows where many
of the columns are filled with "???????????????". - Why?
The SQL code needs to be told that the columns are Unicode.
The above code should be written thus:
Select field1, field2…. FROM tablename WHERE field3 = N'value';
The N tells SQL to expect Unicode characters.
Similarly if you wish to insert rows in the database use the following code (this is
what you must do to import if your database maintenance tool does not have an import function):
INSERT INTO tablename
(first-column, ... last-column)
VALUES (N'string1', N'string2', numeric1, N'string3'…);
The N is used each time the field is defined as Unicode and must immediately precede the quote.
Similarly for updates:
SET column1 = N‘string1',
Column2 = N'string2',
Column3 = numeric1…..
WHERE columnname = N'string';
Although the above seem relatively small tips, in reality if you are not used to dealing
with foreign character sets and are not experienced they could save you hours, days or even
weeks of trial-and-error and frustration. But as I found, it all comes good in the N.
About the Author
Stefan Hudson is a retired computer professional who has tried learning languages himself
(notably Greek). After using home-made flash cards for some time and with spare time on his
hands he developed the vocabulary training web site SimplyLingo.
SimplyLingo is a language learning vocabulary training web site. It provides up to 6,000
words and simple phrases in English, French, Greek and Spanish in an easy to use methodology.