A technical challenge: mini-Unicode in 4096 characters?

Postby gsteemso » Sun 27 Sep 2009 10:47 am

Hi there!

I am a retrocomputing hobbyist who has gotten sidetracked from designing a homebrew CPU into questions of what writing systems the finished machine ought to be able to handle. I am using a 12-bit byte size (why? ’coz it’s fun!) and a text-screen mode modelled loosely after those of the Commodore 128, which means I can display any mixture of up to 4096 glyphs, in 12x16 pixel display cells. Please note that these code points specify glyphs, not characters — a single Arabic letter would eat as many as four of them for positional variants; even more with vowel diacritics added.

Here’s the catch — With the intention of someday learning as many of the associated languages as possible, I’d like the machine to be able to handle a minimal but useable subset of any of the writing systems commonly found in the Seattle, WA / Vancouver, BC area of North America. That list contains at least eleven different writing systems, maybe more (see the list below); and since three of them are Japanese and Traditional and Simplified Chinese, fitting a useable subset into only 4096 glyphs seems problematic at best.

I had been thinking of having a second glyph set of the same size, selectable by flipping a flag in a register somewhere. That would allow me to choose from among 8192 glyphs to put on screen at any given time, albeit in a rather clumsy manner. Still, it works for the Commodore 128. Alternatively, it occurs to me that I could do something with a variable-length code and double-width characters for the ones that take up two bytes. I think that's the technique used in Asian releases of early-80s microcomputers, but with 12-bit bytes that makes for a potentially enormous amount of ROM needed to store the screen font. In the extreme case, if every glyph took two bytes (24 bits), I’d have 2^24 = 16,777,216 code points, roughly fifteen times as many as all of Unicode: plainly absurd.
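To put numbers on that extreme case (the half-and-half lead/trail split in the second scheme below is purely an illustrative assumption, not a design decision):

```python
# Back-of-envelope code-space arithmetic for 12-bit storage units.
UNIT = 2 ** 12                      # 4096 possible values per unit

# Worst case above: every glyph takes two units (24 bits).
two_unit_everything = UNIT * UNIT   # 16,777,216 code points
print(two_unit_everything)          # vs. Unicode's 1,114,112 total

# Friendlier split (illustrative assumption): the bottom half of the
# unit range encodes a glyph on its own, the top half acts as a lead
# value that must be followed by one more unit.
single = UNIT // 2                  # 2048 one-unit glyphs
double = (UNIT // 2) * UNIT         # 8,388,608 two-unit glyphs
print(single + double)              # 8,390,656 possible glyphs
```

Even the friendlier split offers millions of code points, which is exactly why the ROM needed to store a font for all of them is the real constraint, not the code space itself.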

Alphabetical list of what I’d like to cram into 4096 code points:

• Arabic, for Persian (among others?) — will need a lot of extra glyphs due to positional variation of letterforms

• Cyrillic, for Russian (among others?)

• Devanagari, for Hindi — a beautiful writing system but very complex; how did the Indians represent it in the 80s?

• Gurmukhi, for Punjabi

• Japanese (at least the Kana) — Japanese newspapers stick to a list of the 1945 most common Kanji; would those be enough?

• Korean (combining Jamo only, similar to Unicode’s; if the Koreans don’t really need Hanja, I don’t either)

• Latin, for Tagalog, Vietnamese, and various European and First Nations languages — many, many accented forms

• Simplified and Traditional Chinese (Bopomofo might also be good, or at least less visually jarring than Pinyin, to fill in for missing Hanzi) — each said to need ≈3000–6000 Hanzi depending on the target audience, albeit with considerable overlap (but how much?)

• Thai — complex, but not nearly as bad as Arabic or Devanagari

• And, of course, several dozen punctuation characters, mathematical signs, and so on.

• I’d also like to include a subset of Canadian Syllabics (at least big enough to write Inuktitut), for no other reason than my own curiosity.

• Technically, Cambodian has more right to be on this list than some of the ones already discussed, but I have no interest in learning it. Why? Darned if I know. Perhaps it is the similarity in some ways to Thai? Likewise, Cantonese speakers outnumber Mandarin (Han) speakers in the city of Vancouver by two to one, but as of now I don’t intend to learn more than one of the two languages — probably Mandarin, as I’d apparently need it to understand most written Chinese. Cantonese apparently needs over a thousand extra Hanzi to write, too.

All of the above seems to add up to only about 1500–3500 glyphs, until you get to the several types of Hanzi and the Kanji. Even if a substantial portion can be “unified” as Unicode does, which is questionable given that there will only be one display font, that there is the budget-buster. Has anyone got any advice?
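Just to sanity-check that figure, here is an illustrative tally; every count below is a loose placeholder for the sake of the arithmetic, not a researched figure for the script in question:

```python
# Illustrative tally of the non-Hanzi glyph budget. Every count below
# is a loose placeholder, NOT a researched figure for that script.
low_high = {
    "Arabic (positional forms)":   (200, 300),
    "Cyrillic":                    (70, 130),
    "Devanagari (with conjuncts)": (250, 600),
    "Gurmukhi":                    (80, 150),
    "Kana":                        (100, 200),
    "Korean Jamo":                 (70, 250),
    "Latin (precomposed accents)": (450, 800),
    "Thai":                        (90, 150),
    "Canadian Syllabics":          (100, 650),
    "Punctuation, digits, math":   (100, 200),
}
low = sum(a for a, b in low_high.values())
high = sum(b for a, b in low_high.values())
print(low, high)                # lands in the 1500-3500 ballpark

# What's left of the 4096-point budget for Hanzi/Kanji:
print(4096 - high, 4096 - low)  # nowhere near the ~3000-6000 needed
```

Whatever the exact per-script counts turn out to be, the shape of the problem is the same: the leftovers after everything else cannot cover even the smaller Hanzi estimates.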

Hopefully,
gsteemso
The world’s only gsteemso
gsteemso
 
Posts: 20
Joined: Fri 14 Aug 2009 2:43 am
Location: near Seattle, WA

Re: A technical challenge: mini-Unicode in 4096 characters?

Postby Stosis » Wed 30 Sep 2009 4:25 pm

Impressive to say the least. When you say 12 bit bytes does that mean that you'd have 12 digits instead of 8 (011001001001 instead of 01100110)? You could be thrifty and reuse similar looking letters or letters in closely related writing systems that basically look the same. By this I mean if you decided to have the Greek and Latin alphabets you could leave out the areas in which they overlap or where the only difference is something like a serif (compare Latin i and Greek ι). You'll end up with some odd looking fonts but it might help.

I'm not sure how these things work, but is it possible to reuse accents/diacritics by "building" the glyphs? First input the letter you want accented, then push the button for the accent, reusing the same accent glyph.

Is it possible to have more of these switches (or maybe a dial would work better)? You could simply have a position on the dial for each writing system and switch them at will.
Stosis
 
Posts: 71
Joined: Sun 19 Apr 2009 11:32 am

Re: A technical challenge: mini-Unicode in 4096 characters?

Postby gsteemso » Wed 30 Sep 2009 10:43 pm

Ergh. I didn’t even think about Greek, but it’s so heavily used in science that I can’t really get by without it. OK, so that’s another 60 glyphs or so… *groan*

You understand correctly about the byte length. “Bit” is short for Binary digIT, and binary is the number system that uses all zeros and ones. “12 bits” is short for “12 binary digits”.
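To make it concrete (a quick check in Python):

```python
# "12 binary digits" means 2**12 distinct values per byte, which is
# where the 4096-glyph figure comes from.
values = 2 ** 12
print(values)                   # 4096

# The 12-digit example from the post above, as a number:
example = int("011001001001", 2)
print(example)                  # 1609
```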

Combining visually similar glyphs into a single code point is a good idea, but there are fewer of them than you might think — certainly not enough to solve the Hanzi/Kanji problem, in any case.

The accent overlay thing is doable, and if it worked it would help a bit, but I think you’re getting confused between what the user types and what the computer stores for display. Normally the user would type something to indicate that the next (or previous, depending on the implementing engineers’ whim) letter he or she types is to have a certain accent. For example, on a Mac with a US keyboard layout, you type option+E to indicate the next letter should have an acute accent on it, then (for example) an “a” to produce “á”. You select which script system you’re entering by means of a menu item. That allows the 26 letter keys on your keyboard to produce any of several thousand different characters on screen — but the machine still needs to store a value capable of distinguishing between all those thousands of different characters.

In other words, there is a difference between what the user sees and what is stored internally. For example, if the above-referenced “a with acute” was being typed into a Unicode text file, it could be stored as either “a with acute” or the sequence (“a”, “combining acute accent”). However, for an 80’s-style text screen, there is a one-to-one correspondence between the storage code and the glyph displayed in the corresponding cell on screen.
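For comparison, modern Unicode really does allow both spellings and can convert between them; Python's standard unicodedata module shows the round trip:

```python
# Precomposed vs. decomposed storage of the same visible character.
import unicodedata

precomposed = "\u00E1"    # "a with acute" as one code point
decomposed = "a\u0301"    # "a" followed by combining acute accent

print(len(precomposed), len(decomposed))                        # 1 2
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

An 80s-style text screen has no such luxury: whichever of the two spellings is chosen is also exactly what the video hardware sees.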

There are three ways to do overlaying of accent shapes on the “base characters” that would be encoded:

1) Have a separate storage area, the same size as the main one, indicating which accent to overlay at each cell on the screen. Pro: Uses a separate repertoire for accents, thus saving that many code points in the main repertoire of “base character” glyphs. Cons: Not very flexible, and eats a lot of storage, especially if the screen is large. (My plan is to output an HDTV signal for video, so we're dealing with several thousand characters per text screen here.)

2) Just display each accent in the cell following the cell of the character it goes on, with the understanding that printed output would be less clunky than screen output. Pros: Simple and easy to implement. Cons: Confusing, ugly, isn’t WYSIWYG, and what do you do if the “base character” is at the far edge of the screen and thus has no following cell?

3) Make all accented characters be two display cells wide. The code in the first cell’s memory slot would indicate that the following code was to be interpreted as a special sub-code for (precomposed) accented glyphs. Pro: Effectively unlimited code space, can have all the glyphs you want. Cons: Any character with a two-byte code will be displayed smeared across two display cells. This is OK, even helpful, for Kanji and Hanzi, but looks reeeeeeally damn ugly for Latin characters like the “á” from the example above, especially on my proposed machine, where even a single cell would be 12 pixels wide. For comparison, a single 12-pixel cell is already twice as wide as your average lowercase Latin letter, at least for a 10–12 point type size.
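For what it’s worth, the storage cost that sinks option 1 is easy to estimate; a quick sketch, assuming a 1920x1080 picture (the exact HDTV mode is still undecided):

```python
# Rough memory cost of option 1's extra "accent plane", assuming a
# 1920x1080 HDTV signal and the 12x16-pixel cells described earlier.
SCREEN_W, SCREEN_H = 1920, 1080
CELL_W, CELL_H = 12, 16

cols = SCREEN_W // CELL_W       # 160 columns
rows = SCREEN_H // CELL_H       # 67 rows
cells = cols * rows             # 10,720 display cells per screen
print(cells)

# One 12-bit unit per cell for the main plane, plus the same again
# for the accent plane -- the scheme doubles screen memory.
main_plane_bytes = cells * 12 // 8
print(main_plane_bytes, main_plane_bytes * 2)
```

Over ten thousand cells per screen, and the accent plane doubles that storage whether or not a given line has any accents at all.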

I think I’m going to have to go with the variable-length code idea (number 3 above), which solves one problem and raises several others. Maybe I can fit all the precomposed accented characters in the first 4000 or so code points? Will need to research that.
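A minimal sketch of how the two-cell code in option 3 could work, written in Python for clarity (the escape threshold and the packing order are illustrative assumptions, not a settled design):

```python
# Minimal sketch of option 3's two-cell scheme. The half-range escape
# value and the packing order are illustrative assumptions only.
ESCAPE = 0x800                  # top half of the 12-bit range = lead unit

def encode(glyphs):
    """Turn glyph numbers into a list of 12-bit storage units."""
    units = []
    for g in glyphs:
        if g < ESCAPE:          # one-unit glyph
            units.append(g)
        else:                   # two-unit glyph: lead + trail
            g -= ESCAPE
            units.append(ESCAPE + g // 4096)
            units.append(g % 4096)
    return units

def decode(units):
    """Inverse of encode: recover glyph numbers from storage units."""
    glyphs, i = [], 0
    while i < len(units):
        u = units[i]
        if u < ESCAPE:
            glyphs.append(u)
            i += 1
        else:
            lead, trail = u - ESCAPE, units[i + 1]
            glyphs.append(ESCAPE + lead * 4096 + trail)
            i += 2
    return glyphs

# A mix of one- and two-unit glyphs survives the round trip; the
# largest encodable glyph number is 0x800 + 2048*4096 - 1 = 8,390,655.
sample = [0x041, 0x7FF, 0x800, 8390655]
assert decode(encode(sample)) == sample
print(len(encode(sample)))      # 6 units for 4 glyphs
```

On real hardware the same logic would live in the video circuitry, with the lead unit also flagging its cell as the left half of a double-width glyph.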
The world’s only gsteemso

