I am a retrocomputing hobbyist who has gotten sidetracked from designing a homebrew CPU into questions of what writing systems the finished machine ought to be able to handle. I am using a 12-bit byte size (why? ’coz it’s fun!) and a text-screen mode modelled loosely after those of the Commodore 128, which means I can display any mixture of up to 4096 glyphs, in 12×16-pixel display cells. Please note that these code points specify glyphs, not characters: a single Arabic letter would eat as many as four of them for positional variants, and even more with vowel diacritics added.
Here’s the catch: with the intention of someday learning as many of the associated languages as possible, I’d like the machine to be able to handle a minimal but useable subset of any of the writing systems commonly found in the Seattle, WA / Vancouver, BC area of North America. That list contains at least eleven different writing systems, maybe more (see the list below); and since three of them are Japanese and Traditional and Simplified Chinese, fitting a useable subset into only 4096 glyphs seems problematic at best.
I had been thinking of having a second glyph set of the same size, selectable by flipping a flag in a register somewhere. That would allow me to choose from among 8192 glyphs to put on screen at any given time, albeit in a rather clumsy manner. Still, it works for the Commodore 128. Alternatively, it occurs to me that I could do something with a variable-length code and double-width characters for the ones that take up two bytes. I think that’s the technique they used in Asian releases of early-’80s microcomputers, but with 12-bit bytes that makes for a potentially enormous amount of ROM needed to store the screen font. In the extreme case, if every glyph took two bytes (24 bits), I’d have 16,777,216 code points, roughly fifteen times as many as Unicode’s 1,114,112, which is plainly absurd.
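To put rough numbers on the variable-length idea, here is a minimal Python sketch of one way it could work. Everything specific in it is an assumption for illustration rather than a settled design: the choice of 0xFF0 to 0xFFF as "lead" bytes, the names decode and font_rom_bytes, a 1-bit-per-pixel font, and double-width glyphs occupying two adjacent 12×16 cells.

```python
BYTE_BITS = 12                     # the machine's byte size
BYTE_MASK = (1 << BYTE_BITS) - 1   # 0xFFF
CELL_ROWS = 16                     # a 12x16 cell is 16 rows of one 12-bit byte each
LEAD_START = 0xFF0                 # ASSUMPTION: 0xFF0..0xFFF announce a two-byte code


def decode(stream):
    """Yield (code, width_in_cells) for each glyph in a sequence of 12-bit bytes."""
    it = iter(stream)
    for b in it:
        if not 0 <= b <= BYTE_MASK:
            raise ValueError(f"not a 12-bit byte: {b!r}")
        if b < LEAD_START:
            yield b, 1                      # ordinary single-width glyph
            continue
        trail = next(it, None)              # lead byte: fetch its trail byte
        if trail is None or not 0 <= trail <= BYTE_MASK:
            raise ValueError("truncated or invalid two-byte sequence")
        yield (b << BYTE_BITS) | trail, 2   # double-width glyph with a 24-bit code


def font_rom_bytes(lead_bytes):
    """Font ROM size in 12-bit bytes when `lead_bytes` values act as lead bytes."""
    single = (BYTE_MASK + 1 - lead_bytes) * CELL_ROWS        # 16 bytes per glyph
    double = lead_bytes * (BYTE_MASK + 1) * CELL_ROWS * 2    # 32 bytes per glyph
    return single + double


print(list(decode([0x041, 0xFF0, 0x123, 0x042])))
for n in (0, 16, 4096):
    print(n, "lead bytes ->", font_rom_bytes(n), "12-bit bytes of font ROM")
```

Under those assumptions, the plain 4096-glyph font needs 65,536 12-bit bytes (96 KiB in 8-bit terms). Setting aside just 16 lead bytes adds 65,536 double-width glyphs and pushes the font to about 2.2 million 12-bit bytes, while the extreme case of every byte being a lead byte needs 536,870,912.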
Alphabetical list of what I’d like to cram into 4096 code points:
• Arabic, for Persian (among others?) — will need a lot of extra glyphs due to positional variation of letterforms
• Cyrillic, for Russian (among others?)
• Devanagari, for Hindi — a beautiful writing system but very complex; how did the Indians represent it in the 80s?
• Gurmukhi, for Punjabi
• Japanese (at least the Kana) — Japanese newspapers stick to a list of the 1945 most common Kanji; would those be enough?
• Korean (combining Jamo only, similar to those in Unicode; if the Koreans don’t really need Hanja, I don’t either)
• Latin, for Tagalog, Vietnamese, and various European and First Nations languages — many, many accented forms
• Simplified and Traditional Chinese (Bopomofo might also be good, or at least less visually jarring than Pinyin, to fill in for missing Hanzi) — each said to need ≈3000–6000 Hanzi depending on the target audience, albeit with considerable overlap (but how much?)
• Thai — complex, but not nearly as bad as Arabic or Devanagari
• And, of course, several dozen punctuation characters, mathematical signs, and so on.
• I’d also like to include a subset of Canadian Syllabics (at least big enough to write Inuktitut), for no other reason than my own curiosity.
• Technically, Cambodian has more right to be on this list than some of the ones already discussed, but I have no interest in learning it. Why? Darned if I know. Perhaps it is the similarity in some ways to Thai? Likewise, Cantonese speakers outnumber Mandarin (Han) speakers in the city of Vancouver by two to one, but as of now, I don’t intend to learn more than one of the two languages; it will probably be Mandarin, as I’d apparently need it to understand most written Chinese. Cantonese apparently needs over a thousand extra Hanzi to write, too.
All of the above seems to add up to only about 1500–3500 glyphs, until you get to the several types of Hanzi and the Kanji. Even if a substantial portion can be “unified” as Unicode does (which is questionable, given that there will only be one display font), that there is the budget-buster. Has anyone got any advice?
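As a back-of-envelope check on that budget, here is a small Python sketch using only the estimates quoted above. The function names are made up for illustration, and the "no unification" case assumes that Simplified Chinese, Traditional Chinese and the Kanji share no glyphs at all, which overstates the demand by however much they really overlap.

```python
GLYPHS_PER_BANK = 4096       # 2**12 code points per glyph set
NON_HAN = (1500, 3500)       # estimated glyphs for everything except Hanzi/Kanji
HANZI_EACH = (3000, 6000)    # said to be needed for each of Simplified and Traditional
KANJI = 1945                 # the newspaper Kanji list


def room_for_han(banks):
    """Return (worst, best) glyph slots left over for Hanzi/Kanji."""
    total = banks * GLYPHS_PER_BANK
    return total - NON_HAN[1], total - NON_HAN[0]


def han_demand_without_unification():
    """Return (low, high) demand if the three Han-based sets share nothing."""
    return 2 * HANZI_EACH[0] + KANJI, 2 * HANZI_EACH[1] + KANJI


for banks in (1, 2):
    print(banks, "bank(s): room for", room_for_han(banks), "Han glyphs")
print("demand with no sharing:", han_demand_without_unification())
```

On these figures, a single 4096-glyph set leaves between 596 and 2,596 slots, which falls short of even the low 3,000 estimate for one variety of Chinese. Two sets leave 4,692 to 6,692 slots, against a demand of 7,945 to 13,945 if nothing is shared, so the answer hinges on how much the three sets overlap.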
The world’s only gsteemso