At the TONE, it will not be TUNE, but TANE

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

Since I initially posted about TUNE in "And if your language starts playing a different TUNE", which mentioned a meeting that would be happening in Tamil Nadu over the weekend, I figure I should post some follow-up information on what happened....

This Yahoo group has a thread or two with lots of summary info, excerpted below.

Due to a last-minute attempt to reach out to the Tamil community on the proposed Unicode encoding, a ONE day conference was organised by the Tamil Virtual University.

Here a summary
1. The name TUNE has been changed to TANE - TAmil New Encoding.
2. To make revision on the proposed code chart of TANE- Symbols and numerals
3. To obtain consensus from countries outside Tamil Nadu.
4. To seek support from software developers such as Microsoft and vendors.
5. To chart a migration time table from current encoding to TANE
6. TANE to be made available FREE for the community for easy access in and outside tamil nadu. ( the last tools offered free by GOI was hosted by a server from NOIDA and they were cumbersome) - Prof Balakrishnan comment.

There were few members of TU present at the event. Thiru Ramki, Naa Elangovan, Badri, Anbarasan, Logasundaram ( he is not in this list) members of KTS - Anto Peter and Ananth - KTS President and Sec respectively.

Here is an extract of email send by Elangovan of cadgraf to the Infitt.

To-days meeting was chaired by Dr.V.C.K and inaugurated by Mr.Dayanidhi Maran, Minister for IT and Communications. The Technical Committee meeting was headed by Dr.M.Anandhakrishnan. It was well attended by Microsoft, IBM, CDAC, IISC, IIT, Anna University, KTS, etc; The minister requested for a unanimous decision of all concerned for the most efficient and most appropriate 16bit encoding for Tamil as early as possible as the E-Governance project of the Indian Govt. may have to be rolled out from the beginning of 2007. There was good technical participation from all sectors. At the end, Dr.M.A. announced that there is a near consensus to proceed with the 16 bit encoding for Tamil which need not be the exact replica of TUNE. This will be called 16 bit Tamil encoding. However this will be publicised in all web sites to seek wider consultations world-wide and from all vendors for the next 3 months. The final encoding will be based on the feed backs and suggestions received. The official version of the recommendation will be submitted by Prof.MA to the TN Govt. for their further action

I am happy to see that now there is more time and scope for all technocrats all over the world to participate and offer their suggestions.

Regards
Elango

So, the name of the new encoding is now TANE rather than TUNE (isn't Tane the Māori god of birds? I may be misremembering something here!).

I heard from several sources there that "Unicode and MichKa were a Western monopoly," which is amusing to me given the fact that Unicode is 100% locked in with ISO 10646 and all - I wonder how the NBs in Japan and other countries would react to being called a Western monopoly.

It was also interesting to be named so directly and prominently in a meeting where a bunch of people don't like me very much. Thankfully several others pointed out the good work that INFITT WG02 has been doing that I have been helping with.

That's the secret -- if you are going to make enemies, be sure that you also make friends!

When it came down to an actual vote, only two voted to keep the current encoding; 43 voted to form a new 16-bit encoding.

And there are two interesting strategic points I'll put up without commenting on as I think they speak for themselves (I have heard them both many times before over the last few years):

  1. That Chinese could get 27000 characters when there govt. put there foot down; similarly Japanese got 500+ for a character encoding system. Similarly if GOI and TN Govt. put there pressure we will get this. If they don't give, we will use PUA and over few years it will become defacto standard.
  2. There is huge cost savings in E-Governance. TN Govt has to spend Rs.3000 crores for there E-Governance initiative where 1GB is needed for each citizen. Adopting TUNE over Unicode will result in saving of Rs.1500 Crores because of Storage savings alone.
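
For a sense of the sizes being argued about in the second point, here is a minimal sketch in Python. It compares the bytes needed for one Tamil word in UTF-16 and UTF-8 with a hypothetical encoding that spends one 16-bit unit per akshara. The sample word, the `count_aksharas` helper and the "two bytes per akshara" figure are all illustrative assumptions, not anything taken from the TUNE/TANE charts, and a single word says nothing about the ratio for real documents.

```python
import unicodedata


def count_aksharas(text: str) -> int:
    """Rough akshara count: each code point that is not a combining
    mark (general category M*) is treated as starting a new unit."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# Sample word (assumption): TA, MA + VOWEL SIGN I, LLLA + VIRAMA
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

code_points = len(word)
utf16_bytes = len(word.encode("utf-16-le"))
utf8_bytes = len(word.encode("utf-8"))
aksharas = count_aksharas(word)
hypothetical_bytes = 2 * aksharas  # assumption: one 16-bit unit per akshara

print(code_points, utf16_bytes, utf8_bytes, aksharas, hypothetical_bytes)
```

The helper is deliberately crude: it ignores conjuncts such as the KSSA set and any non-Tamil text mixed into the data, both of which would change the numbers.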

Well, I suppose we are all living in interesting times. It will be interesting to see what happens next.

 

This post brought to you by ௧ (U+0be7, a.k.a. TAMIL DIGIT ONE)

Blog - Comment List
  • "That Chinese could get 27000 characters when there govt. put there foot down"

    What these people may not realise is that in recent years the Chinese government has tried very hard to get nearly a thousand precomposed Tibetan "brdarten" syllables encoded into ISO/IEC 10646 (see N2558, N2621 and N2624), in order to change the encoding model of Tibetan (this is exactly analogous to the Tamil situation); but Unicode and other national bodies stood firm, and they failed. The Chinese government has since been forced to implement their alternative syllabic encoding model in the PUA on Planes 0 and 15 (actually it is more complicated than that, as the government specifies two implementation levels -- Level 1 supporting the PUA precomposed syllables only, and Level 2 supporting PUA precomposed syllables and standard combining Tibetan). I believe that the Tibetan case provides a strong precedent for not accepting the TUNE/TANE re-encoding of Tamil.
  • I don't understand their complaint anyway. I think it's a GOOD thing to have your script considered "complex". Just look at the cool stuff you can do with Segoe Script when English is being considered "complex"!

    I've followed some of the "discussion" on the unicode list, and I gotta agree - they all seem to have their hands on their ears and saying "la la la"
  • Unicode is very much structured as a glyph-additive system, so far as I can tell. That is, the glyph produced by <(base character) (combining character)> is very similar graphically to superimposing the combining character glyph(s) on the base character glyph, with some adjustments for spacing. Glyph addition - where each glyph (or a couple of glyphs each side of a consonant for some Indic vowels) adds (or subtracts, in case of virama/pulla! - but this is still an added symbol, not removing part of a glyph) a recognisable sound or modifies the sound in a known way - works for many scripts.

    This model does not work for Far East scripts where the same glyphs are used by multiple languages but pronounced differently, hence why the CJK Hanzi characters are simply listed as 'CJK UNIFIED IDEOGRAPH' in the catalogue - there's no 'sound' or concept that can be listed as there is for other scripts, making the diagrammatic reference chart the sole source for which character is which. The 27,000 characters are not solely for Chinese but for Japanese and Korean as well (possibly other languages that use Hanzi too) and you'll hear plenty of complaints from all three groups that they should never have been unified in this way. You can't use components of each glyph as building blocks to make a more complex glyph because the components don't have any real identity or value of their own.

    The only thing then to explain is why the Latin alphabet is pretty consistently encoded with precomposed characters. The answer basically is history. Western and Far East encodings have an awful lot of history. ISO 646 goes way back (as ANSI X3.4 in 1968), the regional variants (mapping parts of the 7-bit set to precomposed modified base characters) date back to 1972. The ISO 8859 series started life in ECMA - I can't find a date for the first edition, but the second edition of ECMA 94 (which became parts 1 to 4 of ISO 8859) is dated 1986.

    The preamble to ECMA 94 states that the reason for including the precomposed characters was because modified characters were typically encoded as <(base character) (Backspace) (modifier)> which causes really horrible processing problems. Ironically one of the standards listed in the preamble, ISO 6937-2, is considered 'difficult to use for processing as some graphic characters are represented by one and others by two [byte] combinations'. Here, however, the modifier preceded the base character and was therefore a bit odd.

    It does feel a bit weird to be 'restricting' a script to a smaller number of characters when Unicode offers such a large range of code points, but there are simply so many scripts to encode that it pays, when not having to offer compatibility with an older standard, to be conservative with allocations rather than going overboard with largesse and running out of code points prematurely.

    The idea of using the PUA invokes Raymond's common question, 'what if everyone did this?' If everyone did this, it would be impossible to write documents, or encode fonts, containing two scripts that both used the PUA. The word Private in Private Use Area means that it's for the end-user's private use, governments certainly should not be trying to use this area.
  • Huge cost savings of about 1500 crores (by having 512 slots in PUA) due to reduced storage space and each citizen requiring about 1 GB space - laughable and ridiculous.

    Can they explain the basis for these numbers? Can someone tell them the decrease in the storage costs over these years? These government officials think that people are fools.
  • Hi Baskaran,

    I don't think this sort of issue is on the government officials as much as consultants who are paid to do a review and who cast the results in a particular light in order to please those who commissioned the study. If you know what I mean.

    Though every version of the work of which I have seen the methodology has had some rather severe technical flaws in it....
  • Just because Storage Costs are becoming cheaper, it does not mean that I should use 50% more space to store my data.
    Just because CPU speeds are increasing, it does not mean that I should inefficiently process my data.
    Processing speeds and storage space are constantly being increased, but the fact remains that at any given point in time, the present Tamil Unicode encoding will make sure that one uses 50% more space and is 50% less efficient in data processing. The latter is more critical as far as I am concerned, as it is very important for real time processing.
  • Well, in truth the statistics given as part of the "TUNE proof" are very suspect, and the methodology of the testing was never published so that no one can reproduce it....

    Not to mention the issues that Venkatarangan raises here, which give some of the actual results of what such an encoding would do to Tamil if it ever were approved (which it cannot be, due to the violation of Unicode stability policies involving the re-encoding of scripts).
  • 16-bit Tamil in the BMP (outside the PUA) might just be possible, even now.  It could even be consistent with current Unicode - the characters would canonically decompose to their current Unicode encodings.  These new codes would be scattered through the BMP, though.  It would be interesting to see what this, plus support for the old encoding, would do to the processing time advantages claimed for encoding aksharas.  (I would allow the timing tests to use only the new encoding, provided they gave the same results as using the old encoding.)

    I appreciate the new encoding would not be allowed in NFC or NFD, but I am not a fan of compulsory normalisation, especially to something as quirky as NFC.  (NFC gets really quirky when a character has two accents.)
  • Hi Richard,

    You do understand that this is not an opinion shared by the UTC, right? Since re-encoding Tamil is a violation of the stability guidelines?
  • I've understood the argument against scrapping the current encoding.  But what stability guideline is breached by adding TAMIL LETTER K with a *decomposition* to <TAMIL LETTER KA, TAMIL SIGN VIRAMA>?  One would naturally add it to the composition exclusions.  Thus one would be adding decompositions, not compositions, and these additions would be consistent with the stability guarantees on normalisation.  We could also add TAMIL LETTER TANE KA, with a compatibility decomposition to TAMIL LETTER KA, if it were important for processing speed that all the aksharas in k- (excluding the KSSA set) have regularly related numerical values.

    Adding such characters in the BMP is as close as Unicode can come to adopting KANE.  This is an option that should be explored.

    The simple data storage solution, of course, is SCSU, though that is incompatible with KANE encoded in the BMP.  Of course, an extended SCSU would be very relevant if KANE principles were accepted as above, but with the new code points in a supplementary plane rather than the BMP.

    Extended SCSU would have 32768 character windows as well as 128 character windows, primarily for handling large character sets such as those of Egyptian and Tangut.  However, only Doug Ewell and I showed any interest in it.

    Richard.
  • If you do not see the problems that adding alternate duplicate encodings add to the stability, security, usability, and overall implementation of any script in Unicode, let alone the violation of the linguistic principles of the language and script, all due to an unproven and inaccurate set of claims about efficiency that have been refuted by experts both inside and outside of Unicode and inside and outside of Tamil Nadu, then I am not sure that I will be able to help you see it....

    And that is ignoring the fact that this re-encoding IS a violation of principles that the UTC has officially stated -- Unicode will NOT re-encode scripts. Period.

    But I care too much about Unicode as a standard and Tamil as a language to injure either one in this way. I will be a part of any constructive solution, but what you are suggesting is a destructive one, even if it is less destructive than either TUNE or a TANE within Unicode.
  • By 'the problems that adding alternate duplicate encodings add to the stability, security, usability, and overall implementation of any script in Unicode', are you saying that the principle of canonical equivalence does not cure the problems, e.g. that the typically three or five encodings of a Latin vowel with two diacritics (once as one character, once or twice as two characters and once or twice as three characters) still present problems for Unicode-based systems?

    Where has the claim that encoding all the Tamil aksharas as 16 bits speeds grammatical analysis been refuted?

    Encoding Tamil aksharas appears to be in accord with at least some native perceptions of Tamil, and would be justified by the precedent of the Ethiopic script.  It's non-Tamil languages in the Tamil script that may make encoding solely as aksharas unviable.

    Personally I don't like the idea of assigning codepoints to the Tamil aksharas, and putting them in the BMP strikes me as greed as well as bloat.
  • Canonical equivalence does not help if the principal agent of canonical equivalence (normalization) cannot be used to equalize the different text that is canonically equivalent.

    And re-encoding scripts is still not allowed by Unicode.

    When you speak to people in Tamil Nadu, they understand the technical problems with their proposal and it becomes clear that this is more of a political step than a technical one.

    But as to the refutations, I do not know of published ones offhand, but I do know that all technical reviews of the "proofs" have pointed out major flaws in the mechanisms used to prove the points that call their validity into serious question. Enough so that the proofs are no longer available for review?

    I'll post more on Tamil in the future, though I will likely be a bit more constructive about it since if talking about Tamil is journalism, talking about TUNE/TANE is like muckraking....
  • The non-standard NFCM (defined below) is a perfectly adequate normalisation for performing text comparison, and indeed NFC and NFD would also both work just as normal.  By NFCM I mean NFC modified by ignoring the entries in the composition exclusions.  It's a true normalisation, not a folding like NFKC and NFKD, but it is not stable.

    The only re-encoding that would go on would be if there needed to be a fixed relationship between the encoding of the graphically minimal aksharas and the others.  And that would be the introduction of canonically equivalent forms, much as one might suggest filling in the gaps in the superscript digits and mathematical letters so that minor mark-up could be converted to characters by calculation rather than by look-up.  Apart from that, what is proposed is simply the addition of precomposed characters.  (Unicode policy on that is, 'No.  Just wait for a better renderer to come along.  Don't bloat Unicode.')

    Does the 'No Re-encoding' rule apply to scripts between Mumbai and Hong Kong?  The meaning of <U+1000, U+1039, U+101B> is changing - one will have to replace the sequence by <U+1000, U+103C>.  The ubiquitous (in its script) <U+1039, U+200C> will have to be replaced by <U+103A>.  The New Tai Lue consonants will be re-encoded for use with the old vowel system and the full set of subscript consonants (N3121 - but the decision in principle predates the approval of the New Tai Lue script.)  The effective removal of Vietnamese U+0340 and U+0341 by making them canonically equivalent to U+0300 and U+0301 might also be dubbed re-encoding.

    I see the lack of any improvement in processing speed as sufficient argument against bloating Unicode with Tamil aksharas.
  • The standards that need stable and predictable results cannot use the non-existent "NFCM" which is not a part of normalization. And since it cannot be used and since normalization is defined by Unicode, it is not an answer.

    Actually, most of what you are saying is either incorrect or misleading (mathematical letters have intrinsically different properties than letters and are not letters in any real sense, as an example, and the suggested Myanmar changes are an incredibly controversial issue that involves sequences that CANNOT be encoded properly in the current model, which does not apply to Tamil or any of the TUNE/TANE arguments).

    And because of this, the whole comment is misleading and confusing in the context of TUNE/TANE. I'd rather avoid misleading people if possible, which is why keeping us on the actual topic here would make more sense than bringing up less than relevant examples of exceptional cases that do not establish precedents for Tamil to use.

    Now as to processing speed, I also disagree with you.

    English processing for many operations could be made significantly faster if we re-ordered the letters used in English to intersperse the uppercase and lowercase letters. That is NOT an argument to re-order ASCII that anyone in their right mind would accept. And it does not apply to Tamil either, even if proof did exist. It is an attempt to use an out of scope problem as an argument for an improper change.

    The same applies to the need to capture the dozens of phonemes that exist in English by encoding characters rather than relying on the five that exist in the alphabet. Encoding such characters would make spellcheckers, thesauri, and such MUCH more efficient. Does this mean that we need to encode all those new characters, due to Unicode's vicious refusal to allow efficient processing in these tools? No, it does not -- for the same reason as the Tamil change is out of scope.

    The fact that the Tamil arguments are all unproven (and that the original flawed proofs were withdrawn) is just a bonus in showing the underlying actions of those proposing the change; the fact is that the entire attempt is out of order.
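
Several of the comments above lean on canonical equivalence, normalization and the composition exclusions. As a minimal sketch of how those behave today, the following uses Python's standard `unicodedata` module. The sample strings are my own choices for illustration (a Latin vowel with two diacritics, Tamil KA with vowel sign O, and Devanagari QA as an existing composition-excluded character); none of them come from the TUNE/TANE proposals.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize to NFC (canonical decomposition, then composition)."""
    return unicodedata.normalize("NFC", s)


# Latin: a with circumflex and acute, spelled as one, two or three characters
latin_forms = ["\u1ea5", "\u00e2\u0301", "a\u0302\u0301"]
latin_nfc = {nfc(s) for s in latin_forms}

# Tamil: KA + VOWEL SIGN O versus KA + VOWEL SIGN E + VOWEL SIGN AA
tamil_forms = ["\u0b95\u0bca", "\u0b95\u0bc6\u0bbe"]
tamil_nfc = {nfc(s) for s in tamil_forms}

# Composition exclusion: DEVANAGARI LETTER QA has a canonical decomposition
# (KA + NUKTA) but is excluded from composition, so NFC leaves it decomposed
qa = "\u0958"
qa_nfc = nfc(qa)

print(len(latin_nfc), len(tamil_nfc), [hex(ord(c)) for c in qa_nfc])
```

Each set collapses to a single string, which is the sense in which canonically equivalent spellings compare equal after normalization. The last case shows what the composition exclusions mean in practice: a precomposed character on that list never survives NFC, which is the behaviour the "new encoding would not be allowed in NFC or NFD" remark refers to.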