Friday, February 23, 2018

[gmiuenaz] Base 30

Yet another partial attempt (another attempt) at creating something like MIME quoted-printable: a way of encoding any Unicode code point using only printable 7-bit ASCII characters while still keeping English text vaguely readable (so not, say, Base64 applied to UTF-8).  If there's anything I've learned, it is that there are many ways to do this.

We want the letters a-z to keep plain text readable; let's choose lowercase.  We want space, though maybe we want to use underbar or period as space (similar to URL encoding using plus as space).  We want an escape character, say semicolon (;), chosen because it is on the home row on QWERTY keyboards.  Add parentheses, because matched delimiters are useful if we want to encode hierarchical structure.  This yields 30 characters: 26 letters, space, semicolon, and the two parentheses.  We could add more, but for simplicity start here.  Something similar was tried previously.

Similar ideas: if the escape character were backslash (\), text would look like C strings or LaTeX.  If the escape character were ampersand (&), text would look like HTML.

Characters that are not the lowercase letters or space are encoded in two possible ways:

  1. Long Parenthesized escape sequences ;(something)
  2. Short escape sequences ;lowercase

Let's choose uppercase to have short escape sequences, specifically the escape character followed by the letter repeated: ;aa ;bb...  Numerals should be similar, though there's a choice between (a=0 ... j=9) and (z=0, a=1 ... i=9).  The former sorts better; the latter follows models like Greek and Hebrew, which have used letters as digits, though perhaps not in place-value systems.  (Or we could just add 0-9 to the 30 characters.)  Let's choose ;z=0, ;a=1, ;b=2 ... ;i=9.  Numerals get the second shortest escape sequences.

Short escape sequences in general start with semicolon followed by a sequence of lowercase letters.  How should a short escape sequence end?  There are several possibilities: a space, the escape character again, or the sequence of lowercase letters itself encodes its endpoint.  The last is what UTF-8 (very roughly) does: the high bit in each byte indicates we're still in an escape sequence.  More generally, the escape sequence could traverse a Huffman-tree-like structure wherein we know we've hit a leaf because we know the structure of the tree.

Variations possible: if we had more than 1 escape character, then different escape characters could be the root of different trees.  UTF-8 could be interpreted as having 128 escape characters.

Let's choose short escape sequences to end with either a space or the escape character again, with the latter also signifying the start of another escape sequence.  This keeps things like digit sequences compact.  It does introduce a few awkwardnesses: a character requiring an escape sequence followed by a character not requiring one will awkwardly have a space in it: camelCase = camel;cc ase.  A character requiring an escape followed by a space awkwardly needs to be encoded with two spaces after it: the capital letter ;aa  has decimal ;aa;ss;cc;ii;ii  value ;f;e .
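The rules chosen so far can be sketched in Python.  This is only a rough sketch under assumptions not made above: the names encode, decode, and DIGIT_LETTERS are made up for illustration, the end of the input is assumed to terminate a short escape sequence just as a space would, and any character that would need a long escape sequence is rejected because those have not been assigned yet.

```python
import string

# z=0, a=1 ... i=9: index d gives the letter used for digit d.
DIGIT_LETTERS = "zabcdefghi"
PLAIN = set(string.ascii_lowercase + " ")

def encode(text):
    out = []
    pending = False  # True while the last item emitted is an unterminated short escape
    for ch in text:
        if ch in PLAIN:
            if pending:
                out.append(" ")  # a space ends the short escape sequence
            out.append(ch)
            pending = False
        elif ch in string.ascii_uppercase:
            out.append(";" + ch.lower() * 2)  # the next ';' (if any) ends the previous escape
            pending = True
        elif ch in string.digits:
            out.append(";" + DIGIT_LETTERS[int(ch)])
            pending = True
        else:
            raise ValueError("needs a long escape sequence (not assigned yet): %r" % ch)
    return "".join(out)

def unescape(name):
    if len(name) == 2 and name[0] == name[1]:
        return name[0].upper()
    if len(name) == 1 and name in DIGIT_LETTERS:
        return str(DIGIT_LETTERS.index(name))
    raise ValueError("unassigned short escape sequence: ;" + name)

def decode(encoded):
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch in PLAIN:
            out.append(ch)
            i += 1
        elif ch == ";":
            j = i + 1
            while j < len(encoded) and encoded[j] in string.ascii_lowercase:
                j += 1
            out.append(unescape(encoded[i + 1:j]))
            # A terminating space is consumed; a ';' is left to start the next escape.
            if j < len(encoded) and encoded[j] == " ":
                j += 1
            i = j
        else:
            raise ValueError("unexpected character: %r" % ch)
    return "".join(out)

print(encode("camelCase"))   # camel;cc ase
print(encode("ASCII 65"))    # ;aa;ss;cc;ii;ii  ;f;e
```

Note that the semicolon followed directly by a space decodes to an error here, since that sequence is left unassigned.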

Short escape sequences are reminiscent of HTML entity references, though the latter use characters beyond a-z, like uppercase.  They also explicitly mark the end of a reference with a special character, semicolon.

There is no difficulty in finding the end of a Long Parenthesized escape sequence so long as we enforce that parentheses match.
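A minimal sketch of finding that end, assuming every parenthesis inside a long escape sequence is structural (the function name long_escape_end is made up for illustration):

```python
def long_escape_end(text, start):
    """Return the index just past the ')' that closes the long escape starting at text[start]."""
    if text[start:start + 2] != ";(":
        raise ValueError("no long escape sequence starts here")
    depth = 0
    for i in range(start + 1, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unmatched parenthesis")

text = "x;(u (a b) c)y"
end = long_escape_end(text, 1)
print(text[1:end])  # ;(u (a b) c)
```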

The escape sequence of semicolon immediately followed by a space is uniquely the shortest escape sequence.  Let's keep it unassigned; maybe it gets used by the UI to signify something special like leaving typing mode (similar to vi), or the user can assign it.

At this point, all that remains is to assign escape sequences to all the rest of the Unicode code points.  Long escape sequences permit many schemata, each preceded by a schema identifier.  One possible schema is a formula to convert back and forth between a Unicode code point number and a digit sequence in base 26.  It's not strictly a number in base 26 because leading "zeroes" (the letter a?) matter.  Use the formula for the sum of a finite number of terms of a geometric series to count how many shorter sequences come before sequences of a given length.  Unfortunately, some sequences will spell out vulgar words; limiting the set of letters would avoid many of them.
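One way such a formula might look, as a sketch only: list all letter sequences by length and then alphabetically, and number them from 0.  The assumptions here are mine: the empty sequence is skipped, a plays the role of zero, and the names offset, to_letters, and from_letters are invented.  The schema identifier and the surrounding ;( ) are left out.

```python
import string

LETTERS = string.ascii_lowercase
BASE = len(LETTERS)  # 26

def offset(length):
    """Count of sequences with lengths 1..length-1: 26 + 26^2 + ... (geometric series sum)."""
    return (BASE ** length - BASE) // (BASE - 1)

def to_letters(n):
    if n < 0:
        raise ValueError("code point must be non-negative")
    length = 1
    while offset(length + 1) <= n:
        length += 1
    rest = n - offset(length)  # position among the sequences of this length
    digits = []
    for _ in range(length):    # fixed width, so leading a's are kept
        rest, d = divmod(rest, BASE)
        digits.append(LETTERS[d])
    return "".join(reversed(digits))

def from_letters(s):
    if not s or any(c not in LETTERS for c in s):
        raise ValueError("expected a nonempty sequence of lowercase letters")
    value = 0
    for c in s:
        value = value * BASE + LETTERS.index(c)
    return offset(len(s)) + value

print(to_letters(0), to_letters(25), to_letters(26), to_letters(ord("A")))  # a z aa bn
```

With this numbering every code point up to U+10FFFF fits in at most five letters.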

Long parenthesized escape sequences also intriguingly permit a rich Lisp-like language with different combinators for expressing how a complex Unicode character is put together from components.  This requires that the structure of complex Unicode characters be broken down into their component parts.  This probably already exists.  Previously, thoughts on Japanese and investigation into Korean (Hangul).
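One breakdown that does exist is Unicode canonical decomposition, available in Python's standard unicodedata module.  The sketch below uses it to list the components of a character; the "compose" combinator and the use of lowercased Unicode names are purely hypothetical, invented here for illustration and not a proposed syntax.

```python
import unicodedata

def components(ch):
    """Return the Unicode names of the canonical (NFD) components of one character."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

def as_expression(ch):
    # Hypothetical Lisp-like rendering; "compose" is a made-up combinator name.
    parts = " ".join("(" + name.lower() + ")" for name in components(ch))
    return "(compose " + parts + ")"

print(components("\u00e9"))     # e with acute accent
print(as_expression("\ud55c"))  # a Hangul syllable, split into its jamo
```

Names containing hyphens or digits would themselves need escaping, so this only hints at what a component language could look like.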
