Thursday, September 04, 2014

[babbryvn] Decimal line lengths

To (hopefully) accelerate the parsing of text, prefix each string (probably each line) with a decimal number indicating its length in bytes (Unicode might make this tricky, since byte counts and character counts differ).  However, the length itself can be from 1 to about 16 digits long, so prefix that length with the length of the length.  Finally, the length of the length can be 1 or 2 digits long, so add a one-digit prefix giving the length of the length of the length.  In the examples below, commas separate the three numbers and a colon precedes the string.  For redundancy, each string should be followed by a string terminator, i.e., a newline.

Write a program to do this; this should be easy.
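
A minimal sketch of the encoder in Python, assuming the layout seen in the examples (comma-separated numbers, a colon, the string, then a newline).  The function name encode is made up for illustration, and the choice to count bytes rather than characters is an assumption.

```python
def encode(payload: bytes) -> bytes:
    """Frame payload as <len of len of len>,<len of len>,<len>:<payload>\\n.

    The comma and colon separators are taken from the examples in the
    post; counting bytes (not characters) is an assumption.
    """
    length = str(len(payload))                 # e.g. "10"
    len_of_len = str(len(length))              # e.g. "2"
    len_of_len_of_len = str(len(len_of_len))   # e.g. "1"
    if len(len_of_len_of_len) != 1:
        # The first field is defined to be exactly one digit wide.
        raise OverflowError("string too long for a one-digit first field")
    header = f"{len_of_len_of_len},{len_of_len},{length}:"
    return header.encode("ascii") + payload + b"\n"
```

For text, encode to bytes first, e.g. encode("foo".encode("utf-8")), so that the length counts bytes.  With that convention, encode(b"dictionary") gives b"1,2,10:dictionary\n".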

Some examples:

Zero-character string
1,1,0:

3-character string
1,1,3:foo

10-character string
1,2,10:dictionary

Billion-character string
2,10,1000000000:<...>
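
Reading such records back might look like the following Python sketch.  It assumes the comma, colon, and newline layout of the examples above; the names decode and _expect are hypothetical.  The point is that every field width is known before the field is read, so the parser never scans for a delimiter.

```python
def _expect(buf: bytes, pos: int, byte: bytes) -> int:
    """Check that buf has the given single byte at pos; return pos + 1."""
    if buf[pos:pos + 1] != byte:
        raise ValueError(f"expected {byte!r} at offset {pos}")
    return pos + 1


def decode(buf: bytes, pos: int = 0):
    """Decode one record starting at pos; return (payload, next_pos)."""
    n3 = int(buf[pos:pos + 1])        # one digit: width of the next field
    pos = _expect(buf, pos + 1, b",")
    n2 = int(buf[pos:pos + n3])       # width of the length field
    pos = _expect(buf, pos + n3, b",")
    n1 = int(buf[pos:pos + n2])       # payload length in bytes
    pos = _expect(buf, pos + n2, b":")
    payload = buf[pos:pos + n1]
    if len(payload) != n1:
        raise ValueError("truncated payload")
    # The trailing newline is redundant, so it doubles as a sanity check.
    pos = _expect(buf, pos + n1, b"\n")
    return payload, pos
```

Because the payload is read by length, it may itself contain newlines: decode(b"1,1,3:a\nb\n") returns (b"a\nb", 10).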

The scheme will overflow for strings longer than about 10^1000000000 bytes, because the first field would then need two digits.  Surprisingly, such lengths are not that unreasonable for some hypothetical dynamically generated, randomly accessed data.
10,1000000000,<billion digit number denoting the length>:<very long string of length 10^billion>
(Approximately, the list of all billion-digit numbers).

If we were beings who enjoyed numbers in base 27, then two lengths would suffice up to about 27^27, which is approximately 1.3 * 2^128 and should be enough for everyone.
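
The base-27 arithmetic is easy to check with Python's arbitrary-precision integers.  This is only a sanity check of the comparison between 27^27 and 2^128; the variable names are illustrative.

```python
import math

# 27^27 equals 3^81; compare it against 2^128 exactly, then as a float.
big = 27 ** 27
ratio = big / 2 ** 128      # a little over 1.3
bits = math.log2(big)       # a little over 128
print(ratio, bits)
```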
