
This article is part of a series about implementing Unicode support for std::string:

  1. Unicode for std::string
  2. Fast Lookup of Unicode Properties

The implementation of my unicode_algorithm library had barely started when I encountered a problem I had not considered at all.

Unicode code points have a lot of properties for things like the character type, uppercase/lowercase mappings, bidirectional mirroring, and so on. The data is freely available and it would be easy to simply dump it into an array indexed by code point number (in the following called the naive array). However, code points have a valid range of 0x0 to 0x10FFFF for a total of 1,114,112 code points. That array is huge!
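
For illustration, this is roughly what such a naive table would look like. This is just a sketch; character_properties stands in here for whatever set of per-code-point data is actually stored:

#include <array>

struct character_properties { /* category, case mappings, mirroring, ... */ };

// One entry per possible code point: 0x110000 = 1,114,112 entries.
// Even at only a few bytes per entry this runs into megabytes.
std::array<character_properties, 0x110000> naive_table;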

Granted, currently only about a tenth of that range is assigned. Leaving out the unassigned code points would shrink the array significantly, but would also require checking whether a code point is assigned or not. Properties are accessed constantly when working with Unicode text. Even small speed disadvantages can impact overall performance significantly. All in all, assignment checks – or any conditionals in general – are something to be avoided.

Concrete documentation about how that problem is solved in practice is hard to find. The Unicode specification gives a few hints about multistage tables, but does not go into specifics. It took me quite a while, but with an article from strchr.com, lots of digging through source code and a bit of experimenting I finally figured out what this multistage table thing is all about.

Killing Redundancy

Unicode property data is highly redundant, which can be exploited to implement a data compression scheme. How much you can save depends on which properties you store and how exactly you implement the compression and lookup. Roughly a 99% size reduction seems to be the norm in practice, incredible as that may sound at first glance.

Imagine the naive array mentioned above. From it we are going to create three linked arrays in such a way that a lot of the redundancy disappears and lookup does not get more complex than a few array lookups and bit shifts. That keeps memory consumption low and lookup fast.

  1. Unicode tends to group similar characters together. That leads to long runs of code points with identical property data. If we divide the naive array into fixed-size blocks, a lot of those blocks are identical. This leads to the first two of the three arrays. The second one contains the blocks, eliminating all duplicates. The first one indexes into the second one to get to the correct block for each code point.
  2. The second array still duplicates a lot of data. We introduce a third array holding the actual property data, again eliminating duplicates. Then we change the second array to contain indexes into the third one. Because indexes are significantly smaller than the actual data in all but the most trivial cases, we save a lot of space.

Essentially we now have a three-stage system that starts with a particular code point number. A part of that number is used as the index into stage 1. Then stage 1, together with the rest of the code point number, indexes into stage 2, which indexes into stage 3, which contains the actual property data.
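
As a toy illustration – with a block size of 4 instead of 256 and two made-up property values 'a' and 'b' – a naive array of a, a, a, a, a, a, a, a, a, a, b, a collapses into the following three arrays. The values and sizes are invented purely for demonstration:

#include <array>

constexpr std::array<char, 2> stage3{'a', 'b'};          // unique property values
constexpr std::array<int, 8>  stage2{0,0,0,0, 0,0,1,0};  // the two unique blocks, as stage 3 indexes
constexpr std::array<int, 3>  stage1{0, 0, 4};           // block start offsets; blocks 0 and 1 were identical

// Looking up "code point" 10: 10 / 4 = 2 selects stage1[2] = 4, the offset
// within the block is 10 % 4 = 2, so stage2[4 + 2] = 1 and stage3[1] = 'b'.
static_assert(stage3[stage2[stage1[10 / 4] + 10 % 4]] == 'b');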

This is still a high level overview glossing over a few finer points, for example the exact role of the code point number. To explain these we’ll take a closer look at a real implementation.

3-Stage Lookup Implementation

Basically all the professional implementations work the same – at least the ones I looked at: CPython, the D standard library, utf8proc. The following describes my own implementation for unicode_algorithm, which follows the same tried and true approach.

Even within that approach there is some wiggle room to fine-tune the tradeoff between size and performance. I’ll probably discuss that in more detail in a future article. Here is a quick overview:

  • Three stages.
  • 16/8 bits (2/1 bytes) split of the code point number, resulting in a block size of 256.
  • Stage 1 contains stage 2 indexes instead of block numbers.

Any Unicode code point can be represented by three bytes because the highest possible code point number is 0x10FFFF. In my implementation the two most significant bytes – the first four digits in normal human-readable hexadecimal notation – are used as the index into the stage 1 array. The remaining least significant byte is a part of the index into the stage 2 array. Where the code point number is split also determines the block size in stage 2. One byte as the partial stage 2 index gives us 256 possible values and a block size of 256.
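
In code that split is just a shift and a mask. For code point 0x1F601 (used again in the lookup example further down) it yields 0x01F6 as the stage 1 index and 0x01 as the offset within the block:

#include <cstdint>

std::uint32_t cpoint = 0x1F601;
std::uint32_t stage1_index = cpoint >> 8;   // two most significant bytes: 0x01F6
std::uint32_t block_offset = cpoint & 0xFF; // least significant byte: 0x01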

For efficiency the blocks are stored consecutively in a simple array (stage 2). Still, the boundaries between blocks are real boundaries. In no situation are blocks allowed to overlap.

Building the Stages

This is a general description of the algorithm used to create the multistage tables.

Start with three empty arrays for the three stages and an empty current block. Then iterate through the complete code point range. For each code point do the following (a code sketch follows the list):

  1. Assemble the set of property data for the code point. If that exact set is not in stage 3 already, append it. Remember its index in stage 3.
  2. Append the stage 3 index to the current block.
  3. If the current block is now full:
    1. If the block is not in stage 2 already, append it. Remember the stage 2 index of the block start.
    2. Append the stage 2 index to stage 1.
    3. Make the current block an empty block again.
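
Below is a minimal C++ sketch of that procedure. My actual generator is a Python script, and get_properties() is a hypothetical stand-in for reading the property data from the Unicode character database:

#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::size_t block_size = 256;

struct character_properties { int boundary = 0; /* more properties ... */ };

// Ordering so the struct can be used as a std::map key during the build.
bool operator<(character_properties a, character_properties b) { return a.boundary < b.boundary; }

// Hypothetical stand-in for the Unicode character database lookup.
character_properties get_properties(char32_t cp);

void build_tables(std::vector<std::uint16_t>& stage1,
                  std::vector<std::uint8_t>& stage2,
                  std::vector<character_properties>& stage3)
{
    std::map<character_properties, std::uint8_t> stage3_indexes;     // property set -> stage 3 index (uint8_t works while there are < 256 unique sets)
    std::map<std::vector<std::uint8_t>, std::uint16_t> block_starts; // block -> start index in stage 2
    std::vector<std::uint8_t> block;

    for (char32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        // 1. Find or append the property set in stage 3.
        auto props = get_properties(cp);
        auto [prop_it, new_props] =
            stage3_indexes.try_emplace(props, static_cast<std::uint8_t>(stage3.size()));
        if (new_props)
            stage3.push_back(props);

        // 2. Append the stage 3 index to the current block.
        block.push_back(prop_it->second);

        // 3. On a full block: deduplicate it and record its start index in stage 1.
        if (block.size() == block_size) {
            auto [block_it, new_block] =
                block_starts.try_emplace(block, static_cast<std::uint16_t>(stage2.size()));
            if (new_block)
                stage2.insert(stage2.end(), block.begin(), block.end());
            stage1.push_back(block_it->second);
            block.clear();
        }
    }
}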

When all code points are processed, stage 1 has exactly 4352 indexes, from 0x0 to 0x10FF. Or in other words: if we chop off the least significant byte of each code point number, stage 1 has an index for each of the resulting numbers. The lengths of stages 2 and 3 are a bit more complicated. They depend on the version of the Unicode standard we work with and on the exact set of properties we need to store for text handling purposes. The length of stage 2 is the number of unique blocks times the block size; the 20992 entries in the code below correspond to 82 unique blocks of 256 indexes each. Stage 3 has as many entries as there are unique sets of property data.

For my unicode_algorithm library I have a preliminary C++ implementation of the three arrays, generated by a Python script. The code looks like the following. Because the boundary class is the only property I need at the moment, stage 3 is a bit rudimentary. That is going to change in the future along with the lengths of stages 2 and 3.

constexpr const std::array<std::uint16_t, 4352> stage1{
    0, 256, 256, 512, 768,
    // a lot more numbers ...
    256, 256
};

constexpr const std::array<std::uint8_t, 20992> stage2{
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
    // a lot more numbers ...
    0, 0
};

constexpr const std::array<character_properties, 14> stage3{{
    {boundary_class::control},
    // ...
    {boundary_class::regional_indicator},
}};

Looking up Properties

The data structures are complete now. Here is an example of the complete lookup procedure, using code point 0x1F601, the grinning face with smiling eyes character. The example assumes that block 10 in stage 2 is the correct one, and that the actual property data is located at index 1 in stage 3.

Step 1: Take the code point’s two most significant bytes and use them as the index into stage 1. The number there tells us that the correct block in stage 2 starts at index 2560.

Step 2: The remaining byte of the code point number is the offset into the block. Add the number from stage 1 to it. That gives us the correct stage 2 index.

Step 3: The number from stage 2 tells us the correct index in stage 3. Go there to find the actual property data.

In C++-ish notation the lookup would look like this:

uint32_t cpoint = 0x1F601;
Properties props =
    stage3[
        stage2[
            stage1[cpoint >> 8] + (cpoint & 0xFF)
        ]
    ];

The right shift (cpoint >> 8) chops off the least significant byte from the code point number. The bitwise and (cpoint & 0xFF) extracts the least significant byte for the stage 2 indexing. The parentheses around the mask are necessary because & binds less tightly than + in C++.
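
Wrapped into a small helper the whole lookup is one line. The function name here is made up and not part of any actual interface:

inline character_properties look_up(std::uint32_t cpoint) noexcept
{
    return stage3[stage2[stage1[cpoint >> 8] + (cpoint & 0xFF)]];
}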
