.HTML "Modern Unicode Requires Modern Solutions"
.TL
Modern Unicode Requires Modern Solutions
.AU
Jacob Moody
.AB
Plan 9 has a long history with Unicode: its authors were
directly involved in the creation and first system-wide
implementation of UTF-8. This should come as no surprise given the
focus on plain text that permeated UNIX and carried
over into Plan 9. However, the world around Unicode has changed much
since its early versions, becoming more complex and detail
oriented as the version number climbs higher. While Plan 9 from Bell
Labs was never updated to any major version of Unicode beyond the first,
9front has made incremental progress over the years in adopting newer
standards. We aim to evaluate the state of Unicode support, and of
natural language handling at large, within the
current 9front system: both where the system is now and where we
would like it to go.
.AE
.EQ
delim $$
.EN
.SH
Background
.PP
.FS
First presented in a slightly different form at the 10th International Workshop on Plan 9.
.FE
.LP
Computers do not handle abstract notions such as natural
language particularly well, so we rely on standards for representing natural
language as a sequence of numbers. The mapping of
numbers to natural language characters is known as a character
encoding. While there have been many character encodings
invented and used over the years, the one that has seen the most
widespread adoption today is Unicode. Unicode is unique among
character encodings in that it aims to provide a
single encoding for all natural languages, done in such a way that a
single document can contain text from any natural
language without out-of-band code page switching or the like (as was
common before Unicode).
.LP
Given Unicode's goal of serving every natural language, and the
vast differences between natural languages, the standard must define
its terms carefully[Glossary].
To ensure some clarity, let us get some of these definitions out in front:
.TS
l l
^ ^
l l
l l
l l
^ ^.
Abstract Character	A unit of information used for the organization, control,
	or representation of textual data.
Codepoint	Any value within the Unicode codespace. In keeping with
	traditional Plan 9 terminology, we use rune as a synonym for codepoint.
Encoded Character	A codepoint which has been mapped to an abstract character.
Grapheme	A minimally distinctive unit of writing in the context of a particular
	writing system.
Glyph	The actual, concrete image of a glyph representation having been
	rasterized or otherwise imaged onto some display surface.
.TE
.LP
With that said, let us define how these terms relate. Broadly,
abstract characters are the pieces of information we wish to encode:
everything from graphemes, to grammar
markings, to components of a letter, to informational control
characters. Unicode gives us a large bit space to work with; a
codepoint is a single index into that space, and the standard
gives us the mapping relating these codepoints to abstract
characters. A large majority (but not all) of encoded characters
represent graphemes, but graphemes are just one possible category
that an encoded character can fall into. Unicode only really defines
encoded characters; their rendering and presentation
is the responsibility of the rendering layer, which takes
encoded characters as input and outputs glyph images.
.LP
The code space of Unicode is currently 21 bits, and with computing
hardware mostly rounding that up to 32 bits, storing large amounts of
text as raw codepoints can take a non-trivial amount of space. For
this reason the Unicode standard also defines a few more compact
encoding schemes, the most popular of which is
.CW UTF-8 ,
which encodes a codepoint in a variable number of bytes, from one to four.
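.LP
As a brief illustration, the following minimal sketch uses the
standard rune routines from the C library
.CW ( runetochar ,
.CW chartorune ,
and
.CW runelen )
to encode a single codepoint as between one and four bytes of
.CW UTF-8
and decode it back again:
.P1
#include <u.h>
#include <libc.h>

int
roundtrip(Rune r)
{
	char buf[UTFmax];
	Rune back;
	int n;

	n = runetochar(buf, &r);	/* encode: 1 to 4 bytes of UTF-8 */
	chartorune(&back, buf);		/* decode them again */
	return n == runelen(r) && back == r;
}
.P2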
.LP
Unicode also allows the same abstract character to be represented
either by a single codepoint or by a sequence of codepoints. A large
chunk of these characters are those which have been identified as
being composed out of two or more subcomponents. The most common
example of this is characters with diacritical markings, which
can be represented either as a single codepoint for the "precomposed"
variant or as a series of codepoints, with the base codepoint first,
followed by codepoints which denote the individual diacritical markings.
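.LP
For example, the character é can be written either as a single
precomposed codepoint or as a base letter followed by a combining
mark:
.P1
U+00E9           LATIN SMALL LETTER E WITH ACUTE
U+0065 U+0301    LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
.P2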
.LP
Because there can be multiple different representations of the
same abstract character, a method is needed to normalize a series of
codepoints into a consistent representation for semantic comparison.
The Unicode standard[UAX#15] defines a process for conducting this
normalization, and provides four different forms to normalize to:
.TS
l l.
NFC	Precomposed codepoints
NFD	Decomposed codepoints
NFKC	Precomposed compatibility codepoints
NFKD	Decomposed compatibility codepoints
.TE
.LP
The compatibility forms additionally convert specially
formatted characters, such as codepoints for specific fonts,
rotated variants, and so on, into their base
counterparts. This operation loses some information about the
original codepoints and is thus kept separate from canonical normalization. There
are some additional rules that are part of normalization so as to
enable checking the semantic equivalence of any two sequences of codepoints.
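.LP
For example, U+2468 (CIRCLED DIGIT NINE) is left unchanged by the
canonical forms but is replaced by the plain digit under the
compatibility forms:
.P1
NFC, NFD     U+2468          (unchanged)
NFKC, NFKD   U+0039  "9"     (compatibility mapping <circle> 0039)
.P2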
.SH
Existing Unicode in Plan 9
.LP
Plan 9 was an early adopter of Unicode, but never reached full
compliance for any standard version of Unicode during its history.
Pike et al. put in a lot of work to handle UTF-8
consistently within the system[Pike93], and a majority of the system was
updated to handle codepoints. Plan 9's initial support could
be described succinctly as "21-bit ASCII encoded as \f(CWUTF-8\fR",
where each codepoint is assumed to be exactly one abstract
character and interfaces broadly assume that mutations to text (e.g.
backspace in a text editor and the like) operate only on single code
points as well. The result is that the system can only really
handle precomposed Unicode in any sort of ergonomic way; however, Plan
9 is very consistent in this level of support and has been ever since its
last official release.
.LP
Plan 9 had initially only been written to support Unicode version 1.0
(16-bit code space), so one of the earliest contributions from the community
was to update this to the modern 21-bit code space. This work was first
done in 9atom[9atom], and later in 9front. This included
work to make parts of the system Unicode aware that previously
were not; the version of awk which ships with the system is a notable
example. There were also some smaller changes, such as updates to the
input "compose" sequences to allow a wider range of codepoints and to
let users input a codepoint directly by typing out its
value in hexadecimal. However, even with that work,
portions of 9front have yet to be updated past
Unicode version 1.0.
.SH
Using Tables to Keep Pace
.LP
The work being presented in this paper largely started with wanting to
update some of the existing interfaces within the system, namely the
.CW isalpharune (2)
set of functions, in order not only to bring them up to date
but also to allow them to adapt to future versions of the standard.
To that end, the Unicode Consortium publishes a
series of informational databases[UAX#44] in an awk-friendly format. Plan 9 users
may be familiar with one of these data files as the original
.CW /lib/unicode
was a cut-down and reformatted version of the Unicode
Character Database (UnicodeData.txt). These files are updated and
maintained as Unicode versions change, so the natural course of
action was to read these files as input and generate code that
could then be used to implement whatever functions we may need.
.LP
To better illustrate the input data, below is an excerpt of UnicodeData.txt
for U+2468. Each line of UnicodeData.txt corresponds to a single codepoint,
with various properties of said codepoint separated by semicolons. Both
the keys and values for each field are documented by the Unicode standard[UAX#44].
.P1
; look -b 16 2468 /lib/ucd/UnicodeData.txt 
2468;CIRCLED DIGIT NINE;No;0;ON;<circle> 0039;;9;9;N;;;;;
.P2
.LP
The parsing of these text input files is largely unremarkable: we read
one line at a time and record properties and conversions as we
go. The challenge comes in picking and generating a
reasonable data structure for the function implementations to use,
balancing the size of the data structure in memory against the
runtime cost of walking it. Plan 9 binaries
are currently quite small in comparison to the hardware available, so
we decided to err on the side of increasing binary size
in exchange for better performance.
Because of this we decided to implement direct lookup
tables, using codepoints as an index and giving an O(1) lookup time
for the vast majority of keys.
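.LP
For contrast, the naive form of such a direct table is a single array
indexed by the codepoint itself; the array and macro names below are
illustrative only:
.P1
#include <u.h>

static uchar props[1<<21];	/* one byte per codepoint: 2MB for a single property */

#define property(r)	(props[r])	/* O(1), but the table is enormous */
.P2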
.LP
With that said, multiple arrays each with
$2 sup 21$
entries are quite
untenable even given the potential performance savings. To mitigate
this we can use the distribution of abstract characters
over the Unicode code space to our advantage. The code space is
divided into chunks, with each natural language roughly
contained within its own contiguous range of codepoints. This
means that for many of the properties we care about there are hotspots
and coldspots throughout the code space. As an example, you could
consider the
.CW toupperrune ()
function, which is largely only applicable
to natural languages that feature distinct letter cases. Many
east Asian languages have no concept of casing, so those regions
of the Unicode code space need no defined values. This makes
the lookup table quite sparse, and therefore a good candidate
for some method of compression. This is not unique to letter
casing, and in fact all of the lookup tables for
.CW isalpharune (2)
end up being sparse tables.
.LP
To compress the lookup table, we use a data structure that [Gillam02]
refers to as a compact array. We start by breaking the lookup
table into equally sized segments, then walk those segments and find
any overlap (full or partial) between them. Overlapping entries
can then be removed from the table. To access these
overlapped segments we need a higher-level lookup table which
points into the segments of the lower table. To use a
single key (a codepoint in our case) for indexing into both tables,
we split the bits of the key into two portions: the
top bits index into the top table, and the result is added to
the bottom bits to index into the bottom table.
.LP
Given how large the Unicode code space is, it is beneficial
to do not just one layer of compaction but two, where the higher
lookup table is itself compacted with the same method, resulting in a new
top-level lookup table. This splits the codepoint into three parts:
the result of the top-level table is used to index the
middle table, which in turn is used to index the bottom table, which
contains our desired value. The following code illustrates an example
of this lookup method:
.P1
#define upperindex1(x) (((x)>>(4+7))&0x3FF)
#define upperindex2(x) (((x)>>4)&0x7F)
#define upperoffset(x) ((x)&0xF)
#define upperlkup(x) (_upperdata[_upperidx2[_upperidx1[upperindex1(x)] \e
	+ upperindex2(x)] + upperoffset(x)])
.P2
In this example, the top ten bits of the codepoint index the top-level
table; the result is combined with the next 7 bits of the
codepoint, and that result is added to the remaining 4
bits of the codepoint. How many bits to use for each section is
not obvious, so we brute forced the smallest table size by
testing all possible splits.
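.LP
The search itself is simple enough to sketch; here
.CW tablesize ()
is a hypothetical helper that builds the three tables for a given
split of the 21-bit key and reports their combined size:
.P1
void
bestsplit(int *top, int *mid, int *low)
{
	int t, m;
	long sz, best;

	best = 0;
	for(t = 1; t < 21; t++)
	for(m = 1; t+m < 21; m++){
		sz = tablesize(t, m, 21-t-m);
		if(best == 0 || sz < best){
			best = sz;
			*top = t;
			*mid = m;
			*low = 21-t-m;
		}
	}
}
.P2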
.LP
Of course in order for this to be effective we need to ensure that the
values of our lookup table are sparse. For simple property values this
is typically already the case, and for functions like
.CW toupperrune ()
we can instead store a delta between the current rune and the desired
rune, which gives us nice runs of zero values for anything without a
defined transformation. Using this method we've gotten the following
results for the
.CW isalpharune (2)
set of functions:
.TS
l l
l n.
Table	Size (bytes)
toupper	12436
tolower	11754
totitle	12180
is*rune	24184
total	60554
.TE
.LP
The
.CW is*rune ()
table here merges all of the different categories into
one table as a bit field, one bit per possible category, so the result
is less sparse than the other tables. We find the expense of 60KB to
be well worth the ease of upgrade and the
constant-time lookup speed.
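.LP
To make the delta encoding described above concrete, the following is
a minimal sketch of how
.CW toupperrune ()
might sit on top of the generated tables; the
.CW upperlkup ()
macro is the one shown earlier, and the convention that entries hold a
signed delta (zero when no mapping exists) is an assumption about the
generated data:
.P1
Rune
toupperrune(Rune r)
{
	if(r > Runemax)
		return r;
	return r + upperlkup(r);	/* a delta of 0 maps r to itself */
}
.P2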
.SH
Keeping Composure
.LP
Recall that normalization is the process of transforming a series of
Unicode codepoints into a canonical variant for a given target format.
In order to do this, we'll need tables that map individual codepoints
to their transformed values for each format, their composed or decomposed
variants. We can use tables largely similar to the ones described above, with
some extras where needed to handle more specific edge cases.
.LP
For decomposition our lookup key is the same as before, the codepoint
of the rune we wish to decompose. However the value of the table gets
a bit more complex. The most straightforward approach would be to
combine the two new runes as one value, but this is fairly painful
for the table size. We'd need a 42-bit value, which would have to be
rounded up to 64. Not only are we wasting over a third of the storage
space due to the rounding, but using 64-bit values would be quite
onerous for our 32-bit systems. As it turns out, however, there is a
heavy bias for decomposed runes to reside within the lower 16-bit
code space. In the latest version of Unicode at the time of
writing (16.0) there are only 158 runes whose decompositions reach
above the 16-bit code space. It is easiest then to have our
compact array use 32-bit values (two 16-bit codepoints) and
a separate exceptions table which stores those 158 decompositions.
The base table values for these exception runes are instead placed
within a small portion of the Unicode private use space (0xEEEE on up),
with the offset from the start of that range being used to index the list of exceptions.
This allows us to maintain our constant time lookups, even when
accessing members of the exception range.
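.LP
A sketch of what such a decomposition lookup might look like follows;
the names
.CW ( decomplkup (),
.CW Excbase ,
and
.CW excdecomp )
and the exact encoding of the exception indices are our assumptions,
and the handling of singleton decompositions is omitted:
.P1
enum { Excbase = 0xEEEE };		/* start of the private-space markers */

static Rune excdecomp[158][2];		/* decompositions above the 16-bit space */

int
decomposerune(Rune *dst, Rune r)
{
	uint v;

	v = decomplkup(r);		/* compact array: two packed 16-bit runes */
	if(v == 0)
		return 0;		/* no decomposition */
	if(v >= Excbase && v < Excbase+158){
		/* exceptional entry: index the exception list */
		dst[0] = excdecomp[v - Excbase][0];
		dst[1] = excdecomp[v - Excbase][1];
	}else{
		dst[0] = v >> 16;
		dst[1] = v & 0xFFFF;
	}
	return 2;
}
.P2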
.LP
Recomposition required a different approach under constraints similar
to those of decomposition. Instead of packing two runes into the output of
the table, we need to pack two runes into the input of our
lookup and produce one rune as output. The bias towards the 16-bit
code space remains, so we can limit our main lookup table to taking
two 16-bit runes as input and outputting a single 16-bit rune.
For exceptions (where either the input runes or the output rune exceed 16 bits)
we use a similar exception table; however, in this case we have to search it
linearly, because otherwise we would need to increase the table size to accommodate
the larger lookup keys. These differences from decomposition make it much harder to reuse
our existing lookup structure, so we instead opted to implement a hash
table.
.LP
A hash table has a worse average lookup time than
our compact arrays (assuming a moderate number of collisions), but stays within the
same ballpark of performance and size. Because
we generate this table at build time rather than at run time, we have
to embed the entire structure in a contiguous block
of memory. This is easy for the initial buckets,
which form one large chunk of memory: we hash the
input key, then take it modulo the number of buckets.
For collisions, however, we need some way of chaining
entries so that they can be searched until a key matches. For this we
create another array that is appended onto the first.
The main entries are chained to the
collisions by embedding a numeric offset from the start of the
collision space. This results in a hash table entry that looks like the
following:
.P1
struct KV {
	uint key;	/* the two 16-bit input codepoints packed together */
	ushort val;	/* the composed codepoint */
	ushort next;	/* offset into the collision space, if chained */
};
.P2
In the interest of saving as much space as possible for the generated
array, we can compact this struct into a series of paired integers, the
first being the merger of the two 16-bit codepoints, and the second
being the merger of the value and optionally the offset to the collision space.
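.LP
An illustrative search over such a flattened table, using the
uncompacted
.CW KV
form above for clarity; the convention that a
.CW next
of zero ends a chain (with offsets stored one-based) is our
assumption, and a real generator hashes the key before taking the
modulus:
.P1
static ushort
lookup(KV *tab, int nbucket, uint key)
{
	KV *kv;

	kv = &tab[key % nbucket];		/* initial bucket */
	for(;;){
		if(kv->key == key)
			return kv->val;
		if(kv->next == 0)
			return 0;		/* no entry for this pair */
		kv = &tab[nbucket + kv->next - 1];	/* follow the chain */
	}
}
.P2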
.LP
As with choosing the constants for compact arrays, we found an
optimal size for the original hash array versus the collision array
largely through brute force. Given the Unicode standard at the time
of writing, we are using 512 initial buckets, which forces 555 entries
into the collision array. With a hash function that provides a
reasonably even distribution, this should average out to roughly one
collision per bucket. Using these two data structures results in
the following generated table sizes:
.TS
l l
l n.
Table	Size (bytes)
decomp	20108
recomp	8536
.TE
.SH
Normalizing Theory
.LP
With composition handled we can then start to tackle normalizing a
string to one form or the other. The implementation of Unicode
normalization can be quite finicky and there are plenty of
considerations for how the input must be fed to result in conformant
output. However before we can get into those, we need some additional
tables that will be used in the normalization process.
.LP
The first additional table we need is one that stores the
.I "Canonical Combining Class"
.CW (ccc)
of a codepoint, that is, the value which identifies
the attaching property of the character. A non-zero value
signifies that the codepoint is an attacher, with the specific value
denoting the group of attachers it belongs to.
Typically only one attacher out of each group may be considered
for attachment to a base codepoint.
The ccc value is also used to ensure the ordering of attaching
codepoints is consistent in
.CW NFD
formatted strings[UAX#15].
The other value we need is the Quick Check (qc)
value of a codepoint, which is used to determine where a normalization
context ends in certain situations. We'll elaborate on this further
in the following section. Our existing compact
array structure works well enough for these lookup tables as is,
resulting in the following sizes:
.TS
l l
l n.
Table	Size (bytes)
ccc	9738
qc	7878
.TE
.LP
The abstract logic for normalization is described in [UAX#15] and is
as follows:
.P1
1. Decompose every rune possible in the string until it
   can no longer be decomposed
2. For codepoints without a ccc of zero, conduct a
   stable sort of them based on their ccc value
3. If we are normalizing to NFC, recursively recompose code
   points in the string until there is nothing more to recompose
.P2
.LP
The main challenge for normalization then is designing an algorithm
that can do all of this on a single pass of the string. To accomplish
this, we iterate through a string and accumulate runes into a context.
A context conceptually aims to localize the effect of normalization, breaking
the string down into chunks such that each chunk can be processed independently.
This allows very large strings to be processed piecewise and within
reasonable memory boundaries. We implemented the context by using a stack,
which is cleared and reused for multiple contexts as we parse through the string.
The process by which we construct the first part of our context is as follows:
.P1
1. Take the next rune from the string and push it onto our stack
2. If the following rune has a ccc not equal to zero,
   push it on the stack. Repeat until we encounter a rune with a
   ccc of zero.
3. Perform a stable sort of the stack by the ccc value of the codepoints
.P2
.LP
At this point, if we are normalizing to a decomposed form (NFD), our context
is finished and may be flushed; we then repeat the process for the next
rune in the string.
.LP
If we are recomposing, we must now classify which type of context we
have and split into one of two recomposition strategies accordingly.
Strategy A is used for sequences of codepoints
which have attachers (ccc of non-zero), and strategy B for sequences
of composable base runes (ccc of zero).
.LP
.P1
ccc of non zero/Strategy A
1.  Use the base codepoint as the left hand side input to our
    recomposition
2.  Use the next immediate codepoint as the right hand side input
    to our recomposition.
2a. If a match is found, replace the base rune with the new result
    and discard the right hand side codepoint. The new codepoint will
    be used as the base codepoint for further iterations.
2b. If a match isn't found, skip through all other codepoints
    in the stack that have the same ccc as the failed match.
3.  Return to step 1 and continue to attempt matches with
    remaining codepoints in the stack until the stack is depleted.
.P2
.LP
.P1
ccc of zero/Strategy B
1.  Continue to pull codepoints on to the stack from the input
    string until we reach a boundary, as described by the
    Quick Check value.
2.  Walk through each codepoint in the stack, attempting to
    compose it with the codepoint immediately to the right of it.
2a. If a match is found, replace the left hand side codepoint with
    the result, and discard the right hand side codepoint.
2b. If you don't find a match, continue through the stack with
    the new left hand side being the current right hand side.
3.  If any recomposition was made, go back to step 1 and repeat until
    we make a pass without any recompositions.
.P2
.LP
After either of these two strategies the resulting stack can be flushed out
as normalized output. The rule that we bypass other runes with the
same ccc as a failed match (step 2b in strategy A) is stipulated by the standard[UAX#15]
and is necessary for a conformant implementation.
.LP
There is an edge case in normalization: the standard itself
puts no limit on the number of attaching codepoints that may be given to a
base codepoint. This can be seen in what is typically
referred to as "zalgo text", where a single base codepoint is given
a large number of attachers to produce a "glitchy" appearance.
While there is no natural language which
relies on a large number of attaching codepoints, it remains an edge
case our algorithm must come to terms with. In order to perform this
normalization within a bounded memory constraint we need to pick some
reasonable maximum number of attaching codepoints to support.
Unicode defines a subset of normalized text referred to as
"Stream-Safe Text"[UAX#15], which dictates that no more than 30 attaching code
points may be given to a base codepoint, and for sequences larger than that,
attaching codepoints can be broken up with
.CW U+034F
(Combining Grapheme Joiner). We implement this recommendation, which means our
algorithm will insert a CGJ into a text stream if we encounter large
enough runs. This is required if we want to output properly normalized
text, as without it the ccc context would grow to a length that would
be untenable to sort. The CGJ acts as a new base, and codepoints
which attach to it can be sorted independently of those on the base
codepoint. The rendering layer should be able to interpret the
attaching codepoints on the CGJ as part of the previous base code
point.
.SH
Normalization Interface
.LP
With the general algorithm for normalization figured out we need to
decide on an interface for exposing this to other programs. Due to the
nature of normalization this can be somewhat complicated to do well.
As illustrated earlier, context expansion in both types of context
continues until we find the beginning of the next context. In strategy
A it is until we find a ccc of zero, and in strategy B it is until we
find a qc boundary. This means that for most normalization we need to
know when the input ends in order to flush the last context, since
there is no next context to act as our terminator. This is relatively
easy if we are operating on an entire string at once: we walk
until we reach the end of the input string and use that as the signal to
flush whatever remaining context we may be holding on to. If we are
dealing with a stream of input, from a file, pipe, or network connection,
then we may wish to normalize one chunk at a time, and we must carry
the remaining context of one chunk over to the beginning of the next
chunk. Other implementations handle this by providing a
concatenation function, which carefully joins one normalized string
to another while re-examining only the edges. We decided to
instead implement only a streaming interface, which can be used for
both concatenation and streaming input. While pushing two strings
that had been independently normalized back through the streaming interface
is not as efficient as only looking at the boundaries, we think that
the normalization implementation itself should be sufficiently fast to
have a negligible performance impact in a majority of cases.
.LP
The following streaming interface is what we settled on:
.P1
typedef struct Norm Norm;
struct Norm {
	... /* private */
};
extern void norminit(Norm*, int comp, void *ctx, long (*get)(void*));
extern long normpull(Norm*, Rune *dst, long sz, int flush);
.P2
.LP
In addition we provide the following wrapper functions, which target
normalization of smaller, fixed inputs:
.P1
extern long runecomp(Rune *dst, long ndst, Rune *src, long nsrc);
extern long runedecomp(Rune *dst, long ndst, Rune *src, long nsrc);
extern long utfcomp(char *dst, long ndst, char *src, long nsrc);
extern long utfdecomp(char *dst, long ndst, char *src, long nsrc);
.P2
.LP
With the streaming interface, the
.CW flush
flag in
.CW normpull ()
specifies whether the remaining context should be flushed or preserved once we reach an EOF from the
.CW get ()
function. This allows callers to normalize fixed chunks
at a time, then call with the flush argument set to non-zero when
there is no more input to give. The
.CW ctx
argument is a caller-supplied pointer which is passed on to the
.CW get ()
function. If
.CW comp
is non-zero the string is composed, meaning the output is
.CW NFC
instead of
.CW NFD .
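.LP
As a usage sketch, the following program normalizes standard input to
.CW NFC
with the streaming interface. The assumptions that
.CW get ()
returns one rune per call (and a negative value at EOF), that
.CW normpull ()
returns the number of runes placed in
.CW dst ,
and that the declarations above are available to the program are ours:
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

static long
getrune(void *ctx)
{
	return Bgetrune(ctx);	/* next rune from the input, or Beof */
}

void
main(void)
{
	Biobuf bin, bout;
	Norm n;
	Rune buf[256];
	long nr, i;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);
	norminit(&n, 1, &bin, getrune);	/* comp=1: compose, i.e. NFC */
	/* normalize chunk by chunk, preserving any trailing context */
	while((nr = normpull(&n, buf, nelem(buf), 0)) > 0)
		for(i = 0; i < nr; i++)
			Bputrune(&bout, buf[i]);
	/* input exhausted: flush whatever context remains */
	while((nr = normpull(&n, buf, nelem(buf), 1)) > 0)
		for(i = 0; i < nr; i++)
			Bputrune(&bout, buf[i]);
	Bterm(&bout);
	exits(nil);
}
.P2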
.LP
The Unicode Consortium publishes an extensive set of tests[UAX#44] to ensure
that normalization implementations deal with all of the edge
cases which show up in realistic input. We verified
our implementation against this series of tests, and added support
for normalized strings to
.CW tcs (1).
.SH
Preparation and Future Work
.LP
While normalization itself is not on the critical path to
better Unicode conformance, the infrastructure
needed to implement it provides the foundation for more conformant
interfaces.
For this paper we want to focus on what we think is
the next step, which is a more conformant interface for Plan 9
programs which use
.CW frame (2).
This would be
.CW rio (1),
.CW acme (1),
and
.CW samterm (1).
.LP
As implemented today, these programs assume that each codepoint is its
own abstract character and extra attaching codepoints are treated as
individual, separate characters. Instead of attaching combining
characters to their base codepoint, rio, sam, and friends render
them on their own, resulting in a dangling diacritic in the best case
and a stray PJW in the worst. This leaves decomposed text
largely unusable within these interfaces. If we would like these
programs to handle such text better, we need interfaces that allow them
to find logical breaks within the text. Unicode defines[UAX#29] two
types of breaks: grapheme breaks and word breaks.
.LP
Grapheme breaks are the boundaries between sequences of codepoints
that a user would consider to be one logical abstract character. The codepoints
between these breaks (referred to as a grapheme cluster)
should be treated as one abstract character by editors.
For example, an editor may choose to interpret a backspace to be the
deletion of codepoints up until the previous grapheme break. Likewise
an editor should not put a line break within a grapheme cluster.
Another consideration may be whether
.CW sam (1)'s
structured regular expression should be changed to interpret grapheme
clusters as one logical character or as independent codepoints.
.LP
Word breaks are similar to grapheme breaks, but are used exclusively
for finding where in a series of codepoints an interface may choose to
place a line break. This prevents cases like placing a line break
halfway through a series of numerals, or in other places which might
confuse a native speaker reading the text. Of course it is not
always possible to abide by these rules, as the width that an editor is
working with may be shorter than the full context of a word break
(consider a file full of 1000 1's), so at best it serves as an
optimistic placement. While nice to have for text editors, we
believe that a more important application lies in typesetting documents.
.LP
The algorithms for finding these breaks are less involved than
normalization, and we were able to again use our compact arrays to
store the required lookup information for codepoints. As part of this
work we implemented the following set of functions for finding grapheme
and word breaks within an input string:
.P1
Rune*  runegbreak(Rune *s);
Rune*  runewbreak(Rune *s);
char*  utfgbreak(char *s);
char*  utfwbreak(char *s);
.P2
The return values of these functions point to the beginning of the
next context.
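.LP
As a small example of how an editor might use these, the following
sketch counts the grapheme clusters in a UTF-8 string; we assume
.CW utfgbreak ()
always advances past at least one rune and returns a pointer to the
terminating NUL at the end of the string:
.P1
int
countclusters(char *s)
{
	int n;

	for(n = 0; *s != 0; n++)
		s = utfgbreak(s);	/* skip one whole cluster */
	return n;
}
.P2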
.LP
Of course these functions are only the start of implementing proper
support in our editors and interfaces. Future work may look to
reimplement
.CW frame (2)
to make use of these functions for implementing editors that work
well in languages other than English. We have so far also
largely ignored the rendering portions of these interfaces. That is
another important step required for conformance, but the rules for
rendering extend beyond what is specified in Unicode and are best
saved for a future paper which explores the full complexity of that
problem space.
.SH
Acknowledgements
.LP
Special thanks to Ori Bernstein, cinap_lenrek, and the rest of the
9front community for their support in designing and thinking about
this problem space. The work as presented would not have been
possible without their advice and review.
.SH
References
.LP
[Gillam02]
Richard Gillam,
``Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard'',
.IR "Addison-Wesley Professional" ,
January,
2002
.LP
[Glossary]
Unicode Inc,
Glossary of Unicode Terms,
https://www.unicode.org/glossary/
.LP
[Pike93]
Rob Pike, Ken Thompson,
``Hello World or Καλημέρα κόσμε or \f(Jpこんにちは 世界\fP'',
January,
1993
.LP
[UAX#15]
Unicode Inc,
Unicode Normalization Forms,
https://www.unicode.org/reports/tr15/,
Version 16.0.0,
August,
2024
.LP
[UAX#44]
Unicode Inc,
Unicode Character Database,
https://www.unicode.org/reports/tr44/,
Version 16.0.0,
August,
2024
.LP
[UAX#29]
Unicode Inc,
Unicode Text Segmentation,
https://www.unicode.org/reports/tr29/,
Version 16.0.0,
August,
2024
.LP
[9atom]
Erik Quanstrom,
http://www.9atom.org/,
https://web.archive.org/web/20201111224911/http://www.9atom.org/,
November,
2020