git: 9front

ref: 1795063d6eb787bace26fce4ccda0a8390e9af61
parent: 139bfdc4624b731fe14d08d730beee4b3fab98f3
author: Jacob Moody <moody@posixcafe.org>
date: Sat Sep 13 17:25:18 EDT 2025

/sys/doc: add unicode.ms

--- a/sys/doc/mkfile
+++ b/sys/doc/mkfile
@@ -41,6 +41,7 @@
 	nssec\
 	fossil\
 	backup\
+	unicode\
 
 ALLPS=${ALL:%=%.ps}
 HTML=${ALL:%=%.html} release3.html release4.html
@@ -75,7 +76,7 @@
 # troff gets some scary-looking errors but they're okay
 &.ps:D:	&.ms
 	mac=(-ms)
-	if(~ $stem comp utf 9 contents) mac=(-ms -mnihongo)
+	if(~ $stem comp utf 9 contents unicode) mac=(-ms -mnihongo)
 	{ echo $FONTS; cat $stem.ms } | pic | tbl | eqn | 
 		troff $mac | lp -dstdout > $target
 	cleanps $target
@@ -82,7 +83,7 @@
 
 %.trout:D:	%.ms
 	mac=(-ms)
-	if(~ $stem comp utf 9 contents) mac=($mac -mnihongo)
+	if(~ $stem comp utf 9 contents unicode) mac=($mac -mnihongo)
 	{ echo $FONTS; cat $stem.ms } | pic | tbl | eqn | 
 		troff $mac > $target
 
--- /dev/null
+++ b/sys/doc/unicode.ms
@@ -1,0 +1,696 @@
+.HTML "Modern Unicode Requires Modern Solutions"
+.TL
+Modern Unicode Requires Modern Solutions
+.AU
+Jacob Moody
+.AB
+Plan 9 has a long history with Unicode: its authors were
+directly involved in the creation and first system-wide
+implementation of UTF-8. This should come as no surprise given the
+focus on plain text that permeated UNIX and carried
+over into Plan 9. However, the world around Unicode has changed much
+since its early versions, becoming more complex and detail
+oriented as the version number climbs. While Plan 9 from Bell
+Labs was never updated to any major version of Unicode beyond the first,
+9front has made incremental progress over the years in adopting newer
+standards. We evaluate the current state of Unicode support,
+and of natural-language support at large, within the
+current 9front system: both where the system is now and where we
+would like it to go.
+.AE
+.EQ
+delim $$
+.EN
+.SH
+Background
+.PP
+.FS
+First presented in a slightly different form at the 10th International Workshop on Plan 9.
+.FE
+.LP
+Computers do not handle abstract notions such as natural
+language particularly well, so we rely on standards for representing natural
+language as a sequence of numbers. The mapping of
+numbers to natural-language characters is known as a character
+encoding. While many character encodings have been
+invented and used over the years, the one that has seen the most
+widespread adoption is Unicode. Unicode is unique among
+character encodings in that it aims to provide a
+single encoding for all natural languages, done in such a way that a
+single document can contain samples of every natural
+language without out-of-band code page switching or the like (as was
+common before Unicode).
+.LP
+With Unicode's goal of serving every natural language, and the
+vast differences between natural languages, there comes a need for
+careful definition of the terms the standard uses[Glossary].
+For clarity, let us put some of these definitions up front:
+.TS
+l l
+^ ^
+l l
+l l
+l l
+^ ^.
+Abstract Character	A unit of information used for the organization, control,
+	or representation of textual data.
+Codepoint	Any value within the Unicode codespace. In keeping with
+	traditional Plan 9 terminology, we use rune as a synonym for codepoint.
+Encoded Character	A codepoint which has been mapped to an abstract character.
+Grapheme	A minimally distinctive unit of writing in the context of a particular
+	writing system.
+Glyph	The actual, concrete image of a glyph representation having been
+	rasterized or otherwise imaged onto some display surface.
+.TE
+.LP
+With that said, let us define how these terms relate. Broadly,
+abstract characters are the information we wish to encode in the
+character encoding: everything from graphemes, to grammar
+markings, to components of a letter, to informational control
+characters. Unicode gives us a large bit space to work with; a
+codepoint is a single index into that bit space, and the standard
+gives us the mapping relating these codepoints to abstract
+characters. A large majority (but not all) of encoded characters
+represent graphemes, though graphemes are just one possible category
+that an encoded character can fall into. Unicode only defines these
+encoded characters; their rendering and presentation are the
+responsibility of the rendering layer, which takes
+encoded characters as input and outputs glyph images.
+.LP
+The code space of Unicode is currently 21 bits, and with computing
+hardware mostly rounding that up to 32 bits, storing large amounts of
+text one word per codepoint can take a non-trivial amount of space.
+For this reason the Unicode
+standard also defines a few compact encoding schemes, the most
+popular of which is
+.CW UTF-8 ,
+which encodes the Unicode code space in a variable number of bytes, up to four per codepoint.
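+.LP
+For reference, the four
+.CW UTF-8
+byte patterns and the number of codepoint bits each carries are:
+.P1
+0xxxxxxx                              7 bits
+110xxxxx 10xxxxxx                    11 bits
+1110xxxx 10xxxxxx 10xxxxxx           16 bits
+11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  21 bits
+.P2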
+.LP
+Unicode also allows the same abstract character to be represented
+by more than one codepoint or sequence of codepoints. A large
+chunk of these characters are those which have been identified as
+being composed of two or more subcomponents. The most common
+examples are characters that carry diacritical markings, which
+can either be represented as a single "precomposed"
+codepoint or as a series of codepoints: the base codepoint first,
+followed by codepoints which denote the individual diacritical markings.
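+.LP
+For example, the character é can be written either way:
+.P1
+U+00E9         LATIN SMALL LETTER E WITH ACUTE
+U+0065 U+0301  LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
+.P2
+Both sequences denote the same abstract character.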
+.LP
+Because there can be multiple representations of the
+same abstract character, a method is needed to normalize a series of
+codepoints into a consistent representation for semantic comparison.
+The Unicode standard[UAX#15] defines a process for conducting this
+normalization, and provides four different forms to normalize to:
+.TS
+l l.
+NFC	Precomposed codepoints
+NFD	Decomposed codepoints
+NFKC	Precomposed compatibility codepoints
+NFKD	Decomposed compatibility codepoints
+.TE
+.LP
+The compatibility forms additionally convert specially
+formatted characters, such as codepoints for specific fonts,
+rotated variants, and so on, into their base
+counterparts. This operation loses some information about the
+original codepoints and is therefore kept separate from canonical
+normalization. There are some additional rules that are part of normalization
+so as to enable checking the semantic equivalence of any two sequences of codepoints.
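+.LP
+As a concrete example, U+2468 CIRCLED DIGIT NINE has only a
+compatibility decomposition, so the canonical and compatibility
+forms treat it differently:
+.P1
+NFC, NFD     U+2468    unchanged
+NFKC, NFKD   U+0039    the plain digit 9
+.P2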
+.SH
+Existing Unicode in Plan 9
+.LP
+Plan 9 was an early adopter of Unicode, but never reached full
+compliance with any standard version of Unicode during its history.
+Pike et al. put in a lot of work to handle UTF-8
+consistently within the system[Pike93], and a majority of the system was
+updated to handle codepoints. Plan 9's initial support could
+be described succinctly as "21-bit ASCII encoded as \f(CWUTF-8\fR",
+where each codepoint is assumed to be exactly one abstract
+character and interfaces broadly assume that mutations to text (e.g.
+backspace in a text editor and the like) operate only on single code
+points as well. The result is that the system can only really
+handle precomposed Unicode in any sort of ergonomic way; however, Plan
+9 is very consistent in this level of support and has been ever since its
+last official release.
+.LP
+Plan 9 had initially been written to support only Unicode version 1.0
+(a 16-bit code space), so one of the earliest contributions from the community
+was to update this to the modern 21-bit code space. This work was first
+done in 9atom[9atom], and was later done in 9front. It included
+work to make parts of the system Unicode aware that previously
+were not; the version of awk which ships with the system is a notable
+example. There were also some smaller changes, such as updates to the
+input "compose" sequences to allow a wider range of codepoints and
+to let users input codepoints directly by typing out their
+value in hexadecimal. Even with that work, however, there
+remained portions of 9front that had yet to be updated past
+Unicode version 1.0.
+.SH
+Using Tables to Keep Pace
+.LP
+The work being presented in this paper largely started with wanting to
+update some of the existing interfaces within the system, namely the
+.CW isalpharune (2)
+set of functions, in order to not only bring them up to date,
+but also allow their adaptation to future versions of the standards.
+To that end, the Unicode Consortium publishes a
+series of informational databases[UAX#44] in an awk-friendly format. Plan 9 users
+may be familiar with one of these data files, as the original
+.CW /lib/unicode
+was a cut-down and reformatted version of the Unicode
+Character Database (UnicodeData.txt). These files are updated and
+maintained as Unicode versions change, so the natural course of
+action seemed to be to read these files as input and generate code that
+could then be used to implement whatever functions we may need.
+.LP
+To better illustrate the input data, below is an excerpt of UnicodeData.txt
+for U+2468. Each line of UnicodeData.txt corresponds to a single codepoint,
+with various properties of said codepoint separated by semicolons. Both
+the keys and values for each field are documented by the Unicode standard[UAX#44].
+.P1
+; look -b 16 2468 /lib/ucd/UnicodeData.txt 
+2468;CIRCLED DIGIT NINE;No;0;ON;<circle> 0039;;9;9;N;;;;;
+.P2
+.LP
+Parsing these text files is largely unremarkable: we read
+one line at a time and record properties and conversions as we
+go. The challenge comes in picking and generating a
+reasonable data structure for the function implementations to use,
+balancing the size of the data structure in memory against the
+runtime complexity of walking it. Plan 9 binaries
+are currently quite small in comparison to the hardware available, so
+we decided to err on the side of increasing binary size
+in exchange for better performance.
+Because of this we implemented direct lookup
+tables, using the codepoint as an index and getting an O(1) lookup time
+for the vast majority of keys.
+.LP
+With that said, multiple arrays each with
+$2 sup 21$
+entries are quite
+untenable even given the potential performance savings. To mitigate
+this we can use the distribution of abstract characters
+over the Unicode code space to our advantage. The code space is
+divided into chunks, with each natural language roughly
+contained within its own contiguous range of codepoints. This
+means that for many of the properties we care about there are hot spots
+and cold spots throughout the code space. As an example,
+consider the
+.CW toupperrune ()
+function, which is largely applicable only
+to natural languages that feature distinct letter cases. Many
+East Asian languages have no concept of casing, so those regions
+of the Unicode code space need no defined values. This makes
+the lookup table quite sparse, and therefore a good candidate
+for some method of compression. This is not unique to letter
+casing; in fact all of the lookup tables for
+.CW isalpharune (2)
+end up being sparse.
+.LP
+To compress the lookup table, we use a data structure that [Gillam02]
+refers to as a compact array. We start by breaking the lookup
+table into equally sized segments, then walk those segments and find
+any overlap (full or partial) between them. We can then remove the
+overlapped entries from the table. In order to access these
+overlapped segments we need a higher-level lookup table which
+points into the segments of the lower table. To use a
+single key (a codepoint in our case) for indexing into both tables,
+we split the bits of the key into two portions: the
+top bits index into the top table, and the result of that lookup is added to
+the bottom bits in order to index into the bottom table.
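+.LP
+The following is a minimal sketch (in Plan 9 C) of how one level of this
+compaction might be generated. It only reuses a segment when it appears in
+full somewhere in the data already emitted; a more thorough generator would
+also reuse a partial overlap with the tail of the data array, and the real
+generator in the system may differ in its details.
+.P1
+enum { B = 16 };	/* entries per segment; must divide the table size */
+
+/* build data[] and idx[] from the full table tab[] of n entries;
+ * lookup afterwards is data[idx[c/B] + c%B] */
+int
+compact(int *tab, int n, int *data, int *idx)
+{
+	int i, j, k, ndata;
+
+	ndata = 0;
+	for(i = 0; i < n; i += B){
+		/* search the data emitted so far for this segment */
+		for(j = 0; j+B <= ndata; j++){
+			for(k = 0; k < B && data[j+k] == tab[i+k]; k++)
+				;
+			if(k == B)
+				break;
+		}
+		if(j+B <= ndata)
+			idx[i/B] = j;		/* found: share it */
+		else{
+			memmove(data+ndata, tab+i, B*sizeof(int));
+			idx[i/B] = ndata;
+			ndata += B;
+		}
+	}
+	return ndata;	/* entries kept in data[] */
+}
+.P2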
+.LP
+Given how large the Unicode code space is, it becomes beneficial
+to do not just one layer of compaction but two, with the higher
+lookup table compacted by the same method to yield a new top-level
+lookup table. This splits the codepoint into three parts:
+the result of the top-level table is used to index the
+middle table, which in turn indexes the bottom table
+containing our desired value. The following code illustrates
+this lookup method:
+.P1
+#define upperindex1(x) (((x)>>(4+7))&0x3FF)
+#define upperindex2(x) (((x)>>4)&0x7F)
+#define upperoffset(x) ((x)&0xF)
+#define upperlkup(x) (_upperdata[_upperidx2[_upperidx1[upperindex1(x)]
+	+ upperindex2(x)] + upperoffset(x)])
+.P2
+In this example, the top ten bits of the codepoint index the top-level
+table; that result is added to the next seven bits of the codepoint to
+index the middle table; and the middle table's result is added to the
+remaining four bits to index the data. How many bits to give each
+section is not obvious, so we brute-forced the smallest table size by
+testing all possible splits.
+.LP
+Of course, for this to be effective we need the values of our
+lookup table to be sparse. For simple property values this
+is typically already the case, and for functions like
+.CW toupperrune ()
+we can instead store the delta between the current rune and the desired
+rune, which gives us nice runs of zeros for anything without a
+defined transformation. Using this method we obtained the following
+results for the
+.CW isalpharune (2)
+set of functions:
+.TS
+l l
+l n.
+Table	Size (bytes)
+toupper	12436
+tolower	11754
+totitle	12180
+is*rune	24184
+total	60554
+.TE
+.LP
+The
+.CW is*rune ()
+table here merges all of the different categories into
+one table as a bit field, one bit per possible category, so the result
+is less sparse than the other tables. We find the expense of 60KB
+well worth the ease of upgrade and the constant-time lookups.
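+.LP
+To illustrate how such a delta table might be consumed, a case mapping
+reduces to a single table access plus an addition. The following minimal
+sketch uses the upperlkup() macro shown above; the bounds check and the
+signedness of the stored delta are assumptions of the sketch rather than
+a description of the shipped implementation.
+.P1
+Rune
+toupperrune(Rune r)
+{
+	if(r > Runemax)
+		return r;	/* outside the Unicode code space */
+	/* the stored delta is zero for runes with no uppercase mapping */
+	return r + upperlkup(r);
+}
+.P2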
+.SH
+Keeping Composure
+.LP
+Recall that normalization is the process of transforming a series of
+Unicode codepoints into a canonical variant for a given target form.
+To do this, we need tables that map individual codepoints
+to their transformed values for each form: their composed or decomposed
+variants. We can use tables largely similar to the ones described above, with
+some extras where needed to handle specific edge cases.
+.LP
+For decomposition our lookup key is the same as before: the codepoint
+of the rune we wish to decompose. The value of the table, however, gets
+a bit more complex. The most straightforward approach would be to
+combine the two new runes into one value, but this is fairly painful
+for the table size. We would need a 42-bit value, which would have to be
+rounded up to 64 bits. Not only would we waste over a third of the storage
+space due to the rounding, but using 64-bit values would be quite
+onerous for our 32-bit systems. As it turns out, however, there is a
+heavy bias for the decomposed runes to reside within the lower 16-bit
+code space. In the latest version of Unicode at the time of
+writing (16.0) there are only 158 runes whose decompositions reach
+above the 16-bit code space. It is easiest, then, to have our
+compact array use 32-bit values (two 16-bit codepoints), and to keep
+an exception table which stores those 158 runes.
+The base table values for these exception runes are instead set to
+a small portion of the Unicode private use space (0xEEEE on up),
+with the offset from the start of that range used to index the list of exceptions.
+This allows us to maintain our constant time lookups, even when
+accessing members of the exception range.
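+.LP
+The resulting lookup might look like the following sketch. The names
+decomplkup() and _decompex[], and the exact way the exception marker is
+packed into the table value, are illustrative assumptions rather than
+the real table layout.
+.P1
+typedef struct Rpair Rpair;
+struct Rpair {
+	Rune a;
+	Rune b;
+};
+
+enum { Excbase = 0xEEEE };	/* first private-space marker */
+
+extern u32int decomplkup(Rune);	/* stands in for the compact-array access */
+extern Rpair _decompex[];	/* exception list */
+
+int
+decomprune(Rune r, Rune *a, Rune *b)
+{
+	u32int v;
+
+	v = decomplkup(r);
+	if(v == 0)
+		return 0;		/* no decomposition defined */
+	*a = v>>16;
+	*b = v & 0xFFFF;
+	if(*a >= Excbase){
+		/* marker: the real runes live outside the 16-bit space */
+		*b = _decompex[*a - Excbase].b;
+		*a = _decompex[*a - Excbase].a;
+	}
+	return 1;
+}
+.P2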
+.LP
+Recomposition required a different approach under similar constraints
+to decomposition. Instead of packing two runes into the output of
+the table, we need to pack two runes into the input of our
+lookup and produce one rune as output. The bias towards the 16-bit
+code space remains, so we can limit our main lookup table to taking
+two 16-bit runes as input and outputting a single 16-bit rune.
+For exceptions (where either the input runes or the output rune exceed 16 bits)
+we use a similar exception table; in this case, however, it has to be searched
+linearly, because otherwise we would need to increase the table size to accommodate
+the larger lookup keys. These differences from decomposition make it much harder to make use of
+our existing lookup table structure, so we instead opted to implement a hash
+table.
+.LP
+A hash table is going to have a worse average lookup time than
+our compact arrays (assuming a moderate number of collisions), but stays within the
+same ballpark of performance and size. Because
+we generate this table at build time instead of at run time, we have
+to find a way of embedding the entire structure in a contiguous block
+of memory. This is easy to accomplish for the initial probe of the
+hash function, where we have one large chunk of memory: we hash our
+input key and take it modulo the size of our initial bucket array.
+For collisions, however, we need some way of chaining the
+entries so they can be searched until a key matches. For this we
+create another array that is appended onto the first.
+The main entries are chained to the
+collisions by embedding a numeric offset from the start of the
+collision space. This results in a hash table entry that looks like the
+following:
+.P1
+struct KV {
+	uint key;
+	ushort val;
+	ushort next;
+};
+.P2
+In the interest of saving as much space as possible in the generated
+array, we pack this struct into pairs of integers: the
+first is the two 16-bit codepoints merged together, and the second
+is the value merged with the optional offset into the collision space.
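+.LP
+A lookup over such a table might look like the following sketch, written
+against the unpacked KV layout above for clarity. The 512-bucket size is
+described below; the hash function name and the use of a zero next field
+to mark the end of a chain are assumptions of this sketch.
+.P1
+typedef struct KV KV;		/* the entry layout shown above */
+
+enum { Nbucket = 512 };		/* primary buckets */
+
+extern KV kvtab[];		/* Nbucket entries, then the collision space */
+extern uint khash(uint);	/* the hash function used by the generator */
+
+Rune
+recomppair(Rune left, Rune right)
+{
+	uint key;
+	KV *kv;
+
+	key = left<<16 | right;		/* both runes fit in 16 bits here */
+	kv = &kvtab[khash(key) % Nbucket];
+	for(;;){
+		if(kv->key == key)
+			return kv->val;		/* the composed rune */
+		if(kv->next == 0)
+			return 0;		/* end of chain: no composition */
+		kv = &kvtab[Nbucket + kv->next];
+	}
+}
+.P2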
+.LP
+As with choosing the constants for the compact arrays, we found a good
+balance between the size of the initial hash array and the collision array
+largely through brute force. Given the Unicode standard at the time
+of writing, we use 512 initial buckets, which pushes 555 entries
+into the collision array. With a hash function that distributes keys
+reasonably evenly, this averages out to roughly one
+collision per entry. Using these two data structures results in
+the following generated table sizes:
+.TS
+l l
+l n.
+Table	Size (bytes)
+decomp	20108
+recomp	8536
+.TE
+.SH
+Normalizing Theory
+.LP
+With composition handled we can start to tackle normalizing a
+string to one form or the other. Implementing Unicode
+normalization can be quite finicky, and there are plenty of
+considerations for how the input must be fed in order to produce conformant
+output. Before we get into those, however, we need some additional
+tables that will be used in the normalization process.
+.LP
+The first additional table we need is one that stores the
+.I "Combining Character Class"
+.CW (ccc)
+of a codepoint; that is, the value which identifies
+the attaching property of the character. A non-zero value
+signifies that the codepoint is an attacher, with the specific value
+denoting the group of attachers it belongs to.
+Typically only one attacher from each group may be considered
+for attachment to a base codepoint.
+The ccc value is also used to ensure the ordering of attaching
+codepoints is consistent in
+.CW NFD
+formatted strings[UAX#15].
+The other value we need is the Quick Check (qc)
+value of a codepoint, which is used to determine where a normalization
+context ends in certain situations; we elaborate on this further
+in the following section. Our existing compact
+array structure works well enough for these lookup tables as is,
+resulting in the following sizes:
+.TS
+l l
+l n.
+Table	Size (bytes)
+ccc	9738
+qc	7878
+.TE
+.LP
+The abstract logic for normalization is described in [UAX#15] and is
+as follows:
+.P1
+1. Decompose every rune possible in the string until it
+   can no longer be decomposed
+2. For codepoints without a ccc of zero, conduct a
+   stable sort of them based on their ccc value
+3. If we are normalizing to NFC, recursively recompose code
+   points in the string until there is nothing more to recompose
+.P2
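+.LP
+As a short worked example, consider the two-codepoint input U+1E0B U+0323,
+a d with dot above followed by a combining dot below:
+.P1
+input:   U+1E0B U+0323         d with dot above, combining dot below
+step 1:  U+0064 U+0307 U+0323  U+1E0B decomposes into d + dot above
+step 2:  U+0064 U+0323 U+0307  dot below (ccc 220) sorts before dot above (ccc 230)
+step 3:  U+1E0D U+0307         for NFC, d + dot below recomposes to d with dot below
+.P2
+The NFD result is the output of step 2; the NFC result is the output of step 3.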
+.LP
+The main challenge of normalization, then, is designing an algorithm
+that does all of this in a single pass over the string. To accomplish
+this, we iterate through the string and accumulate runes into a context.
+A context conceptually localizes the effect of normalization, breaking
+the work into chunks that can be processed independently.
+This allows very large strings to be processed piecewise and within
+reasonable memory bounds. We implement the context as a stack,
+which is cleared and reused for successive contexts as we parse through the string.
+The process by which we construct the first part of a context is as follows:
+.P1
+1. Take the next rune from the string and push it onto our stack
+2. If the following rune has a ccc not equal to zero,
+   push it on the stack. Repeat until we encounter a rune with a
+   ccc of zero.
+3. Perform a stable sort of the stack by the ccc value of the codepoints
+.P2
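+.LP
+Step 3 calls for a stable sort, and the stacks involved stay short, so a
+simple insertion sort suffices. The following is a minimal sketch, with
+cccof() standing in for the ccc table lookup:
+.P1
+static void
+cccsort(Rune *stk, int n)
+{
+	int i, j;
+	Rune r;
+
+	for(i = 1; i < n; i++){
+		r = stk[i];
+		/* strict > keeps runes with equal ccc in their original order */
+		for(j = i; j > 0 && cccof(stk[j-1]) > cccof(r); j--)
+			stk[j] = stk[j-1];
+		stk[j] = r;
+	}
+}
+.P2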
+.LP
+At this point, if we are normalizing to the decomposed form (NFD), our context
+is finished and may be flushed; we then continue the process with the next
+rune in the string.
+.LP
+If we are recomposing, we must first classify which type of context we
+have, and then split into one of two recomposition strategies
+accordingly. Strategy A is used for sequences of codepoints
+which have attachers (non-zero ccc), and strategy B for sequences
+of composable base runes (ccc of zero).
+.LP
+.P1
+ccc of non zero/Strategy A
+1.  Use the base codepoint as the left hand side input to our
+    recomposition
+2.  Use the next immediate codepoint as the right hand side input
+    to our recomposition.
+2a. If a match is found, replace the base rune with the new result
+    and discard the right hand side codepoint. The new codepoint will
+    be used as the base codepoint for further iterations.
+2b. If a match isn't found, skip past all other codepoints
+    in the stack that have the same ccc as the failed match.
+3.  Return to step 1 and continue to attempt matches with the
+    remaining codepoints in the stack until the stack is depleted.
+.P2
+.LP
+.P1
+ccc of zero/Strategy B
+1.  Continue to pull codepoints on to the stack from the input
+    string until we reach a boundary, as described by the
+    Quick Check value.
+2.  Walk through each codepoint in the stack, attempting to
+    compose it with the codepoint immediately to the right of it.
+2a. If a match is found, replace the left hand side codepoint with
+    the result, and discard the right hand side codepoint.
+2b. If no match is found, continue through the stack with
+    the current right hand side becoming the new left hand side.
+3.  If any recomposition was made, go back to step 1 and repeat
+    until a pass completes without any recompositions.
+.P2
+.LP
+After either of these two strategies the resulting stack can be flushed out
+as normalized output. The requirement to bypass other runes which have
+the same ccc (step 2b in strategy A) is stipulated by the standard[UAX#15]
+and is necessary for a conformant implementation.
+.LP
+There does exist an edge case in normalization, as the standard itself
+puts no limit on the number of attaching codepoints that may be given to a
+base codepoint. This can be seen in what is typically
+referred to as "zalgo text", where a single base codepoint is given
+a large number of attachers to produce a "glitchy" appearance.
+While no natural language
+relies on a large number of attaching codepoints, it remains an edge
+case our algorithm must come to terms with. In order to perform
+normalization within bounded memory we need to pick some
+reasonable maximum number of attaching codepoints that we plan to support.
+Unicode defines a subset of normalized text referred to as
+"Stream-Safe text"[UAX#15], which dictates that no more than 30 attaching code
+points may be given to a base codepoint; sequences longer than that are
+broken up with
+.CW U+034F
+(Combining Grapheme Joiner). We implement this recommendation, which means our
+algorithm will insert a CGJ into the text stream when it encounters long
+enough runs. This is required if we want to output properly normalized
+text, as without it the ccc context would grow to a length that would
+be untenable to sort. The CGJ acts as a new base, and codepoints
+which attach to it can be sorted independently of those on the previous base
+codepoint. The rendering layer should be able to interpret the
+attaching codepoints on the CGJ as part of the previous base
+codepoint.
+.SH
+Normalization Interface
+.LP
+With the general algorithm for normalization figured out, we need to
+decide on an interface for exposing it to other programs. Due to the
+nature of normalization this can be somewhat complicated to do well.
+As illustrated earlier, a context keeps expanding
+until we find the beginning of the next context: in strategy
+A until we find a ccc of zero, and in strategy B until we
+find a qc boundary. This means that for most normalization we need to
+know when the input ends in order to flush the last context, since
+there is no next context to act as a terminator. This is relatively
+easy if we are operating on an entire string at once: we walk
+until we reach the end of the input string and use that as the signal to
+flush whatever remaining context we may be holding on to. If we are
+dealing with a stream of input, from a file, pipe, or network connection,
+then we may wish to normalize one chunk at a time. We then need to carry
+the remaining context of one chunk over to the beginning of the next.
+Other implementations handle this issue by providing a
+concatenation function, which carefully conjoins one normalized string
+to another, re-contextualizing only the edges. We decided
+instead to implement only a streaming interface, which can be used for
+both concatenation and streaming input. While pushing two strings
+that have already been independently normalized back through the streaming interface
+is not as efficient as looking only at the boundaries, we expect
+the normalization implementation itself to be fast enough that
+the performance impact is negligible in the majority of cases.
+.LP
+The following streaming interface is what we settled on:
+.P1
+typedef struct Norm Norm;
+struct Norm {
+	... /* private */
+};
+extern void norminit(Norm*, int comp, void *ctx, long (*get)(void*));
+extern long normpull(Norm*, Rune *dst, long sz, int flush);
+.P2
+.LP
+In addition we provide the following wrapper functions which target
+normalization for smaller fixed set inputs:
+.P1
+extern long runecomp(Rune *dst, long ndst, Rune *src, long nsrc);
+extern long runedecomp(Rune *dst, long ndst, Rune *src, long nsrc);
+extern long utfcomp(char *dst, long ndst, char *src, long nsrc);
+extern long utfdecomp(char *dst, long ndst, char *src, long nsrc);
+.P2
+.LP
+In the streaming interface, the
+.CW flush
+flag of
+.CW normpull ()
+controls whether the remaining context is preserved or flushed once we reach an EOF from the
+.CW get ()
+function. This allows callers to normalize fixed chunks
+at a time, then call with the flush argument set to non-zero when
+there is no more input to give. The
+.CW ctx
+argument is a caller-supplied pointer which is passed on to the
+.CW get ()
+function. If
+.CW comp
+is non-zero the string is composed, meaning the output is
+.CW NFC
+instead of
+.CW NFD .
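+.LP
+As a usage sketch, the following program normalizes standard input to
+NFC and writes the result to standard output. It assumes the declarations
+above are in scope, that get() reports end of input with a negative return
+value (as Bgetrune() does at end of file), and that normpull() returns the
+number of runes produced, returning zero once the input is exhausted.
+.P1
+#include <u.h>
+#include <libc.h>
+#include <bio.h>
+
+static Biobuf bin;
+
+static long
+getrune(void *ctx)
+{
+	USED(ctx);
+	return Bgetrune(&bin);	/* negative at end of file */
+}
+
+void
+main(void)
+{
+	Norm n;
+	Rune buf[64];
+	long i, nr;
+
+	Binit(&bin, 0, OREAD);
+	norminit(&n, 1, nil, getrune);	/* comp!=0: produce NFC */
+	/* flush=1: no more input will follow this stream */
+	while((nr = normpull(&n, buf, nelem(buf), 1)) > 0)
+		for(i = 0; i < nr; i++)
+			print("%C", buf[i]);
+	exits(nil);
+}
+.P2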
+.LP
+The Unicode Consortium publishes an extensive set of tests[UAX#44] to ensure
+that normalization implementations handle all of the edge
+cases which show up in realistic input. We verified
+our implementation against this test suite, and added support
+for normalized strings to
+.CW tcs (1).
+.SH
+Preparation and Future Work
+.LP
+While normalization itself is not on the critical path to
+a better conforming Unicode interface, the infrastructure
+needed to implement it provides the foundation for more conformant
+interfaces.
+For this paper we want to focus on what we think is
+the next step: a more conformant interface for Plan 9
+programs which use
+.CW frame (2).
+These are
+.CW rio (1),
+.CW acme (1),
+and
+.CW samterm (1).
+.LP
+As implemented today, these programs assume that each codepoint is its
+own abstract character, and extra attaching codepoints are treated as
+individual, separate characters. Instead of attaching combining
+characters to their base codepoint, rio, sam, and friends render
+them on their own, resulting in a dangling diacritic in the best case
+and a stray PJW in the worst. This leaves decomposed text
+largely unusable within these interfaces. If we would like these
+programs to handle such text better, we need interfaces that allow them
+to find logical breaks within the text. Unicode defines[UAX#29] two
+types of breaks: grapheme breaks and word breaks.
+.LP
+Grapheme breaks are the boundaries between codepoint sequences that a
+user would consider a single logical abstract character. The codepoints
+between these breaks (referred to as a grapheme cluster)
+should be treated as one abstract character by editors.
+For example, an editor may choose to interpret a backspace as the
+deletion of codepoints up to the previous grapheme break. Likewise,
+an editor should not put a line break within a grapheme cluster.
+Another consideration is whether
+.CW sam (1)'s
+structured regular expressions should be changed to interpret grapheme
+clusters as one logical character or as independent codepoints.
+.LP
+Word breaks are similar to grapheme breaks, but are used exclusively for
+finding where in a series of codepoints an interface may choose to
+place a line break. This prevents cases like placing a line break
+halfway through a run of digits, or in general in places which might
+confuse a native reader of the text. Of course it is not
+always possible to abide by these rules, as the width an editor is
+working with may be shorter than the full context of a word break
+(consider a file full of 1000 1's), so at best they serve as an
+optimistic placement. While nice to have for text editors, we
+believe a more important application lies in typesetting documents.
+.LP
+The algorithms for finding these breaks are less involved than those for
+normalization, and we were again able to use our compact arrays to
+store the required per-codepoint lookup information. As part of this
+paper we implement the following set of functions for finding grapheme
+and word breaks within an input string:
+.P1
+Rune*  runegbreak(Rune *s);
+Rune*  runewbreak(Rune *s);
+char*  utfgbreak(char *s);
+char*  utfwbreak(char *s);
+.P2
+The return values of these functions point to the beginning of the
+next context.
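+.LP
+For example, counting the grapheme clusters in a NUL-terminated rune
+string becomes a short loop. This sketch assumes runegbreak() always
+advances by at least one rune and returns a pointer to the terminating
+NUL at the end of the string:
+.P1
+int
+countclusters(Rune *s)
+{
+	int n;
+
+	for(n = 0; *s != 0; n++)
+		s = runegbreak(s);	/* skip one grapheme cluster */
+	return n;
+}
+.P2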
+.LP
+Of course these functions are only the start of implementing proper
+support in our editors and interfaces. Future work may look to
+reimplement
+.CW frame (2)
+to make use of these functions for implementing editors that work
+well in languages other than English. We have so far also
+largely ignored the rendering portions of these interfaces. It is
+another important step required for conformance but the rules for
+rendering extends beyond what is specified in Unicode and is best
+saved for a future paper which explores the full complexity of that
+problem space.
+.SH
+Acknowledgements
+.LP
+Special thanks to Ori Bernstein, cinap_lenrek, and the rest of the
+9front community for their support in designing and thinking about
+this problem space. The work as presented would not have been
+possible without their advice and review.
+.SH
+References
+.LP
+[Gillam02]
+Richard Gillam,
+``Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard'',
+.IR "Addison-Wesley Professional" ,
+January,
+2002
+.LP
+[Glossary]
+Unicode Inc,
+Glossary of Unicode Terms,
+https://www.unicode.org/glossary/
+.LP
+[Pike93]
+Rob Pike, Ken Thompson,
+``Hello World or Καλημέρα κόσμε or \f(Jpこんにちは 世界\fP'',
+January,
+1993
+.LP
+[UAX#15]
+Unicode Inc,
+Unicode Normalization Forms,
+https://www.unicode.org/reports/tr15/,
+Version 16.0.0,
+August,
+2024
+.LP
+[UAX#44]
+Unicode Inc,
+Unicode Character Database,
+https://www.unicode.org/reports/tr44/,
+Version 16.0.0,
+August,
+2024
+.LP
+[UAX#29]
+Unicode Inc,
+Unicode Text Segmentation,
+https://www.unicode.org/reports/tr29/,
+Version 16.0.0,
+August,
+2024
+.LP
+[9atom]
+Erik Quanstrom,
+http://www.9atom.org/,
+https://web.archive.org/web/20201111224911/http://www.9atom.org/,
+November,
+2020
--