Advanced
Factor's Unicode library is in the basis/unicode/ directory, along with the encoding support in core/io/encodings/ and basis/io/encodings/. The library is fairly complete, but a few tasks of varying complexity remain to be implemented. We would expect a student to at least attempt all of them over the summer, but depending on the student's skills, completing only a portion of the tasks would be acceptable.
We want a normalized-stream type that wraps an underlying character stream and converts its output to Normalization Form C or Normalization Form D. Support for normalization is already in unicode.normalization, but the stream itself still needs to be written.
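The main subtlety of such a stream is that text arrives in chunks, and a base character may be written in one chunk while its combining marks arrive in the next. A minimal sketch of the buffering idea follows, written in Python rather than Factor purely for illustration. The class name NormalizedWriter and its methods are hypothetical, and the boundary rule (hold back everything from the last starter onward) is a simplification: it does not handle starters that compose with a preceding starter, such as Hangul jamo.

```python
import io
import unicodedata


class NormalizedWriter:
    """Hypothetical wrapper that normalizes text written to a stream."""

    def __init__(self, stream, form="NFC"):
        self.stream = stream
        self.form = form      # "NFC" or "NFD"
        self.pending = ""     # text held back until a boundary is seen

    def write(self, text):
        buf = self.pending + text
        # Find the last starter (canonical combining class 0). Everything
        # before it can no longer be affected by later combining marks.
        # Simplification: starters that compose with the previous starter
        # (for example Hangul jamo) are not handled here.
        cut = len(buf)
        while cut > 0:
            cut -= 1
            if unicodedata.combining(buf[cut]) == 0:
                break
        self.stream.write(unicodedata.normalize(self.form, buf[:cut]))
        self.pending = buf[cut:]

    def close(self):
        # Flush whatever is still held back.
        self.stream.write(unicodedata.normalize(self.form, self.pending))
        self.pending = ""


out = io.StringIO()
writer = NormalizedWriter(out, "NFC")
for chunk in ["caf", "e", "\u0301", "!"]:   # "e" and its accent arrive separately
    writer.write(chunk)
writer.close()
result = out.getvalue()
```

Here the decomposed "e" plus combining acute accent is emitted as the single precomposed character even though the two code points were written in separate calls.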
Unicode defines various code points for digits other than the usual ASCII 0-9. For example, many Indic scripts define their own exact equivalents of 0-9, and these are still used in certain contexts. Parsing numbers written with these code points would be useful. Perhaps this could even be integrated with the roman library for a high-level number parser.
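As an illustration of what such parsing involves, here is a small sketch in Python (not Factor). It relies on the decimal digit value that the Unicode Character Database assigns to each digit code point, exposed in Python as unicodedata.decimal. The function name parse_unicode_decimal is made up for this example, and the sketch does not reject strings that mix digits from several scripts.

```python
import unicodedata


def parse_unicode_decimal(text):
    """Parse a non-negative integer written with any Unicode decimal digits."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for non-digit characters.
        value = value * 10 + unicodedata.decimal(ch)
    return value


devanagari = parse_unicode_decimal("\u0967\u0968\u0969")   # Devanagari 1, 2, 3
arabic_indic = parse_unicode_decimal("\u0664\u0662")       # Arabic-Indic 4, 2
```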
The encoding API for converting strings to byte arrays and vice versa is mostly done (http://docs.factorcode.org/content/article-io.encodings.html).
The Unicode standard specifies complex rules for detecting the boundaries between sentences, and for finding possible and mandatory line breaks. These rules also have modifications for different locales, some of which are quite complex. For example, detecting possible line breaks in Thai requires a dictionary, since spaces are not normally used between words, yet a line break must not fall in the middle of a word. Detecting sentence boundaries is useful for navigation in a text editor. Detecting line breaks is useful for rendering text.
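To show why the real rules need to be complex, here is a deliberately naive sentence-boundary heuristic, sketched in Python. It is not the Unicode algorithm: it only reports a boundary where a terminator is followed by whitespace and then an uppercase letter. The names TERMINATORS and sentence_starts are invented for this example.

```python
TERMINATORS = ".!?"


def sentence_starts(text):
    """Return indices where a new sentence appears to start (naive heuristic)."""
    starts = []
    i = 0
    n = len(text)
    while i < n:
        if text[i] in TERMINATORS:
            # Skip the whitespace that follows the terminator.
            j = i + 1
            while j < n and text[j].isspace():
                j += 1
            # Require at least one space and an uppercase letter after it.
            if j > i + 1 and j < n and text[j].isupper():
                starts.append(j)
            i = j
        else:
            i += 1
    return starts


found = sentence_starts("Hi there. How are you? Fine.")
```

A heuristic like this wrongly splits after abbreviations such as "Mr." and has no answer for scripts without case or without spaces, which is the gap the standard's rules and their locale tailorings are meant to fill.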
To support right-to-left scripts like Hebrew and Arabic in Factor's editor widget, Factor needs to use the Unicode Bi-Directional Text Algorithm. This algorithm specifies how left-to-right and right-to-left scripts are mixed.
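The full algorithm is long, but its first step is easy to illustrate: the base direction of a paragraph is taken from its first strongly directional character. The Python sketch below (not Factor) uses the bidirectional class reported by unicodedata.bidirectional. The function name base_direction is hypothetical, and the sketch ignores directional isolates, which the real rule skips over.

```python
import unicodedata


def base_direction(paragraph):
    """Return "rtl" or "ltr" based on the first strong character.

    Simplified: directional isolates are not skipped.
    """
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):   # Hebrew-style and Arabic-style letters
            return "rtl"
    # No strong character found: default to left-to-right.
    return "ltr"


hebrew = base_direction("\u05e9\u05dc\u05d5\u05dd")              # Hebrew word
digits_then_hebrew = base_direction("123 \u05e9\u05dc\u05d5\u05dd")
latin = base_direction("Factor")
```

Digits and spaces are not strong characters, so a paragraph that opens with a number still takes its direction from the first letter.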
Many Unicode algorithms, such as collation and word break detection, should act differently in different locales. For example, Swedes sort ö after z, whereas Germans sort it before p. The Common Locale Data Repository (CLDR) has information about how these algorithms should be tailored to fit different locales. The CLDR also has information about, for example, standard date formats in different locales, and certain pieces of text which are commonly localized in applications. It would be a huge benefit to Factor applications to have access to this information.
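The ö example can be made concrete with a toy collation key, sketched here in Python. The two alphabet strings are hand-written stand-ins for tailoring data and are assumptions of this example, not CLDR content. Real collation works on several comparison levels rather than a single rank per letter.

```python
# Hypothetical single-level tailorings: position in the string is the rank.
GERMAN = "abcdefghijklmno\u00f6pqrstuvwxyz"            # ö placed just after o
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"  # å, ä, ö after z


def collation_key(alphabet):
    """Build a sort-key function from an ordered alphabet."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word):
        # Unknown characters sort after the whole alphabet, by code point.
        return [rank.get(ch, len(alphabet) + ord(ch)) for ch in word.lower()]

    return key


words = ["zon", "\u00f6l", "ost", "park"]
german_order = sorted(words, key=collation_key(GERMAN))
swedish_order = sorted(words, key=collation_key(SWEDISH))
```

The same word list comes out in a different order under each table, which is the behavior that locale data from the CLDR would drive in a real implementation.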
New and old parts of the Unicode library will need to be optimized for performance. In particular, the choices of data structures will have to be reexamined, and a compressed trie implementation may prove useful. Generally, data structures used for Unicode should take advantage of Factor's new capabilities for packed memory structures, which were not available when the original library was written.
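One simple relative of a compressed trie is the two-stage lookup table, a common way to store a per-code-point property compactly. The Python sketch below builds one for the general category. The names build_two_stage and lookup are invented here, and nothing about it reflects how Factor's library stores its data today. The idea is that many blocks of 256 code points are identical, so each distinct block is stored once and an index table points to it.

```python
import unicodedata

BLOCK_BITS = 8
BLOCK_SIZE = 1 << BLOCK_BITS      # 256 code points per block
CODE_POINTS = 0x110000            # size of the Unicode code space


def build_two_stage(prop):
    """Return (index, blocks) where identical blocks are shared."""
    index = []     # one entry per block of code points
    blocks = []    # distinct blocks only
    seen = {}      # block contents -> position in blocks
    for start in range(0, CODE_POINTS, BLOCK_SIZE):
        block = tuple(prop(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Two array reads: pick the block, then the entry inside it."""
    return blocks[index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]


index, blocks = build_two_stage(lambda cp: unicodedata.category(chr(cp)))
category_of_a = lookup(index, blocks, ord("A"))
```

In a language with packed arrays, both stages can be stored as flat byte or integer arrays, which is where packed memory structures would pay off.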
The UI and web frameworks could have a system for internationalization and localization where text strings are held in a resource file of some format. These resource files should be used based on user request, and seamlessly substituted in for everything that needs them. The hard part will be designing the right API for using the localized strings, and a good format for the resource files.
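As a starting point for discussion only, the lookup side of such a system might resemble the Python sketch below. Everything in it is hypothetical: the CATALOGS dictionary stands in for parsed resource files, and the function name localized and the fallback order (exact locale, then bare language, then a default) are assumptions of this example rather than a proposed design.

```python
# Stand-in for resource files that would be loaded from disk.
CATALOGS = {
    "en": {"greeting": "Hello, {name}!", "quit": "Quit"},
    "sv": {"greeting": "Hej, {name}!"},
}


def localized(key, locale, default_locale="en", **params):
    """Look up a string, falling back from "sv-SE" to "sv" to the default."""
    language = locale.split("-")[0]
    for candidate in (locale, language, default_locale):
        catalog = CATALOGS.get(candidate, {})
        if key in catalog:
            return catalog[key].format(**params)
    # Nothing found anywhere: return the key so the gap is visible.
    return key


swedish_greeting = localized("greeting", "sv-SE", name="Ada")
fallback_quit = localized("quit", "sv-SE")
```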
The student gains experience with internationalization and localization.
Factor would be better suited to writing applications that deal with non-ASCII text. All of the world's languages, including English, use non-ASCII text. Factor's Unicode support, which is already more advanced than that of most languages, would be world-class after these changes.
This revision created on Sat, 27 Feb 2010 16:34:43 by littledan