Factor/GSoC/2010/Improve Unicode library
Factor's Unicode library is in the
Normalized output streams
We want a
Unicode number input
Unicode defines various code points for digits other than the usual ASCII 0-9. For example, many Indic scripts define their own exact equivalents of 0-9, which are still used in certain contexts. Parsing numbers with these code points would be useful. Perhaps this could even be integrated with the
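As a sketch of what such parsing involves, here is the same idea in Python (used here only for illustration; the proposal is about Factor): Python's `int()` already accepts any Unicode decimal digit, and the per-code-point digit values come from the Unicode character database via `unicodedata`.

```python
import unicodedata

# int() accepts Unicode decimal digits from any script,
# e.g. Devanagari १२३ or Arabic-Indic ٤٢.
print(int("१२३"))  # Devanagari digits for 123
print(int("٤٢"))   # Arabic-Indic digits for 42

# Per-code-point digit values come from the Unicode character database:
print(unicodedata.digit("५"))  # Devanagari five -> 5
```

A Factor implementation would consult the same digit-value property from the character database.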
The encoding API for converting strings to byte arrays and vice versa is mostly done (http://docs.factorcode.org/content/article-io.encodings.html).
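For readers unfamiliar with that API, the core operation is the round trip between strings and byte arrays under a named encoding. A minimal Python illustration of the same round trip (not Factor code):

```python
# String -> byte array and back again under an explicit encoding,
# the operation Factor's io.encodings API provides.
data = "naïve".encode("utf-8")
print(list(data))            # the raw UTF-8 byte values
print(data.decode("utf-8"))  # back to the original string
```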
Line and sentence breaks
The Unicode standard specifies complex rules for detecting sentence boundaries and for finding possible and mandatory line breaks. These rules have locale-specific modifications, some of which are quite complex. For example, detecting possible line breaks in Thai requires a dictionary, since spaces are not normally used between words, yet a line break must not fall in the middle of a word. Detecting sentence boundaries is useful for navigation in a text editor; detecting line breaks is useful for rendering text.
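To give a feel for the easiest part of this, here is a Python sketch of only the *mandatory* break rules from UAX #14; everything locale- or dictionary-dependent (like Thai word breaking) is deliberately out of scope:

```python
# Code points whose line-break class (BK, CR, LF, NL) forces a break.
MANDATORY = {"\u000A", "\u000B", "\u000C", "\u000D", "\u0085", "\u2028", "\u2029"}

def mandatory_breaks(text):
    """Return indices after which a line break is required."""
    breaks = []
    for i, ch in enumerate(text):
        if ch in MANDATORY:
            # CR followed by LF counts as a single break (rule LB5).
            if ch == "\u000D" and i + 1 < len(text) and text[i + 1] == "\u000A":
                continue
            breaks.append(i + 1)
    return breaks

print(mandatory_breaks("one\r\ntwo\u2028three"))  # [5, 9]
```

The full algorithm adds dozens of pair-wise rules between break classes, which is where the real implementation work lies.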
Bi-directional text
To support right-to-left scripts like Hebrew and Arabic in Factor's editor widget, Factor needs to implement the Unicode Bidirectional Algorithm. This algorithm specifies how mixed left-to-right and right-to-left text is laid out.
Tailoring and CLDR
Many Unicode algorithms, such as collation and word break detection, should act differently in different locales. For example, Swedes sort ö after z, whereas Germans sort it before p. The Common Locale Data Repository (CLDR) has information about how these algorithms should be tailored to fit different locales. The CLDR also has information about, for example, standard date formats in different locales, and certain pieces of text which are commonly localized in applications. Access to this information would be a huge benefit to Factor applications.
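The Swedish example above can be made concrete with a toy tailoring in Python. The alphabet string below is a deliberate simplification; real tailorings are expressed as CLDR collation rules layered over the default Unicode collation table:

```python
# Swedish tailoring: å, ä, ö sort as separate letters after z.
SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word):
    """Map a lowercase word to a sortable list of alphabet positions."""
    return [SWEDISH_ORDER.index(c) for c in word]

words = ["zebra", "öl", "apa"]
print(sorted(words, key=swedish_key))  # ö sorts after z
```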
Performance
New and old parts of the Unicode library will need to be optimized for performance. In particular, the choices of data structures will have to be reexamined, and a compressed trie implementation may be useful. Generally, the data structures used for Unicode should take advantage of Factor's new capabilities for packed memory structures, which were not available when the original library was written.
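One classic compact layout for per-code-point property tables is the two-stage table: a top-level index of fixed-size blocks, with identical blocks shared. A compressed trie generalizes the same idea to more levels. A Python sketch (the property values here are invented for illustration):

```python
BLOCK = 256  # code points per block

def build_two_stage(prop):
    """prop: dict mapping code point -> small int property value."""
    blocks, index, seen = [], [], {}
    for start in range(0, 0x110000, BLOCK):
        block = tuple(prop.get(cp, 0) for cp in range(start, start + BLOCK))
        if block not in seen:          # share identical blocks
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks

def lookup(index, blocks, cp):
    return blocks[index[cp // BLOCK]][cp % BLOCK]

# Mark two digit code points; almost every block collapses to one
# shared all-zero block, so storage stays tiny.
index, blocks = build_two_stage({0x30: 1, 0x660: 1})
print(lookup(index, blocks, 0x30), len(blocks))
```

With packed memory structures, the index and blocks become flat byte arrays rather than boxed tuples, which is the point of the optimization described above.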
Internationalization and localization
The UI and web frameworks could have a system for internationalization and localization where text strings are held in a resource file of some format. These resource files would be selected based on user request and seamlessly substituted in everywhere they are needed. The hard part will be designing the right API for using the localized strings, and a good format for the resource files.
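As a sketch of one possible shape for that API, here is a toy lookup in Python with locale fallback from a region-specific tag to its language to a default. The data, keys, and function names are all invented for illustration:

```python
# Toy in-memory "resource files", keyed by locale tag.
RESOURCES = {
    "en": {"greeting": "Hello"},
    "sv": {"greeting": "Hej"},
}

def localized(key, locale, default_locale="en"):
    """Look up a string, falling back sv-SE -> sv -> default."""
    for tag in (locale, locale.split("-")[0], default_locale):
        if tag in RESOURCES and key in RESOURCES[tag]:
            return RESOURCES[tag][key]
    raise KeyError(key)

print(localized("greeting", "sv-SE"))  # falls back from sv-SE to sv
```

A real design would also need plural rules and parameter substitution, both of which CLDR specifies.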
Value to the student
The student gains experience with internationalization and localization.
Value to the community
Factor would be better suited to writing applications which deal with non-ASCII text. All of the world's languages, including English, use non-ASCII text. Factor's Unicode support, already more advanced than that of most languages, would be world-class after these changes.
This revision created on Sat, 27 Feb 2010 16:34:43 by littledan
All content is © 2008-2010 by its respective authors. By adding content to this wiki, you agree to release it under the BSD license.