Factor/GSoC/2010/Improve Unicode library

Mentor

Daniel Ehrenberg

Skills required

Knowledge of at least one non-English language would be nice
Non-ASCII, international character sets in general
Reading specifications

Skill level

Advanced

Technical outline

Factor's Unicode library lives in the basis/unicode/ directory, with encoding support in core/io/encodings/ and basis/io/encodings/. The library is largely complete, but a few tasks of varying complexity remain to be implemented. We would expect a student to at least attempt all of them over the summer, but depending on the student's skills, completing only a portion of the tasks would be acceptable.

Normalized output streams

We want a normalized-stream type which wraps an underlying character stream and converts its output to normalization form C or normalization form D. Support for normalization is already in unicode.normalization, but the stream itself still needs to be written.
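The main difficulty is that a combining mark may arrive in a later write than the base character it belongs to, so the wrapper cannot normalize each write on its own. Below is a minimal sketch of the idea in Python (not Factor), covering form D only; the class name NFDWriter and the method finish are hypothetical. It holds back everything from the last starter (combining class 0) onward until more text arrives. Form C would need a stricter boundary test, since some starters compose with the starter that follows them.

```python
import io
import unicodedata


class NFDWriter:
    """Hypothetical wrapper that writes NFD-normalized text to `stream`."""

    def __init__(self, stream):
        self.stream = stream
        self.pending = ""  # already-decomposed text that may still be reordered

    def write(self, text):
        # Re-normalizing pending + text is safe because NFD is idempotent.
        buf = unicodedata.normalize("NFD", self.pending + text)
        # Find the last starter; marks written later can only reorder
        # among the non-starters that follow it.
        cut = 0
        for i in range(len(buf) - 1, -1, -1):
            if unicodedata.combining(buf[i]) == 0:
                cut = i
                break
        self.stream.write(buf[:cut])
        self.pending = buf[cut:]

    def finish(self):
        # Call once at the end of output: emit whatever is still held back.
        self.stream.write(self.pending)
        self.pending = ""
        self.stream.flush()


out = io.StringIO()
writer = NFDWriter(out)
writer.write("e")            # base letter
writer.write("\u0301")       # combining acute (class 230), separate write
writer.write("\u0323x")      # combining dot below (class 220), then a new starter
writer.finish()
result = out.getvalue()
```

Here the dot below is reordered in front of the acute accent even though the two marks arrived in different writes.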

Unicode number input

Unicode defines various code points for digits other than the usual ASCII 0-9. For example, many Indic scripts define their own exact equivalents of 0-9, and these are still used in certain contexts. Parsing numbers written with these code points would be useful. Perhaps this could even be integrated with the roman library for a high-level number parser.
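As an illustration of what such a parser does, here is a small Python sketch (not Factor) that relies on the decimal digit value recorded in the Unicode Character Database. The function name parse_decimal is hypothetical, and the sketch handles unsigned integers only.

```python
import unicodedata


def parse_decimal(text):
    """Parse an unsigned integer written with decimal digits of any script."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        # unicodedata.decimal returns the digit value, or the default if
        # the character is not a decimal digit.
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError("not a decimal digit: %r" % ch)
        value = value * 10 + digit
    return value


devanagari = parse_decimal("\u0967\u0968\u0969")   # Devanagari digits 1, 2, 3
arabic_indic = parse_decimal("\u0664\u0662")       # Arabic-Indic digits 4, 2
```

A real implementation might also want to reject numbers that mix digits from several scripts; this sketch accepts them.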

Encodings

The encoding API for converting strings to byte arrays and vice versa is mostly done (http://docs.factorcode.org/content/article-io.encodings.html), but the following remain:

The ISO2022 encoding, used for Japanese, is missing.
Implement heuristics to auto-detect encodings. This is always unreliable but useful for client applications, such as an e-mail client, where you can auto-detect the encoding and give the user an option to change it manually.
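One very simple detection heuristic is sketched below in Python (not Factor): look for a byte order mark, then test whether the bytes are valid UTF-8, and otherwise fall back to a single-byte guess. The function name guess_encoding and the choice of Latin-1 as the fallback are assumptions made for illustration; a practical detector would also weigh byte frequencies and handle UTF-32 and the East Asian encodings.

```python
import codecs


def guess_encoding(data):
    """Return a best-guess codec name for `data` (a bytes object)."""
    # 1. A byte order mark is the strongest hint available.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # 2. Text in a legacy 8-bit encoding is rarely valid UTF-8 by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Fallback (an assumption): Latin-1 decodes any byte sequence,
    #    so the user can at least see something and correct the choice.
    return "latin-1"


utf8_guess = guess_encoding("h\u00e9llo".encode("utf-8"))
latin1_guess = guess_encoding("h\u00e9llo".encode("latin-1"))
```

Because the last step can be wrong, a client should treat the result as a default that the user may override.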

Line and sentence breaks

The Unicode standard specifies complex rules for detecting the boundaries between sentences, and for finding possible and mandatory line breaks. These rules also have modifications for different locales, some of which are quite complex. For example, to detect possible line breaks in Thai, a dictionary is needed, since spaces are not normally used between words but a line break must not fall inside a word. Detecting sentence boundaries is useful for navigation in a text editor. Detecting line breaks is useful for rendering text.
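To make the dictionary requirement concrete, the Python sketch below (not Factor) finds break opportunities in unspaced text by greedy longest match against a word list. It is a toy under stated assumptions: the function name break_opportunities is hypothetical, the example uses unspaced English in place of Thai, and greedy matching is only one possible strategy, not the rule set given by the Unicode standard.

```python
def break_opportunities(text, dictionary):
    """Return the offsets in `text` where a line break would be allowed."""
    longest = max(map(len, dictionary), default=1)
    breaks = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate word first, then shorter ones.
        for size in range(min(longest, len(text) - pos), 0, -1):
            if text[pos:pos + size] in dictionary:
                break
        else:
            # No dictionary word starts here: step over one character.
            size = 1
        pos += size
        if pos < len(text):
            breaks.append(pos)
    return breaks


toy_words = {"the", "cat", "sat"}          # hypothetical dictionary
offsets = break_opportunities("thecatsat", toy_words)
```

The returned offsets mark the boundaries after "the" and after "cat"; no break is reported at the end of the text.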

BIDI

To support right-to-left scripts like Hebrew and Arabic in Factor's editor widget, Factor needs to implement the Unicode Bidirectional Algorithm. This algorithm specifies how text that mixes left-to-right and right-to-left scripts is ordered for display.
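The first step of the algorithm is to decide the base direction of a paragraph from its first strongly directional character. The Python sketch below (not Factor) shows that step only; the function name base_direction is hypothetical, and the sketch ignores the special treatment of isolate characters as well as all of the later resolution and reordering phases.

```python
import unicodedata


def base_direction(paragraph):
    """Return "ltr" or "rtl" based on the first strong character."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
    # No strong character at all: default to left-to-right.
    return "ltr"


latin_first = base_direction("abc \u05d0")                  # Latin, then Hebrew alef
hebrew_first = base_direction("\u05e9\u05dc\u05d5\u05dd abc")
digits_first = base_direction("123 \u0627")                 # digits are not strong
```

Digits and spaces are skipped because they are weak or neutral, so the third example is decided by the Arabic letter.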

Tailoring and CLDR

Many Unicode algorithms, such as collation and word break detection, should act differently in different locales. For example, Swedes sort ö after z, whereas Germans sort it before p. The Common Locale Data Repository (CLDR) has information about how these algorithms should be tailored to fit different locales. The CLDR also has information about, for example, standard date formats in different locales, and certain pieces of text which are commonly localized in applications. It would be a huge benefit to Factor applications to have access to this information.
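The Swedish and German example can be shown with two toy sort keys in Python (not Factor). Both key functions are hypothetical and handle only lowercase words; real tailoring works on multi-level collation weights loaded from CLDR data, which this sketch does not attempt.

```python
# Swedish alphabet order: a-z followed by å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    # Unknown characters sort after the whole alphabet.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word]


def german_key(word):
    # Toy dictionary-style rule: ö is ordered like o.
    return word.replace("\u00f6", "o")


words = ["zon", "\u00f6l", "oxe"]
swedish_order = sorted(words, key=swedish_key)
german_order = sorted(words, key=german_key)
```

The same three words come out in a different order under each key, which is the behaviour a locale-aware collation API would need to expose.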

Performance

New and old parts of the Unicode library will need to be optimized for performance. In particular, the choices of data structures will have to be reexamined, and a compressed trie implementation may prove useful. Generally, data structures used for Unicode should take advantage of Factor's new capabilities for packed memory structures, which were not available when the original library was written.
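One common form of compressed trie for code point properties is a two-stage table: the code space is cut into fixed-size blocks, identical blocks are stored once, and a small index maps each block number to its stored copy. The Python sketch below (not Factor) builds such a table for the canonical combining class; the function names and the block size of 256 are assumptions, and the input length is assumed to be a multiple of the block size.

```python
import unicodedata


def build_two_stage(values, block_size=256):
    """Compress `values` into (stage1, stage2) by sharing identical blocks."""
    seen = {}      # block contents -> index of the stored block
    stage1 = []    # one entry per block of code points
    stage2 = []    # concatenation of the distinct blocks
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(stage2) // block_size
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, code_point, block_size=256):
    block_index = stage1[code_point // block_size]
    return stage2[block_index * block_size + code_point % block_size]


# Property to compress: combining class of every code point.
values = [unicodedata.combining(chr(cp)) for cp in range(0x110000)]
stage1, stage2 = build_two_stage(values)
```

Most blocks are entirely zero for this property, so they collapse into a single shared block and the second stage is far smaller than the flat table.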

Internationalization framework

The UI and web frameworks could have a system for internationalization and localization where text strings are held in a resource file of some format. The appropriate resource file should be selected according to the user's requested locale, and its strings seamlessly substituted wherever they are needed. The hard part will be designing the right API for using the localized strings, and a good format for the resource files.
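As a starting point for the API discussion, here is one possible lookup scheme sketched in Python (not Factor): each locale has a catalog of key to string, and a lookup falls back from the full locale to its language and then to a default. Everything here is hypothetical, including the catalog contents, the function name translate, and the use of in-memory dictionaries in place of resource files.

```python
# Hypothetical catalogs; in practice these would be loaded from resource files.
CATALOGS = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "de": {"greeting": "Hallo"},
}


def translate(key, locale, catalogs, default_locale="en"):
    """Look up `key`, falling back from e.g. "de-AT" to "de" to the default."""
    candidates = [locale, locale.split("-")[0], default_locale]
    for name in candidates:
        text = catalogs.get(name, {}).get(key)
        if text is not None:
            return text
    # Last resort: show the key itself so the gap is visible.
    return key


austrian_greeting = translate("greeting", "de-AT", CATALOGS)
german_farewell = translate("farewell", "de", CATALOGS)
```

The second lookup falls through to the English catalog because the German one has no entry for that key.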

Value to the student

The student gains experience with internationalization and localization.

Value to the community

Factor would be better suited to writing applications which deal with non-ASCII text. All of the world's languages, including English, use non-ASCII text. Factor's Unicode support, which is already more advanced than that of most languages, would be world-class after these changes.

This revision created on Sat, 27 Feb 2010 16:34:43 by littledan
