Factor/GSoC/2009/Improve Unicode library

Skills required

Knowledge of at least one non-English language would be nice
Non-ASCII, international character sets in general
Reading specifications

Skill level

Advanced

Technical outline

Factor's Unicode library lives in the basis/unicode/ directory, with encoding support in core/io/encodings/ and basis/io/encodings/. The library is largely complete, but a few tasks of varying complexity remain to be implemented. We would expect a student to at least attempt all of them over the summer, but depending on the student's skills, completing only the easy tasks would still be acceptable.

Easy tasks

Normalized output streams

We want a normalized-stream type that wraps an underlying character stream and converts output to normalization form C or normalization form D. Support for normalization is already in unicode.normalization, but the stream wrapper has yet to be written.
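The wrapper idea can be sketched outside Factor. The following Python illustration is not the proposed Factor API: the class name NormalizedStream and its methods are hypothetical, and normalization is delegated to Python's standard unicodedata module. It buffers text until flush() so that a base character and a combining mark arriving in separate writes are still normalized together.

```python
import io
import unicodedata


class NormalizedStream:
    """Hypothetical wrapper that normalizes text before passing it on."""

    def __init__(self, stream, form="NFC"):
        if form not in ("NFC", "NFD"):
            raise ValueError("form must be 'NFC' or 'NFD'")
        self._stream = stream
        self._form = form
        self._pending = []  # chunks written since the last flush

    def write(self, text):
        # Hold the text back: a later write may add combining marks
        # that interact with the end of this chunk.
        self._pending.append(text)
        return len(text)

    def flush(self):
        # Normalize everything buffered so far as one string, then emit it.
        if self._pending:
            joined = "".join(self._pending)
            self._stream.write(unicodedata.normalize(self._form, joined))
            self._pending.clear()
        self._stream.flush()


out = io.StringIO()
stream = NormalizedStream(out, "NFC")
stream.write("e")
stream.write("\u0301")  # COMBINING ACUTE ACCENT, sent separately
stream.flush()
print(out.getvalue() == "\u00e9")
```

Buffering up to flush() keeps the sketch short, but text flushed before a trailing combining mark arrives is not revisited. A real implementation would more likely emit output incrementally at boundaries where normalization cannot be affected by later input.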

Unicode number input

Unicode defines various code points for digits other than the usual ASCII 0..9. Parsing numbers with these code points would be useful. Perhaps this could even be integrated with the roman library for a high-level number parser.
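As a rough illustration of the parsing step (in Python rather than Factor, with the hypothetical function name parse_unicode_integer), each character's decimal digit value can be looked up in the Unicode Character Database and accumulated in base 10. The sketch handles only non-negative integers and does not check that all digits come from the same script.

```python
import unicodedata


def parse_unicode_integer(text):
    """Parse a non-negative integer written with any Unicode decimal digits."""
    if not text:
        raise ValueError("empty input")
    value = 0
    for ch in text:
        # unicodedata.decimal returns the digit value for characters with
        # a decimal digit property, or the default (None) otherwise.
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError("not a decimal digit: %r" % ch)
        value = value * 10 + digit
    return value


print(parse_unicode_integer("\u0661\u0662\u0663"))  # Arabic-Indic digits
print(parse_unicode_integer("\u096a\u096b"))        # Devanagari digits
```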

Encodings

The encoding API for converting strings to byte arrays and vice versa is mostly done (http://docs.factorcode.org/content/article-io.encodings.html). Two items remain:

The ISO 2022 encoding, used for Japanese, is missing.
Implement heuristics to auto-detect encodings. Detection is always unreliable, but it is useful for client applications such as an e-mail client, which can auto-detect the encoding and give the user the option to change it manually.
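One possible shape for such a heuristic, sketched in Python rather than Factor: check for a byte order mark, then for the escape sequences ISO-2022-JP uses to switch into JIS X 0208, then try strict UTF-8, and otherwise fall back to a caller-supplied default. The function name guess_encoding, the order of the checks, and the latin-1 fallback are all assumptions made for illustration. The returned names are Python codec names.

```python
import codecs

# The UTF-32 LE mark begins with the UTF-16 LE mark, so UTF-32 is tested first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Escape sequences that switch ISO-2022-JP into JIS X 0208 (1978 and 1983).
_ISO2022_JP_ESCAPES = (b"\x1b$@", b"\x1b$B")


def guess_encoding(data, fallback="latin-1"):
    """Return a best-guess codec name for the given bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # ISO-2022-JP is a 7-bit encoding, so every byte must be below 0x80.
    seven_bit = all(byte < 0x80 for byte in data)
    if seven_bit and any(esc in data for esc in _ISO2022_JP_ESCAPES):
        return "iso2022_jp"
    try:
        data.decode("utf-8")  # strict decoding rejects most legacy 8-bit text
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


print(guess_encoding("h\u00e9llo".encode("utf-8")))
print(guess_encoding("h\u00e9llo".encode("latin-1")))
```

Because any guess can be wrong, an application would present the result as a default that the user can override, as described above.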

Complex tasks

Line and sentence breaks

The algorithms are detailed in the Unicode 5.1 specification: line breaking in Unicode Standard Annex #14 and sentence breaking in Unicode Standard Annex #29.

BIDI

Support for bidirectional text, as specified by the Unicode Bidirectional Algorithm (Unicode Standard Annex #9).
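To give a sense of the smallest building block involved, the Python sketch below (not Factor, and with the hypothetical name base_direction) guesses a paragraph's base direction from its first strongly directional character, which is roughly what the first paragraph-level rules of the Unicode Bidirectional Algorithm do. It ignores explicit directional formatting characters and performs none of the level resolution or reordering that full support would need.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Guess paragraph direction from the first strong bidi character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (e.g. Hebrew, Arabic)
            return "rtl"
    # No strong character found (e.g. only digits or punctuation).
    return default


print(base_direction("abc"))
print(base_direction("123 \u05e9\u05dc\u05d5\u05dd"))  # digits, then Hebrew
```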

CLDR

What do we need from here?

Tailoring

Collation tailoring, break tailoring

Other

Performance/cleanup/better data structures

Value to the student

The student gains experience with internationalization and localization.

Value to the community

Factor's Unicode support, which is already more advanced than that of most languages, would be world-class after these changes.

This revision created on Fri, 13 Mar 2009 04:13:31 by slava
