S-Lang Library C Programmer's Guide (v2.3.0): Unicode Support

4. Unicode Support

S-Lang has native support for the UTF-8 encoding of Unicode in a number of its interfaces, including the SLsmg screen management interface as well as the interpreter. UTF-8 is a variable-length multibyte encoding where Unicode characters are represented by one to six bytes. A technical description of the UTF-8 encoding is beyond the scope of this document; the reader is advised to look elsewhere for a more detailed specification of the encoding.

By default, the library's handling of UTF-8 is turned off. It may be enabled by a call to the SLutf8_enable function:


    int SLutf8_enable (int mode);

If the value of mode is 1, then the library will be put in UTF-8 mode. If the value of mode is 0, then the library will be initialized with UTF-8 support disabled. If the value is -1, then the mode will be determined in an OS-dependent manner, e.g., for Unix, the standard locale mechanism will be used. The return value of this function will be 1 if UTF-8 support was activated, or 0 if not.

The above function determines the UTF-8 state of the library as a whole. For some purposes it may be desirable to have more fine-grained control of the UTF-8 support. For example, one might be using the jed editor to view a UTF-8 encoded file, but the terminal associated with the editor may not support UTF-8. In such a case, one would want the SLsmg interface to be in UTF-8 mode but the lower-level SLtt interface not to be in UTF-8 mode. Hence, the following activation functions are also provided:


    int SLsmg_utf8_enable (int mode);
    int SLtt_utf8_enable (int mode);
    int SLinterp_utf8_enable (int mode);

Note that once one of these interface-specific functions has been called, any further calls to the umbrella function SLutf8_enable will have no effect on that interface. For this reason, it is best to call SLutf8_enable first, before calling any of the interface-specific functions.

Until support for Unicode is more widespread among users, it is expected that most users will still be using a national character set such as ASCII or ISO-8859-1. For example, ISO-8859-1 is a very widespread character set used on Usenet. As a result, applications will still have to provide support for such character sets. Unfortunately, there appears to be no best way to do this.

For the most part, the UTF-8 support should be largely transparent to the user. For example, the interpreter treats each multibyte character as a single character, which means that the user does not have to be concerned about the internal representation of a character. Rather, one must keep in mind the distinction between a character and a byte.
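The distinction between a character and a byte can be made concrete with a small sketch in plain C. It does not use the S-Lang API; the function utf8_char_count is a hypothetical illustration that assumes its input is valid UTF-8 and simply counts every byte that is not a continuation byte (a byte of the form 10xxxxxx).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration (not part of S-Lang): count the characters
 * in a valid, NUL-terminated UTF-8 string.  Every character starts with
 * exactly one byte that is not of the form 10xxxxxx, so counting those
 * bytes counts the characters. */
static size_t utf8_char_count (const char *s)
{
   size_t n = 0;

   while (*s != 0)
     {
        if (((unsigned char) *s & 0xC0) != 0x80)
          n++;
        s++;
     }
   return n;
}
```

For the string "na" "\xC3\xAF" "ve" (the word "naive" written with a diaeresis on the i), strlen reports 6 bytes while utf8_char_count reports 5 characters, because the accented letter occupies two bytes.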
