S-Lang has native support for the UTF-8 encoding of Unicode in a number of its interfaces, including the SLsmg screen management interface and the interpreter. UTF-8 is a variable-length multibyte encoding in which Unicode characters are represented by one to six bytes. A technical description of the UTF-8 encoding is beyond the scope of this document; the reader is advised to look elsewhere for a detailed specification of the encoding.
By default, the library's handling of UTF-8 is turned off. It may be enabled by a call to the SLutf8_enable function:
int SLutf8_enable (int mode);
If the value of mode is 1, then the library will be put in UTF-8 mode. If the value of mode is 0, then the library will be initialized with UTF-8 support disabled. If the value is -1, then the mode will be determined in an OS-dependent manner; e.g., on Unix, the standard locale mechanism will be used. The return value of this function will be 1 if UTF-8 support was activated, or 0 if not.
The above function determines the UTF-8 state of the library as a whole. For some purposes it may be desirable to have more fine-grained control of the UTF-8 support. For example, one might be using the jed editor to view a UTF-8 encoded file, but the terminal associated with the editor may not support UTF-8. In such a case, one would want the SLsmg interface to be in UTF-8 mode but the lower-level SLtt interface not to be in UTF-8 mode. Hence, the following activation functions are also provided:
int SLsmg_utf8_enable (int mode);
int SLtt_utf8_enable (int mode);
int SLinterp_utf8_enable (int mode);
Note that once one of these interface-specific functions has been called, any further calls to the umbrella function SLutf8_enable will have no effect on that interface. For this reason, it is best to call SLutf8_enable first, before calling any of the interface-specific functions.
Until support for Unicode is more widespread among users, it is expected that most users will still be using a national character set such as ASCII or ISO-8859-1. For example, ISO-8859-1 is a very widespread character set used on Usenet. As a result, applications will still have to provide support for such character sets. Unfortunately, there appears to be no best way to do this.
For the most part, the UTF-8 support should be largely transparent to the user. For example, the interpreter treats each multibyte character as a single character, which means that the user does not have to be concerned about the internal representation of a character. Rather, one must keep in mind the distinction between a character and a byte.
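The character/byte distinction can be illustrated in plain C, independently of S-Lang. The function utf8_char_count below is a hypothetical helper written for this illustration, not a library routine; it assumes its input is valid UTF-8 and counts characters by skipping continuation bytes (those of the form 10xxxxxx).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters in a valid UTF-8 string.
 * Every byte that is not a continuation byte (10xxxxxx) starts a new
 * character, so counting those bytes gives the character count.
 */
size_t utf8_char_count (const char *s)
{
   size_t n = 0;

   while (*s != 0)
     {
        if (((unsigned char) *s & 0xC0) != 0x80)
          n++;
        s++;
     }
   return n;
}
```

For the string "na" "\xC3\xAF" "ve" (the word "naive" with i-diaeresis encoded as two bytes), strlen reports 6 bytes whereas utf8_char_count reports 5 characters.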