The S-Lang C Library Reference: Functions dealing with UTF-8 encoded strings

1. Functions dealing with UTF-8 encoded strings

1.1 SLutf8_skip_char

Synopsis: Skip past a UTF-8 encoded character
Usage: SLuchar_Type *SLutf8_skip_char (SLuchar_Type *u, SLuchar_Type *umax)
Description: The SLutf8_skip_char function returns a pointer to the character immediately following the UTF-8 encoded character at u. It will make no attempt to examine the bytes at the position umax and beyond. If the bytes at u do not represent a valid or legal UTF-8 encoded sequence, a pointer to the byte following u will be returned.
Notes: Unicode combining characters are treated as distinct characters by this function.
See Also: SLutf8_skip_chars, SLutf8_bskip_char, SLutf8_strlen
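
The forward-skip rule described above can be illustrated with a small standalone sketch. It does not call the S-Lang library and is not its implementation: the names sketch_lead_len and sketch_skip_char are hypothetical, plain unsigned char stands in for SLuchar_Type, and the sketch assumes sequences of at most four bytes without checking for every overlong form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead byte,
 * or 0 if the byte cannot start a sequence. */
static size_t sketch_lead_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Return a pointer just past the character at u, never reading at or
 * beyond umax. An invalid or truncated sequence advances by one byte. */
const unsigned char *sketch_skip_char(const unsigned char *u,
                                      const unsigned char *umax)
{
    size_t len, i;

    assert(u != NULL && u <= umax);
    if (u == umax)
        return u;                       /* nothing left to skip */

    len = sketch_lead_len(*u);
    if (len == 0 || (size_t)(umax - u) < len)
        return u + 1;                   /* bad lead byte or truncated */

    for (i = 1; i < len; i++)
        if ((u[i] & 0xC0) != 0x80)      /* not a 10xxxxxx continuation */
            return u + 1;

    return u + len;
}
```

Calling the sketch repeatedly until the returned pointer reaches umax walks a buffer one character at a time, with each malformed byte stepped over individually.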

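1.2 SLutf8_skip_chars

Synopsis

Skip past a specified number of UTF-8 encoded characters

Usage

SLuchar_Type *SLutf8_skip_chars (u, umax, num, dnum, ignore_combining)


   SLuchar_Type *u, *umax;
   unsigned int num;
   unsigned int *dnum;
   int ignore_combining;

Description

This function attempts to skip forward past num UTF-8 encoded characters at u, returning the actual number skipped via the parameter dnum. It will make no attempt to examine bytes at umax and beyond. Unicode combining characters will not be counted if ignore_combining is non-zero; otherwise they will be treated as distinct characters. If the input contains an invalid or illegal UTF-8 sequence, then each byte in the sequence will be treated as a single character.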

See Also

SLutf8_skip_char, SLutf8_bskip_chars

1.3 SLutf8_bskip_char

Synopsis: Skip backward past a UTF-8 encoded character
Usage: SLuchar_Type *SLutf8_bskip_char (SLuchar_Type *umin, SLuchar_Type *u)
Description: The SLutf8_bskip_char function skips backward to the start of the UTF-8 encoded character immediately before the position u. The function will make no attempt to examine bytes before the position umin. Unicode combining characters are treated as distinct characters.
See Also: SLutf8_bskip_chars, SLutf8_skip_char
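
Backward skipping relies on the fact that UTF-8 continuation bytes have the form 10xxxxxx, so the start of a character can be found by stepping back over them. The following standalone sketch illustrates that idea under stated assumptions: sketch_bskip_char and bskip_lead_len are hypothetical names, sequences are assumed to be at most four bytes long, and a byte that does not close a well-formed sequence is treated as a single character. It is not the library's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead byte,
 * or 0 if the byte cannot start a sequence. */
static size_t bskip_lead_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Return the start of the character that ends immediately before u,
 * never reading before umin. */
const unsigned char *sketch_bskip_char(const unsigned char *umin,
                                       const unsigned char *u)
{
    const unsigned char *p;

    assert(umin != NULL && umin <= u);
    if (u == umin)
        return u;                       /* nothing before u */

    p = u - 1;
    /* Step back over at most three continuation bytes. */
    while (p > umin && (*p & 0xC0) == 0x80 && (u - p) < 4)
        p--;

    /* Accept p only if a sequence starting there ends exactly at u. */
    if (bskip_lead_len(*p) == (size_t)(u - p))
        return p;

    return u - 1;                       /* malformed: back up one byte */
}
```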

1.4 SLutf8_bskip_chars

Synopsis

Skip backward past a specified number of UTF-8 encoded characters

Usage

SLuchar_Type *SLutf8_bskip_chars (umin, u, num, dnum, ignore_combining)


   SLuchar_Type *umin, *u;
   unsigned int num;
   unsigned int *dnum;
   int ignore_combining;

Description

This function attempts to skip backward past num UTF-8 encoded characters occurring immediately before u. It returns the actual number skipped via the parameter dnum. No attempt will be made to examine the bytes occurring before umin. Unicode combining characters will not be counted if ignore_combining is non-zero; otherwise they will be treated as distinct characters. If the input contains an invalid or illegal UTF-8 sequence, then each byte in the sequence will be treated as a single character.

See Also

SLutf8_skip_chars, SLutf8_bskip_char

1.5 SLutf8_decode

Synopsis

Decode a UTF-8 encoded character sequence

Usage

SLuchar_Type *SLutf8_decode (u, umax, w, nconsumedp)


   SLuchar_Type *u, *umax;
   SLwchar_Type *w;
   unsigned int *nconsumedp;

Description

The SLutf8_decode function decodes the UTF-8 encoded character occurring at u and returns the decoded character via the parameter w. No attempt will be made to examine the bytes at umax and beyond. If the parameter nconsumedp is non-NULL, then the number of bytes consumed by the function will be returned through it. If the sequence at u is invalid or illegal, the function will return NULL, and the number of bytes consumed will be equal to the size of the invalid sequence. Otherwise the function will return a pointer to the byte following the encoded sequence.

See Also

SLutf8_encode, SLutf8_strlen, SLutf8_skip_char
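
To make the decoding contract concrete, here is a standalone sketch of a decoder with the same shape of interface: it writes the code point through w, reports the bytes consumed through an optional pointer, and returns NULL on a bad sequence. The name sketch_decode is hypothetical, uint32_t stands in for SLwchar_Type, and two simplifications are assumed: only sequences of up to four bytes are handled, and an invalid sequence always reports one byte consumed rather than the full size of the invalid sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

const unsigned char *sketch_decode(const unsigned char *u,
                                   const unsigned char *umax,
                                   uint32_t *w, unsigned int *nconsumedp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    unsigned int len, i;
    uint32_t ch;

    assert(u != NULL && u < umax && w != NULL);

    /* The lead byte gives the length and the high payload bits. */
    if (u[0] < 0x80)                { len = 1; ch = u[0]; }
    else if ((u[0] & 0xE0) == 0xC0) { len = 2; ch = u[0] & 0x1F; }
    else if ((u[0] & 0xF0) == 0xE0) { len = 3; ch = u[0] & 0x0F; }
    else if ((u[0] & 0xF8) == 0xF0) { len = 4; ch = u[0] & 0x07; }
    else goto invalid;

    if ((size_t)(umax - u) < len)
        goto invalid;                   /* truncated by umax */

    /* Each continuation byte contributes six payload bits. */
    for (i = 1; i < len; i++) {
        if ((u[i] & 0xC0) != 0x80)
            goto invalid;
        ch = (ch << 6) | (u[i] & 0x3F);
    }

    if (ch < min_value[len] || ch > 0x10FFFF)
        goto invalid;                   /* overlong or out of range */

    *w = ch;
    if (nconsumedp != NULL)
        *nconsumedp = len;
    return u + len;

invalid:
    if (nconsumedp != NULL)
        *nconsumedp = 1;                /* simplification, see above */
    return NULL;
}
```

A caller that wants to keep going after an error can advance by the reported byte count and try again, which matches the documented purpose of nconsumedp.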

1.6 SLutf8_encode

Synopsis

UTF-8 encode a character

Usage

SLuchar_Type *SLutf8_encode (w, u, ulen)


   SLwchar_Type w;
   SLuchar_Type *u;
   unsigned int ulen;

Description

This function UTF-8 encodes the Unicode character represented by w and stores the encoded representation in the buffer of size ulen bytes at u. The function will return NULL if the buffer is too small to hold the UTF-8 encoded character; otherwise it will return a pointer to the byte following the encoded representation.

Notes

This function does not null terminate the resulting byte sequence. The function SLutf8_encode_null_terminate may be used for that purpose.

To guarantee that the buffer is large enough to hold the encoded bytes, its size should be at least SLUTF8_MAX_BLEN bytes.

The function will encode illegal Unicode characters, i.e., characters in the ranges 0xD800-0xDFFF (the UTF-16 surrogates) and 0xFFFE-0xFFFF.

See Also

SLutf8_decode, SLutf8_encode_bytes, SLutf8_encode_null_terminate
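
The buffer-size contract can be sketched in standalone C as follows. The name sketch_encode is hypothetical and uint32_t stands in for SLwchar_Type. As an assumption of this sketch only, values needing more than four bytes are rejected with NULL, which may differ from the library. Like the documented function, the sketch does not reject surrogates and does not null terminate its output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode w into the ulen-byte buffer at u. Returns a pointer just past
 * the encoded bytes, or NULL if the buffer is too small. */
unsigned char *sketch_encode(uint32_t w, unsigned char *u, unsigned int ulen)
{
    unsigned int len;

    assert(u != NULL);

    if (w < 0x80)          len = 1;
    else if (w < 0x800)    len = 2;
    else if (w < 0x10000)  len = 3;
    else if (w < 0x200000) len = 4;
    else return NULL;                   /* sketch stops at four bytes */

    if (ulen < len)
        return NULL;                    /* buffer too small */

    switch (len) {
    case 1:
        u[0] = (unsigned char) w;
        break;
    case 2:
        u[0] = (unsigned char) (0xC0 | (w >> 6));
        u[1] = (unsigned char) (0x80 | (w & 0x3F));
        break;
    case 3:
        u[0] = (unsigned char) (0xE0 | (w >> 12));
        u[1] = (unsigned char) (0x80 | ((w >> 6) & 0x3F));
        u[2] = (unsigned char) (0x80 | (w & 0x3F));
        break;
    default:
        u[0] = (unsigned char) (0xF0 | (w >> 18));
        u[1] = (unsigned char) (0x80 | ((w >> 12) & 0x3F));
        u[2] = (unsigned char) (0x80 | ((w >> 6) & 0x3F));
        u[3] = (unsigned char) (0x80 | (w & 0x3F));
        break;
    }
    return u + len;
}
```

Because the return value points just past the written bytes, a caller that wants a C string can store a 0 there, provided the buffer has one spare byte.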

1.7 SLutf8_strlen

Synopsis: Determine the number of characters in a UTF-8 sequence
Usage: unsigned int SLutf8_strlen (SLuchar_Type *s, int ignore_combining)
Description: This function may be used to determine the number of characters represented by the null-terminated UTF-8 byte sequence s. If the ignore_combining parameter is non-zero, then Unicode combining characters will not be counted.
See Also: SLutf8_skip_chars, SLutf8_decode
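
The effect of ignore_combining on the count can be seen in the standalone sketch below. It is an illustration rather than the library's code: sketch_strlen and strlen_decode_one are hypothetical names, only the Combining Diacritical Marks block (U+0300 to U+036F) is recognized as combining, overlong forms are not rejected, and each invalid byte is assumed to count as one character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the sequence at s into *w and return its
 * length in bytes, or 0 if it is malformed. The terminating 0 byte is
 * never a continuation byte, so the loop cannot read past it. */
static unsigned int strlen_decode_one(const unsigned char *s, uint32_t *w)
{
    unsigned int len, i;

    if (s[0] < 0x80) { *w = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; *w = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; *w = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; *w = s[0] & 0x07; }
    else return 0;

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *w = (*w << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Count the characters in the null-terminated sequence s. */
unsigned int sketch_strlen(const unsigned char *s, int ignore_combining)
{
    unsigned int count = 0;

    assert(s != NULL);
    while (*s != 0) {
        uint32_t w = 0;
        unsigned int len = strlen_decode_one(s, &w);

        if (len == 0) {                 /* invalid byte: one character */
            s++;
            count++;
            continue;
        }
        s += len;
        /* Assumption: only U+0300..U+036F is treated as combining. */
        if (ignore_combining && w >= 0x0300 && w <= 0x036F)
            continue;
        count++;
    }
    return count;
}
```

For the bytes of "e" followed by U+0301 (combining acute accent) and "x", the sketch counts three characters when ignore_combining is zero and two when it is non-zero.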

1.8 SLutf8_extract_utf8_char

Synopsis

Extract a UTF-8 encoded character

Usage

SLuchar_Type *SLutf8_extract_utf8_char (u, umax, buf)


   SLuchar_Type *u, *umax, *buf;

Description

This function extracts the bytes representing the UTF-8 encoded character at u, places them in the buffer buf, and then null terminates the result. The buffer is assumed to consist of at least SLUTF8_MAX_BLEN+1 bytes, where the extra byte may be necessary for null termination. No attempt will be made to examine the bytes at umax and beyond. If the byte sequence at u is an illegal or invalid UTF-8 sequence, then the byte at u will be copied to the buffer. The function returns a pointer to the byte following the copied bytes.

Notes

One may think of this function as the UTF-8 analogue of the following single-byte code:


     if (u < umax)
       {
          buf[0] = *u++;
          buf[1] = 0;
       }

See Also

SLutf8_decode, SLutf8_skip_char

1.9 SLutf8_encode_null_terminate

Synopsis

UTF-8 encode a character and null terminate the result

Usage

SLuchar_Type *SLutf8_encode_null_terminate (w, buf)


   SLwchar_Type w;
   SLuchar_Type *buf;

Description

This function has the same functionality as SLutf8_encode, except that it also null terminates the encoded sequence. The buffer buf, where the encoded sequence is placed, is assumed to consist of at least SLUTF8_MAX_BLEN+1 bytes.

See Also

SLutf8_encode

1.10 SLutf8_strup

Synopsis: Uppercase a UTF-8 encoded string
Usage: SLuchar_Type *SLutf8_strup (SLuchar_Type *u, SLuchar_Type *umax)
Description: The SLutf8_strup function returns the uppercase equivalent of the UTF-8 encoded sequence of umax-u bytes at u. The result will be returned as a null-terminated SLstring and should be freed with SLang_free_slstring when it is no longer needed. If the function encounters an invalid or illegal byte sequence, then the byte sequence will be copied as-is.
See Also: SLutf8_strlo, SLwchar_toupper

1.11 SLutf8_strlo

Synopsis: Lowercase a UTF-8 encoded string
Usage: SLuchar_Type *SLutf8_strlo (SLuchar_Type *u, SLuchar_Type *umax)
Description: The SLutf8_strlo function returns the lowercase equivalent of the UTF-8 encoded sequence of umax-u bytes at u. The result will be returned as a null-terminated SLstring and should be freed with SLang_free_slstring when it is no longer needed. If the function encounters an invalid or illegal byte sequence, then the byte sequence will be copied as-is.
See Also: SLutf8_strup, SLwchar_tolower

1.12 SLutf8_subst_wchar

Synopsis

Replace a character in a UTF-8 encoded string

Usage

SLstr_Type *SLutf8_subst_wchar (u, umax, wch, nth, ignore_combining)


   SLuchar_Type *u, *umax;
   SLwchar_Type wch;
   unsigned int nth;
   int ignore_combining;

Description

The SLutf8_subst_wchar function replaces the UTF-8 sequence representing the nth character of u by the UTF-8 representation of the character wch. If the value of the ignore_combining parameter is non-zero, then combining characters will not be counted when computing the position of the nth character. In addition, if the nth character contains any combining characters, then the byte-sequence associated with those characters will also be replaced.

Since the byte sequence representing wch could be longer than the sequence of the nth character, the function returns a new copy of the resulting string as an SLstring. Hence, the calling function should call SLang_free_slstring when the result is no longer needed.

See Also

SLutf8_strup, SLutf8_strlo, SLutf8_skip_chars, SLutf8_strlen

1.13 SLutf8_compare

Synopsis

Compare two UTF-8 encoded sequences

Usage

int SLutf8_compare (a, amax, b, bmax, nchars, case_sensitive)


   SLuchar_Type *a, *amax;
   SLuchar_Type *b, *bmax;
   unsigned int nchars;
   int case_sensitive;

Description

This function compares up to nchars characters of one UTF-8 encoded character sequence to another by performing a character-by-character comparison. The function returns 0, +1, or -1 according to whether the string at a is equal to, greater than, or less than the string at b. At most nchars characters will be tested. The parameters amax and bmax serve as upper boundaries of the strings a and b, respectively.

If the value of the case_sensitive parameter is non-zero, then a case-sensitive comparison will be performed, otherwise characters will be compared in a case-insensitive manner.

Notes

For case-sensitive comparisons, this function is analogous to the standard C library's strncmp function. However, SLutf8_compare can also cope with invalid or illegal UTF-8 sequences.

See Also

SLutf8_strup, SLutf8_strlen
