Skip past a UTF-8 encoded character
SLuchar_Type *SLutf8_skip_char (SLuchar_Type *u, SLuchar_Type *umax)
The SLutf8_skip_char function returns a pointer to the character immediately following the UTF-8 encoded character at u. It will make no attempt to examine the bytes at the position umax and beyond. If the bytes at u do not represent a valid or legal UTF-8 encoded sequence, a pointer to the byte following u will be returned.
Unicode combining characters are treated as distinct characters by this function.
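The forward-skip rule described above can be illustrated with a small stand-alone sketch. It is not the library's implementation: the name sketch_skip_char and the helper sketch_seq_len are hypothetical, the typedef is a stand-in for the one in slang.h, and the validity check is simplified to the 1 to 4 byte forms of RFC 3629 (lead byte plus continuation bytes only). Returning u unchanged when u is not below umax is also an assumption.

```c
#include <stddef.h>

typedef unsigned char SLuchar_Type;   /* stand-in for the slang.h typedef */

/* Length implied by a lead byte, or 0 if the byte cannot start a sequence.
 * Assumption: only the 1-4 byte forms of RFC 3629 are recognised. */
static size_t sketch_seq_len (SLuchar_Type c)
{
   if (c < 0x80) return 1;
   if ((c >= 0xC2) && (c <= 0xDF)) return 2;
   if ((c >= 0xE0) && (c <= 0xEF)) return 3;
   if ((c >= 0xF0) && (c <= 0xF4)) return 4;
   return 0;
}

/* Hypothetical sketch of the documented behaviour: skip one encoded
 * character, never reading at or beyond umax, and fall back to a
 * single byte when the sequence is invalid or truncated. */
static SLuchar_Type *sketch_skip_char (SLuchar_Type *u, SLuchar_Type *umax)
{
   size_t i, len;

   if (u >= umax)
     return u;                         /* assumption: nothing to skip */

   len = sketch_seq_len (*u);
   if ((len == 0) || ((size_t) (umax - u) < len))
     return u + 1;                     /* bad lead byte or truncated */

   for (i = 1; i < len; i++)
     {
        if ((u[i] & 0xC0) != 0x80)
          return u + 1;                /* not a continuation byte */
     }
   return u + len;
}
```

Because an invalid sequence advances the pointer by exactly one byte, a loop that repeatedly calls such a function always makes progress and terminates at umax.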
SLutf8_skip_chars, SLutf8_bskip_char, SLutf8_strlen
Skip past a specified number of characters in a UTF-8 encoded string
SLuchar_Type *SLutf8_skip_chars (u, umax, num, dnum, ignore_combining)
SLuchar_Type *u, *umax;
unsigned int num;
unsigned int *dnum;
int ignore_combining;
This function attempts to skip forward past num UTF-8 encoded characters at u, returning the actual number skipped via the parameter dnum. It will make no attempt to examine bytes at umax and beyond. Unicode combining characters will not be counted if ignore_combining is non-zero, otherwise they will be treated as distinct characters. If the input contains an invalid or illegal UTF-8 sequence, then each byte in the sequence will be treated as a single character.
SLutf8_skip_char, SLutf8_bskip_chars
Skip backward past a UTF-8 encoded character
SLuchar_Type *SLutf8_bskip_char (SLuchar_Type *umin, SLuchar_Type *u)
The SLutf8_bskip_char function skips backward to the start of the UTF-8 encoded character immediately before the position u. The function will make no attempt to examine characters before the position umin. Unicode combining characters are treated as distinct characters.
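One way to picture the backward skip is sketched below. This is a hypothetical stand-alone illustration, not the library's code: the sketch_ names are invented, the typedef stands in for slang.h, and only the 1 to 4 byte forms of RFC 3629 are recognised. Stepping back a single byte when no valid sequence ends exactly at u is an assumption modelled on the forward-skip rule.

```c
#include <stddef.h>

typedef unsigned char SLuchar_Type;   /* stand-in for the slang.h typedef */

/* Length implied by a lead byte, or 0 if it cannot start a sequence. */
static size_t sketch_seq_len (SLuchar_Type c)
{
   if (c < 0x80) return 1;
   if ((c >= 0xC2) && (c <= 0xDF)) return 2;
   if ((c >= 0xE0) && (c <= 0xEF)) return 3;
   if ((c >= 0xF0) && (c <= 0xF4)) return 4;
   return 0;
}

/* Forward skip of one character; one byte on invalid input. */
static SLuchar_Type *sketch_skip_char (SLuchar_Type *u, SLuchar_Type *umax)
{
   size_t i, len;

   if (u >= umax) return u;
   len = sketch_seq_len (*u);
   if ((len == 0) || ((size_t) (umax - u) < len)) return u + 1;
   for (i = 1; i < len; i++)
     if ((u[i] & 0xC0) != 0x80) return u + 1;
   return u + len;
}

/* Hypothetical sketch: find the start of the character ending at u
 * without reading any byte before umin. */
static SLuchar_Type *sketch_bskip_char (SLuchar_Type *umin, SLuchar_Type *u)
{
   SLuchar_Type *p;

   if (u <= umin)
     return umin;                      /* nothing before u */

   /* Step back over at most three continuation bytes. */
   p = u - 1;
   while ((p > umin) && ((u - p) < 4) && ((*p & 0xC0) == 0x80))
     p--;

   /* Accept p only if a valid sequence starting there ends exactly at u. */
   if (sketch_skip_char (p, u) == u)
     return p;

   return u - 1;                       /* assumption: treat as one byte */
}
```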
SLutf8_bskip_chars, SLutf8_skip_char
Skip backward past a specified number of UTF-8 encoded characters
SLuchar_Type *SLutf8_bskip_chars (umin, u, num, dnum, ignore_combining)
SLuchar_Type *umin, *u;
unsigned int num;
unsigned int *dnum;
int ignore_combining;
This function attempts to skip backward past num UTF-8 encoded characters occurring immediately before u. It returns the actual number skipped via the parameter dnum. No attempt will be made to examine the bytes occurring before umin. Unicode combining characters will not be counted if ignore_combining is non-zero, otherwise they will be treated as distinct characters. If the input contains an invalid or illegal UTF-8 sequence, then each byte in the sequence will be treated as a single character.
SLutf8_skip_char, SLutf8_bskip_chars
Decode a UTF-8 encoded character sequence
SLuchar_Type *SLutf8_decode (u, umax, w, nconsumedp)
SLuchar_Type *u, *umax;
SLwchar_Type *w;
unsigned int *nconsumedp;
The SLutf8_decode function decodes the UTF-8 encoded character occurring at u and returns the decoded character via the parameter w. No attempt will be made to examine the bytes at umax and beyond. If the parameter nconsumedp is non-NULL, then the number of bytes consumed by the function will be returned to it. If the sequence at u is invalid or illegal, the function will return NULL, with the number of bytes consumed equal to the size of the invalid sequence. Otherwise the function will return a pointer to the byte following the encoded sequence.
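The decoding contract (result through w, optional byte count through nconsumedp, NULL on invalid input) can be sketched as follows. This is an illustrative stand-in, not the library's decoder: sketch_decode is a hypothetical name, the typedefs are stand-ins for those in slang.h, and only the 1 to 4 byte forms of RFC 3629 are accepted. Reporting one consumed byte for an invalid sequence and zero when u is not below umax are simplifying assumptions.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned char SLuchar_Type;   /* stand-ins for the slang.h typedefs */
typedef uint32_t SLwchar_Type;

/* Hypothetical sketch of a bounded UTF-8 decoder. */
static SLuchar_Type *sketch_decode (SLuchar_Type *u, SLuchar_Type *umax,
                                    SLwchar_Type *w, unsigned int *nconsumedp)
{
   /* Smallest code point that legitimately needs len bytes. */
   static const SLwchar_Type min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
   unsigned int i, len = 0;
   SLwchar_Type wc = 0;
   int ok;

   if (nconsumedp != NULL) *nconsumedp = 0;
   if (u >= umax)
     return NULL;                      /* assumption: empty input fails */

   if (*u < 0x80)                { len = 1; wc = *u; }
   else if ((*u & 0xE0) == 0xC0) { len = 2; wc = *u & 0x1F; }
   else if ((*u & 0xF0) == 0xE0) { len = 3; wc = *u & 0x0F; }
   else if ((*u & 0xF8) == 0xF0) { len = 4; wc = *u & 0x07; }

   /* The whole sequence must lie below umax. */
   ok = (len != 0) && ((size_t) (umax - u) >= len);

   for (i = 1; ok && (i < len); i++)
     {
        ok = ((u[i] & 0xC0) == 0x80);
        wc = (wc << 6) | (u[i] & 0x3F);
     }

   /* Reject overlong forms, surrogates, and values above U+10FFFF. */
   if (ok)
     ok = (wc >= min_value[len]) && (wc <= 0x10FFFF)
       && !((wc >= 0xD800) && (wc <= 0xDFFF));

   if (!ok)
     {
        if (nconsumedp != NULL) *nconsumedp = 1;   /* simplification */
        return NULL;
     }

   *w = wc;
   if (nconsumedp != NULL) *nconsumedp = len;
   return u + len;
}
```

A caller that wants to continue after an error can advance by the reported byte count, which is why the count is useful even when NULL is returned.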
SLutf8_decode, SLutf8_strlen, SLutf8_skip_char
UTF-8 encode a character
SLuchar_Type *SLutf8_encode (w, u, ulen)
SLwchar_Type w;
SLuchar_Type *u;
unsigned int ulen;
This function UTF-8 encodes the Unicode character represented by w and stores the encoded representation in the buffer of size ulen bytes at u. The function will return NULL if the buffer is too small to hold the UTF-8 encoded character, otherwise it will return a pointer to the byte following the encoded representation.
This function does not null terminate the resulting byte sequence. The function SLutf8_encode_null_terminate may be used for that purpose.
To guarantee that the buffer is large enough to hold the encoded bytes, its size should be at least SLUTF8_MAX_BLEN bytes.
The function will encode illegal Unicode characters, i.e., characters in the range 0xD800-0xDFFF (the UTF-16 surrogates) and 0xFFFE-0xFFFF.
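The buffer-size check and the return convention can be illustrated with a short sketch. It is hypothetical and not the library's encoder: sketch_encode is an invented name, the typedefs stand in for those in slang.h, and the sketch stops at 4-byte sequences, returning NULL for larger values, which may differ from what the library does. Like the documented function, it performs no surrogate check and does not null terminate.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned char SLuchar_Type;   /* stand-ins for the slang.h typedefs */
typedef uint32_t SLwchar_Type;

/* Hypothetical sketch: encode w into at most ulen bytes at u. */
static SLuchar_Type *sketch_encode (SLwchar_Type w, SLuchar_Type *u,
                                    unsigned int ulen)
{
   /* Marker bits of the lead byte for a sequence of len bytes. */
   static const SLuchar_Type lead_mark[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
   unsigned int i, len;

   if (w < 0x80) len = 1;
   else if (w < 0x800) len = 2;
   else if (w < 0x10000) len = 3;
   else if (w < 0x200000) len = 4;
   else return NULL;                   /* sketch limitation, see above */

   if (ulen < len)
     return NULL;                      /* buffer too small */

   if (len == 1)
     {
        u[0] = (SLuchar_Type) w;
        return u + 1;
     }

   /* Fill the continuation bytes from the end, six bits at a time. */
   for (i = len - 1; i > 0; i--)
     {
        u[i] = (SLuchar_Type) (0x80 | (w & 0x3F));
        w >>= 6;
     }
   u[0] = (SLuchar_Type) (lead_mark[len] | w);

   return u + len;                     /* no null terminator is written */
}
```

The returned pointer makes it convenient to append several characters in a row: pass the previous return value as the next u and reduce ulen by the number of bytes already written.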
SLutf8_decode, SLutf8_encode_bytes, SLutf8_encode_null_terminate
Determine the number of characters in a UTF-8 sequence
unsigned int SLutf8_strlen (SLuchar_Type *s, int ignore_combining)
This function may be used to determine the number of characters represented by the null-terminated UTF-8 byte sequence s. If the ignore_combining parameter is non-zero, then Unicode combining characters will not be counted.
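The effect of ignore_combining on the count can be shown with a small sketch. It is a hypothetical illustration only: the sketch_ names are invented, the typedef stands in for slang.h, the decoder is simplified (no overlong or surrogate checks), and, as a loudly stated assumption, only U+0300 to U+036F (the Combining Diacritical Marks block) is treated as combining. The library's own notion of a combining character is likely broader.

```c
#include <stddef.h>

typedef unsigned char SLuchar_Type;   /* stand-in for the slang.h typedef */

/* Decode one character from a null-terminated string.  Returns the number
 * of bytes used; an invalid sequence yields 1 with *wc set to the byte.
 * The terminating 0 is never a continuation byte, so the scan stops there. */
static size_t sketch_decode_one (const SLuchar_Type *s, unsigned long *wc)
{
   size_t i, len;
   unsigned long w;

   if (*s < 0x80) { *wc = *s; return 1; }
   if ((*s & 0xE0) == 0xC0)      { len = 2; w = *s & 0x1F; }
   else if ((*s & 0xF0) == 0xE0) { len = 3; w = *s & 0x0F; }
   else if ((*s & 0xF8) == 0xF0) { len = 4; w = *s & 0x07; }
   else { *wc = *s; return 1; }

   for (i = 1; i < len; i++)
     {
        if ((s[i] & 0xC0) != 0x80) { *wc = *s; return 1; }
        w = (w << 6) | (s[i] & 0x3F);
     }
   *wc = w;
   return len;
}

/* Assumption: only the Combining Diacritical Marks block counts. */
static int sketch_is_combining (unsigned long wc)
{
   return (wc >= 0x0300) && (wc <= 0x036F);
}

/* Hypothetical sketch of the documented counting rule. */
static unsigned int sketch_strlen (const SLuchar_Type *s, int ignore_combining)
{
   unsigned int n = 0;

   while (*s != 0)
     {
        unsigned long wc;
        size_t len = sketch_decode_one (s, &wc);

        if (!(ignore_combining && sketch_is_combining (wc)))
          n++;
        s += len;
     }
   return n;
}
```

For the byte sequence 'e', 0xCC, 0x81 (the letter e followed by U+0301 COMBINING ACUTE ACCENT), this sketch counts two characters when ignore_combining is zero and one when it is non-zero.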
SLutf8_skip_chars, SLutf8_decode
Extract a UTF-8 encoded character
SLuchar_Type *SLutf8_extract_utf8_char (u, umax, buf)
SLuchar_Type *u, *umax, *buf;
This function extracts the bytes representing the UTF-8 encoded character at u, places them in the buffer buf, and then null terminates the result. The buffer is assumed to consist of at least SLUTF8_MAX_BLEN+1 bytes, where the extra byte may be necessary for null termination. No attempt will be made to examine the characters at umax and beyond. If the byte sequence at u is an illegal or invalid UTF-8 sequence, then the byte at u will be copied to the buffer. The function returns a pointer to the byte following the copied bytes.
One may think of this function as the UTF-8 analogue of the following single-byte code:
   if (u < umax)
     {
        buf[0] = *u++;
        buf[1] = 0;
     }
SLutf8_decode, SLutf8_skip_char
UTF-8 encode a character and null terminate the result
SLuchar_Type *SLutf8_encode_null_terminate (w, buf)
SLwchar_Type w;
SLuchar_Type *buf;
This function has the same functionality as SLutf8_encode, except that it also null terminates the encoded sequence. The buffer buf, where the encoded sequence is placed, is assumed to consist of at least SLUTF8_MAX_BLEN+1 bytes.
SLutf8_encode
Uppercase a UTF-8 encoded string
SLuchar_Type *SLutf8_strup (SLuchar_Type *u, SLuchar_Type *umax)
The SLutf8_strup function returns the uppercase equivalent of the UTF-8 encoded sequence of umax-u bytes at u. The result will be returned as a null-terminated SLstring and should be freed with SLang_free_slstring when it is no longer needed. If the function encounters an invalid or illegal byte sequence, then the byte sequence will be copied as-is.
SLutf8_strlo, SLwchar_toupper
Lowercase a UTF-8 encoded string
SLuchar_Type *SLutf8_strlo (SLuchar_Type *u, SLuchar_Type *umax)
The SLutf8_strlo function returns the lowercase equivalent of the UTF-8 encoded sequence of umax-u bytes at u. The result will be returned as a null-terminated SLstring and should be freed with SLang_free_slstring when it is no longer needed. If the function encounters an invalid or illegal byte sequence, then the byte sequence will be copied as-is.
SLutf8_strup, SLwchar_tolower
Replace a character in a UTF-8 encoded string
SLstr_Type *SLutf8_subst_wchar (u, umax, wch, nth, ignore_combining)
SLuchar_Type *u, *umax;
SLwchar_Type wch;
unsigned int nth;
int ignore_combining;
The SLutf8_subst_wchar function replaces the UTF-8 sequence representing the nth character of u by the UTF-8 representation of the character wch. If the value of the ignore_combining parameter is non-zero, then combining characters will not be counted when computing the position of the nth character. In addition, if the nth character contains any combining characters, then the byte sequence associated with those characters will also be replaced.
Since the byte sequence representing wch could be longer than the sequence of the nth character, the function returns a new copy of the resulting string as an SLstring. Hence, the calling function should call SLang_free_slstring when the result is no longer needed.
SLutf8_strup, SLutf8_strlo, SLutf8_skip_chars, SLutf8_strlen
Compare two UTF-8 encoded sequences
int SLutf8_compare (a, amax, b, bmax, nchars, case_sensitive)
SLuchar_Type *a, *amax;
SLuchar_Type *b, *bmax;
unsigned int nchars;
int case_sensitive;
This function compares up to nchars characters of one UTF-8 encoded character sequence to another by performing a character by character comparison. The function returns 0, +1, or -1 according to whether the string at a is equal to, greater than, or less than the string at b. At most nchars characters will be tested. The parameters amax and bmax serve as upper boundaries of the strings a and b, respectively.
If the value of the case_sensitive parameter is non-zero, then a case-sensitive comparison will be performed, otherwise characters will be compared in a case-insensitive manner.
For case-sensitive comparisons, this function is analogous to the standard C library's strncmp function. However, SLutf8_compare can also cope with invalid or illegal UTF-8 sequences.
SLutf8_strup, SLutf8_strlen