- Subject: Re: iconv module documentation or examples?
- From: "John E. Davis" <davis@xxxxxxxxxxxxx>
- Date: Sun, 31 Aug 2008 11:59:21 -0400
Ethan Blanton <eblanton@xxxxxxxxxxxx> wrote:
> John E. Davis spake unto us the following wisdom:
>> For this reason, I think that something like this might work:
>>
>> define get_encoding ()
>> {
>>    if (_slang_utf8_ok) return "UTF-8";
>>    variable lang = getenv ("LANG");
>>    if (lang == NULL)
>>      return NULL;
>>    variable fields = strchop (lang, '.', 0);
>>    if (2 == length (fields))
>>      return fields[1];
>>    return NULL;
>> }
> Unfortunately, this won't be reliable, for several reasons; one,
> locale names are up to the system to at least some extent, so they can
> choose to stuff other information in that space (look at the list of
> locales on a non-Linux non-386BSD-derived system; they're often weird
> and wonderful). Two, even on systems with regular locale name synta
I had forgotten about extra optional fields in the LANG string. The
setlocale man page says:
A locale name is typically of the form language[_territory][.code-
set][@modifier], where language is an ISO 639 language code,
territory is an ISO 3166 country code, and codeset is a character
set or encoding identifier like ISO-8859-1 or UTF-8. For a list of
all supported locales, try "locale -a", cf. locale(1).
(Note the use of the word "typically" above, which, as you said,
indicates that the LANG variable need not be in the assumed form.)
So perhaps the
   if (2 == length (fields))
     return fields[1];
should be changed to
   if (2 == length (fields))
     return strchop (fields[1], '@', 0)[0];
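Putting the two pieces together, the revised function might look roughly like the sketch below. The helper name codeset_from_locale is hypothetical and exists only so the parsing can be tried on a literal string without touching the environment. It still assumes the "typical" language[_territory][.codeset][@modifier] form, so it returns NULL for names such as "C" that carry no codeset.

```slang
% Extract the codeset from a locale name of the typical form
% language[_territory][.codeset][@modifier].
% Returns NULL if the name is NULL or has no single ".codeset" part.
% (codeset_from_locale is a hypothetical helper name.)
define codeset_from_locale (lang)
{
   if (lang == NULL)
     return NULL;
   variable fields = strchop (lang, '.', 0);
   if (2 != length (fields))
     return NULL;
   % Drop an optional modifier such as "@euro".
   return strchop (fields[1], '@', 0)[0];
}

define get_encoding ()
{
   % If the interpreter is already in UTF-8 mode, trust that.
   if (_slang_utf8_ok) return "UTF-8";
   return codeset_from_locale (getenv ("LANG"));
}
```

With this, codeset_from_locale ("de_DE.ISO-8859-1@euro") yields "ISO-8859-1". The string is returned exactly as it appears in LANG, so it may still need to be mapped to whatever spelling the system iconv accepts.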
> x like above, the character set is not always present. The "C"
> locale, for example, is required to exist, and its associated
> character set will always (if I'm not mistaken) be whatever the
> current system calls ASCII ("ANSI_X3.4-1968" on recent glibc, "646" on
> Solaris, etc.). Finally, even when that string does represent the
> character set in some way, it may not be in the canonical form
> required by the system iconv. (GNU iconv is pretty liberal in what it
> accepts, but many other systems are much more strict. UTF-8, UTF8,
> and utf8 may not all be valid encodings on all systems, for example.)
Nevertheless it seems to me that the system iconv should support the
codeset specified in the locale, even though the exact form of the
codeset string is system-dependent.
I see this apparent lack of standardization as another argument for
the exclusive use, where practical, of the UTF-8 encoding.
Thanks,
--John