- Subject: [slang-users] Re: unicode (was Re: Minor error message change)
- From: Bart Oldeman <bartoldeman@xxxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 29 Aug 2004 15:03:38 +1200 (NZST)
Hi,
I've personally had to deal with some of the utf8 issues recently as
the maintainer of DOSEMU. It already internally converts things from the
DOS character set (say cp437) to an external character set (the LC_CTYPE
one by default) via Unicode. So I'm looking forward to slang-2.0, as right
now some cp437 characters always get lost.
John E. Davis wrote:
> I realize that
> converting from one character set to another is more or less a solved
> problem and as such, it is not an issue. But as Pavel pointed out the
> terminal (xterm, rxvt, etc) is the problem.
As far as I can see, "luit" solves a large part of this problem by
forcing the terminal to behave as specified in LC_CTYPE. xterm invokes
luit automatically, so with xterm -u8 but an LC_CTYPE corresponding to
ISO8859-1, it will still behave like a Latin-1 terminal.
> To allow me to deal with UTF-8 encoded files on a non-UTF-8 terminal, I
> have added the ability to turn on or off support for UTF-8 in the
> various slang layers.
I wonder if this is necessary. Unless I'm missing something, I would
personally only distinguish between a "plain 8 bit mode" (the only thing
Slang 1.x supports) and an LC_CTYPE mode, and not special-case UTF-8.
Internally it seems better to do everything using wchar_t, rather than
UTF-8. If LC_CTYPE is not UTF-8 but your strings are, then the C library
will be thoroughly confused...
Then, to display a UTF-8 file on a latin terminal you could:
1. convert UTF-8 to wchar_t at the moment you read the file.
2. work internally using wchar_t.
3. convert using wcrtomb to the LC_CTYPE set.
4. display the resulting string using SLsmg_write_string() or similar.
(Steps 3 and 4 could be combined if a wide-character version of
SLsmg_write_string() existed.)
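
Steps 1 to 3 could be sketched roughly as follows. This is only an
illustration under some assumptions: utf8_decode() and utf8_to_locale() are
hypothetical helpers, not S-Lang functions; wchar_t is assumed to hold
Unicode code points (as it does with glibc); and characters the LC_CTYPE
set cannot represent are simply replaced by '?'.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of S-Lang): decode one UTF-8 sequence
 * from s (at most n bytes) into *out.  Returns the number of bytes
 * consumed, or 0 on malformed input.  Overlong forms and surrogates are
 * not rejected in this sketch.  Assumes wchar_t holds code points. */
size_t utf8_decode(const unsigned char *s, size_t n, wchar_t *out)
{
    unsigned long cp;
    size_t len, i;

    assert(s != NULL && out != NULL);
    if (n == 0) return 0;
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or bad lead byte */
    if (len > n) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *out = (wchar_t)cp;
    return len;
}

/* Hypothetical helper: UTF-8 in, bytes in the current LC_CTYPE set out.
 * Unrepresentable characters become '?'.  Output is truncated if it does
 * not fit and is always NUL-terminated.  Returns the number of bytes
 * written (excluding the NUL), or (size_t)-1 if the input is not UTF-8. */
size_t utf8_to_locale(const char *in, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t left = strlen(in), used = 0;
    char buf[MB_LEN_MAX];
    mbstate_t st;

    assert(out != NULL && outsz > 0);
    memset(&st, 0, sizeof st);
    while (left > 0) {
        wchar_t wc;
        size_t k = utf8_decode(s, left, &wc);   /* step 1: UTF-8 -> wchar_t */
        size_t m;

        if (k == 0) return (size_t)-1;
        m = wcrtomb(buf, wc, &st);              /* step 3: wchar_t -> LC_CTYPE */
        if (m == (size_t)-1) {                  /* not representable */
            buf[0] = '?';
            m = 1;
            memset(&st, 0, sizeof st);          /* state is unspecified after an error */
        }
        if (used + m >= outsz) break;           /* keep room for the NUL */
        memcpy(out + used, buf, m);
        used += m;
        s += k;
        left -= k;
    }
    out[used] = '\0';
    return used;
}
```

A caller would do setlocale(LC_CTYPE, "") once at startup and then hand the
converted buffer to SLsmg_write_string() for step 4. In a real editor step 2
would keep the wchar_t data around instead of converting immediately.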
This means that it is nowhere necessary to call nl_langinfo(CODESET), and
nowhere do you actually need to *know* whether the current set is
multibyte, UTF-8 or whatever else.
However, what I personally miss in the patched Slang that various Linux
distributors are shipping is a way to do a "straight through"
SLsmg_write_string(): LC_CTYPE is the only mode it offers. For DOSEMU I'd
like to be able to do this if I can switch the terminal into cp437 mode (as
is possible on the Linux console).
So suppose I were able to do:
SLsmg_8bit_enable (1);
then the internal representation could collapse into the current (1.x)
one, using only an 8-bit character and colour inside the structure,
without any UCS conversion.
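
To illustrate what I mean by an 8-bit cell, here is a rough sketch. All
names in it (cell8_t, cell8_make, cell8_fill_row and so on) are made up for
the example, and the layout is only a guess at what such a cell could look
like, not S-Lang's actual definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit-mode cell: character byte in the low 8 bits,
 * colour index in the high 8 bits.  No Unicode value is stored. */
typedef uint16_t cell8_t;

cell8_t cell8_make(unsigned char ch, unsigned char colour)
{
    return (cell8_t)(((unsigned)colour << 8) | ch);
}

unsigned char cell8_char(cell8_t c)   { return (unsigned char)(c & 0xFF); }
unsigned char cell8_colour(cell8_t c) { return (unsigned char)(c >> 8); }

/* "Straight through" write: copy n bytes into a row of cells unchanged,
 * so a cp437 byte reaches the terminal exactly as DOS produced it. */
void cell8_fill_row(cell8_t *row, const unsigned char *bytes, size_t n,
                    unsigned char colour)
{
    size_t i;

    assert(row != NULL && bytes != NULL);
    for (i = 0; i < n; i++)
        row[i] = cell8_make(bytes[i], colour);
}
```

With something like this, a cp437 byte such as 0xDB (the full block) is
stored and later emitted untouched, which is all DOSEMU needs once the
console itself is in cp437 mode.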
Bart
_______________________________________________
To unsubscribe, visit http://jedsoft.org/slang/mailinglists.html