- Subject: Re: UTF-8 and Regular Expressions
- From: "G. Milde" <milde@xxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 18 Apr 2007 09:19:52 +0200
On 11.04.07, John E. Davis wrote:
> Hi,
> For those of you that use jed with UTF-8 encoded text, has the lack of
> true UTF-8 support by the regular expression functions been much of an
> impediment? For example, the regular expression "." matches a single
> character, which in the UTF-8 encoding could consist of several bytes.
> However, the slang regular expression code has no knowledge of
> UTF-8, and as a result "." will match exactly one byte.
Does this also mean that a pattern like "[aä]" will fail if the "ä" is a
2-byte UTF-8 character?
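To make the question concrete, here is a minimal sketch in Python rather than S-Lang. Python's `re` module applied to byte strings is only a stand-in for a byte-oriented engine; it is not the slang regexp code, and the variable names are made up for illustration. Under byte semantics the class "[aä]" turns into a set of three single bytes, so it can only ever match one half of a 2-byte "ä":

```python
import re

A_UMLAUT = "\u00e4"                      # "ä", encoded in UTF-8 as b'\xc3\xa4'
text = A_UMLAUT.encode("utf-8")

# Byte semantics: the class contains the bytes 'a', 0xC3 and 0xA4.
byte_class = re.compile(("[a" + A_UMLAUT + "]").encode("utf-8"))
byte_hits = byte_class.findall(text)     # every hit is a single byte
whole_char = byte_class.fullmatch(text)  # the class cannot cover both bytes

# Character semantics: the class contains the characters 'a' and 'ä'.
char_class = re.compile("[a" + A_UMLAUT + "]")
char_hits = char_class.findall(A_UMLAUT)

print(byte_hits, whole_char, char_hits)
```

So in this stand-in the class does not fail outright, but it matches the two bytes of "ä" separately instead of the character as a whole.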
...
> I plan to integrate PCRE with jed in the near future.
...
Good news. Would this also affect the DFA highlight patterns?
> In testing, one problem has come up: When used in UTF-8 mode, PCRE
> cannot tolerate malformed text. This can be a problem when jed is
> running in UTF-8 mode, but one is editing text in some other encoding,
> e.g., ISO-Latin-1.
However, when editing text in Jed-U, "the right thing" would be to
convert it transparently to UTF-8 in a find_file_hook and convert it back
when saving (analogous to compress.sl).
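As a rough sketch of that round trip (in Python, not S-Lang; the function names and the validity check are assumptions for illustration, not existing jed hooks): a Latin-1 file is malformed when read as UTF-8, so it is converted on load and converted back on save.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def load_as_utf8(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Hypothetical load step: hand the editor UTF-8, converting if needed."""
    if is_valid_utf8(raw):
        return raw
    return raw.decode(legacy_encoding).encode("utf-8")

def save_as_legacy(buf: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Hypothetical save step: convert the UTF-8 buffer back to the legacy encoding."""
    return buf.decode("utf-8").encode(legacy_encoding)

latin1_file = "f\u00f6r".encode("iso-8859-1")  # b'f\xf6r', not valid UTF-8
buffer_text = load_as_utf8(latin1_file)        # what the regexp engine would see
written_back = save_as_legacy(buffer_text)     # what ends up on disk again
```

A real hook would also have to remember per buffer whether a conversion took place; the sketch leaves that out.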
The "iso-lat*.sl" files could be modified to provide a default non-UTF-8
encoding, e.g.
%% iso-latin.sl
%% Initializes upper/lowercase lookup tables for ISO Latin 1
if (_slang_utf8_ok)
  Legacy_Encoding = "iso-latin-1";
else
{
. 0 64 1 { dup define_case } _for
...
}
Conversion could be done by `iconv`, `recode` or (from|to latin-1) a
poor man's converter in SLang. Jörg posted (part of) such a solution
to the list some time ago.
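To show how little such a poor man's converter needs, here is a sketch in Python (not S-Lang, and not Jörg's code; the function names are made up). It relies only on the fact that Latin-1 bytes 0x80..0xFF are the code points U+0080..U+00FF, which UTF-8 encodes as two bytes with lead byte 0xC2 or 0xC3:

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Convert ISO Latin 1 bytes to UTF-8 without any codec library."""
    out = bytearray()
    for b in raw:
        if b < 0x80:
            out.append(b)                      # ASCII is unchanged
        else:
            out.append(0xC0 | (b >> 6))        # lead byte: 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))      # continuation byte
    return bytes(out)

def utf8_to_latin1(raw: bytes) -> bytes:
    """Convert UTF-8 back to Latin 1; reject anything above U+00FF."""
    out = bytearray()
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif b in (0xC2, 0xC3) and i + 1 < len(raw) and 0x80 <= raw[i + 1] <= 0xBF:
            out.append(((b & 0x03) << 6) | (raw[i + 1] & 0x3F))
            i += 2
        else:
            raise ValueError("byte %d: not representable in Latin 1" % i)
    return bytes(out)
```

For example, the Latin-1 byte 0xF6 ("ö") becomes the pair 0xC3 0xB6, and the reverse function turns that pair back into 0xF6.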
> My inclination is that if the lack of UTF-8 support by the current
> regular expression engine is not much of a problem, then I think that
> by default, regular expressions will be compiled using byte-semantics,
> independent of whether or not jed is running in UTF-8 mode.
My impression is that it would be quite surprising if, in UTF-8 mode,
re_search_forward matched "för" with the pattern "f..r"
(with ö == U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) while the pattern
"f[oö]r" was invalid (or did not match "för").
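The surprise can be reproduced with a small Python sketch. Again, Python's `re` on byte strings only stands in for a byte-oriented engine and is not re_search_forward itself; the variable names are illustrative.

```python
import re

O_UMLAUT = "\u00f6"                          # "ö", encoded in UTF-8 as b'\xc3\xb6'
word_chars = "f" + O_UMLAUT + "r"            # 3 characters
word_bytes = word_chars.encode("utf-8")      # 4 bytes: b'f\xc3\xb6r'
class_pattern = "f[o" + O_UMLAUT + "]r"

# Byte semantics: "." and "[...]" each consume exactly one byte.
two_dots_bytes = re.fullmatch(b"f..r", word_bytes)                  # matches
one_dot_bytes = re.fullmatch(b"f.r", word_bytes)                    # no match
class_bytes = re.fullmatch(class_pattern.encode("utf-8"), word_bytes)  # no match

# Character semantics: "." and "[...]" each consume one character.
two_dots_chars = re.fullmatch("f..r", word_chars)                   # no match
one_dot_chars = re.fullmatch("f.r", word_chars)                     # matches
class_chars = re.fullmatch(class_pattern, word_chars)               # matches
```

With byte semantics the results are exactly the reverse of what a user typing these patterns would expect.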
Sorry for my late answer. I currently do not use UTF-8 (but plan to
convert at some point).
Günter