jed-users mailing list

Re: UTF-8 and Regular Expressions

Subject: Re: UTF-8 and Regular Expressions
From: "G. Milde" <milde@xxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 18 Apr 2007 09:19:52 +0200

On 11.04.07, John E. Davis wrote:
> Hi,

> For those of you that use jed with UTF-8 encoded text, has the lack of
> true UTF-8 support by the regular expression functions been much of an
> impediment?  For example, the regular expression "." matches a single
> character, which in the UTF-8 encoding could consist of several bytes.
> However, the slang regular expression code has no knowledge of
> UTF-8, and as a result "." will match exactly one byte.

Does this also mean that a pattern like "[aä]" will fail if the "ä" is a
2-byte UTF-8 character?
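
To make the question concrete, here is a minimal sketch in Python rather
than S-Lang. Python's `re` module applied to `bytes` stands in for a
byte-oriented engine; that substitution is an assumption made for
illustration and says nothing definite about how the slang engine treats
such a class. The helper names are hypothetical.

```python
import re

# Byte semantics: pattern and text are both raw UTF-8 byte strings, so the
# class "[aä]" becomes the three single bytes 'a', 0xC3 and 0xA4.
byte_class = re.compile("[aä]".encode("utf-8"))

def byte_match_len(text):
    """Length in bytes of the first byte-level match in text, or None."""
    m = byte_class.search(text.encode("utf-8"))
    return None if m is None else m.end() - m.start()

# Character semantics, for comparison: the class holds two characters.
char_class = re.compile("[aä]")

def char_matches(text):
    """True if the character-level class matches anywhere in text."""
    return char_class.search(text) is not None
```

Under these assumptions the class does not fail outright: it matches only
the first byte of "ä", and it also matches the first byte of an unrelated
character such as "é", whose UTF-8 encoding starts with the same 0xC3 byte.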

...
> I plan to integrate PCRE with jed in the near future.  
...

Good news. Would this also affect the DFA highlight patterns?

> In testing, one problem has come up: When used in UTF-8 mode, PCRE
> cannot tolerate malformed text.  This can be a problem when jed is
> running in UTF-8 mode, but one is editing text in some other encoding,
> e.g., ISO-Latin-1.

However, when editing such text in Jed-U, "the right thing" would be to
convert it transparently to UTF-8 in a find_file_hook and convert it back
when saving (analogous to compress.sl).
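
The load/save round trip could look roughly like the following Python
sketch. It is not jed hook code: `load_as_utf8` and `save_from_utf8` are
hypothetical names, and the fallback to ISO-8859-1 whenever the file is not
well-formed UTF-8 is an assumption.

```python
def load_as_utf8(raw, legacy_encoding="iso-8859-1"):
    """Return (utf8_bytes, source_encoding) for the raw file contents.

    Well-formed UTF-8 (including plain ASCII) is passed through unchanged;
    anything else is assumed to be in legacy_encoding and is converted.
    """
    try:
        raw.decode("utf-8")
        return raw, "utf-8"
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8"), legacy_encoding

def save_from_utf8(buf, source_encoding):
    """Convert the UTF-8 buffer back to the encoding the file arrived in.

    Raises UnicodeEncodeError if the buffer now holds characters that the
    source encoding cannot represent.
    """
    return buf.decode("utf-8").encode(source_encoding)
```

With this approach the regular expression engine only ever sees well-formed
UTF-8, at the cost of having to decide what to do when a character typed
during editing has no representation in the original encoding.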

The "iso-lat*.sl" files could be modified to provide a default non-UTF-8
encoding, e.g.

%%  iso-latin.sl
%%  Initializes upper/lowercase lookup tables for ISO Latin 1

if (_slang_utf8_ok)
   Legacy_Encoding = "iso-latin-1";
else
   {
      .   0  64 1 { dup define_case } _for
      ...
   }

Conversion could be done by `iconv`, `recode`, or (from and to Latin-1) a
poor man's converter in SLang. Jörg posted (part of) such a solution
to the list some time ago.
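
For Latin-1 the conversion is simple enough to do by hand, which is what
makes a poor man's converter feasible. The sketch below is in Python, not
SLang, and is not the code Jörg posted; the function names are
hypothetical. It relies only on the fact that every Latin-1 byte at or
above 0x80 becomes a two-byte UTF-8 sequence starting with 0xC2 or 0xC3.

```python
def latin1_to_utf8(data):
    """Convert ISO-8859-1 bytes to UTF-8 bytes without using a codec."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                    # ASCII is unchanged
        else:
            out.append(0xC0 | (b >> 6))      # lead byte: 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))    # continuation byte
    return bytes(out)

def utf8_to_latin1(data):
    """Convert UTF-8 bytes back to ISO-8859-1; reject anything else."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif (b in (0xC2, 0xC3) and i + 1 < len(data)
              and 0x80 <= data[i + 1] <= 0xBF):
            out.append(((b & 0x03) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            raise ValueError("not representable in Latin-1 at byte %d" % i)
    return bytes(out)
```

The reverse direction has to fail on anything outside U+0000..U+00FF, so a
real converter would need a policy for such characters.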

> My inclination is that if the lack of UTF-8 support by the current
> regular expression engine is not much of a problem, then I think that
> by default, regular expressions will be compiled using byte-semantics,
> independent of whether or not jed is running in UTF-8 mode.

My impression is that it would be quite surprising if, in UTF-8 mode,
re_search_forward matched "för" with the pattern "f..r"
(with ö == U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) while the pattern
"f[oö]r" were invalid (or did not match "för").

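The surprise can be reproduced with any byte-oriented matcher. The
following Python sketch uses `re` on `bytes` as a stand-in for byte
semantics and `re` on `str` for character semantics; it illustrates the
two behaviours and is not a test of the slang engine itself. The helper
names are hypothetical.

```python
import re

def byte_search(pattern, text):
    """Match with byte semantics: both sides are UTF-8 encoded first."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None

def char_search(pattern, text):
    """Match with character semantics on decoded text."""
    return re.search(pattern, text) is not None
```

With byte semantics "f..r" matches "för" because "ö" occupies two bytes,
while "f.r" and "f[oö]r" do not match; with character semantics the
results are exactly reversed.
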
Sorry for my late answer. I currently do not use UTF-8 (but plan to
convert at some point).

Günter

Follow-Ups:
- Re: UTF-8 and Regular Expressions
  - From: Jörg Sommer

References:
- UTF-8 and Regular Expressions
  - From: John E. Davis
