jed-users mailing list

UTF-8 and Regular Expressions

Subject: UTF-8 and Regular Expressions
From: "John E. Davis" <davis@xxxxxxxxxxxxx>
Date: Wed, 11 Apr 2007 00:41:40 -0400

Hi,

For those of you who use jed with UTF-8 encoded text, has the lack of
true UTF-8 support in the regular expression functions been much of an
impediment?  For example, the regular expression "." is supposed to
match a single character, which in the UTF-8 encoding could consist of
several bytes.  However, the slang regular expression code has no
knowledge of UTF-8, and as a result "." will match exactly one byte.
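
To make the difference concrete, here is a minimal sketch in Python
rather than slang. Python's "re" module is used only as a stand-in: a
str pattern gives character semantics, and a bytes pattern gives the
byte semantics described above. The variable names are illustrative.

```python
import re

text = "é"                   # one character, U+00E9
raw = text.encode("utf-8")   # two bytes in UTF-8: 0xC3 0xA9

# Character semantics: "." consumes the whole character.
char_match = re.match(".", text)

# Byte semantics: "." consumes only the first byte of the sequence,
# leaving the continuation byte 0xA9 unmatched.
byte_match = re.match(b".", raw)

print(len(raw))                     # number of bytes in the encoding
print(repr(char_match.group(0)))    # the full character
print(repr(byte_match.group(0)))    # only the lead byte
```

With byte semantics, a pattern such as "a.b" would therefore not match
"aéb" in UTF-8 encoded text, since two bytes sit between "a" and "b".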

The reason I ask is that slang 3.0 will use PCRE as its regular
expression library.  In anticipation of this, I plan to integrate PCRE
with jed in the near future.  In fact, I already have a version that
uses PCRE.  In testing, one problem has come up: When used in UTF-8
mode, PCRE cannot tolerate malformed text.  This can be a problem
when jed is running in UTF-8 mode, but one is editing text in some
other encoding, e.g., ISO-Latin-1.
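
The following sketch shows why ISO-Latin-1 text counts as malformed
under UTF-8. It is written in Python and does not call PCRE: Python's
strict UTF-8 decoder stands in for the validity check, and the helper
is_valid_utf8 is a hypothetical name introduced for this example.

```python
import re

# "café" in ISO-Latin-1: the final character is the single byte 0xE9.
latin1 = "café".encode("latin-1")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xE9 is a UTF-8 lead byte that requires continuation bytes, and none
# follow it, so the buffer is not valid UTF-8.
valid = is_valid_utf8(latin1)

# A byte-semantics pattern does not care about the encoding and still
# matches the raw bytes.
m = re.search(rb"caf.", latin1)

print(valid)
print(m.group(0))
```

An engine that insists on well-formed UTF-8 would have to reject such
a buffer, whereas byte semantics works for any encoding.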

If the lack of UTF-8 support in the current regular expression engine
is not much of a problem, then my inclination is to compile regular
expressions using byte semantics by default, regardless of whether
jed is running in UTF-8 mode.

Finally, I mentioned slang 3.0 but I have no idea when that will be
released.  I am still working on slang-2.1, which will be released in
the very near future.

Thanks,
--John

Follow-Ups:
- Re: UTF-8 and Regular Expressions
  - From: Jörg Sommer
- Re: UTF-8 and Regular Expressions
  - From: G. Milde
