- Subject: UTF-8 and Regular Expressions
- From: "John E. Davis" <davis@xxxxxxxxxxxxx>
- Date: Wed, 11 Apr 2007 00:41:40 -0400
Hi,
For those of you who use jed with UTF-8 encoded text, has the lack of
true UTF-8 support in the regular expression functions been much of an
impediment? For example, the regular expression "." is supposed to
match a single character, which in the UTF-8 encoding could consist of
several bytes. However, the slang regular expression code has no
knowledge of UTF-8, and as a result "." will match exactly one byte.
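
To make the difference concrete, here is a small sketch in Python. It
uses Python's `re` module only as a stand-in for the two kinds of
matching; it is not the slang engine, and the variable names are
purely illustrative.

```python
import re

text = "é"                   # one character, U+00E9
raw = text.encode("utf-8")   # two bytes in UTF-8: 0xC3 0xA9

# Character semantics: "." consumes the whole character.
char_match = re.match(".", text).group()

# Byte semantics: "." consumes only the first byte of the sequence,
# leaving the second byte (0xA9) unmatched.
byte_match = re.match(b".", raw).group()

print(len(raw), repr(char_match), repr(byte_match))
```

With byte semantics, a pattern such as "a.b" would therefore fail to
match "aéb" stored as UTF-8, since two bytes sit between "a" and "b".
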
The reason I ask is that slang 3.0 will use PCRE as its regular
expression library. In anticipation of this, I plan to integrate PCRE
with jed in the near future. In fact, I already have a version that
uses PCRE. In testing, one problem has come up: When used in UTF-8
mode, PCRE cannot tolerate malformed text. This can be a problem
when jed is running in UTF-8 mode, but one is editing text in some
other encoding, e.g., ISO-Latin-1.
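
As a rough illustration of why such text counts as malformed, the
Python sketch below checks an ISO-Latin-1 byte string for UTF-8
validity and then searches it with a byte-oriented pattern. Python's
decoder and `re` module stand in for the general idea here; this is
not how PCRE or jed perform the check, and `is_valid_utf8` is a
hypothetical helper.

```python
import re

# "café" in ISO-Latin-1: the final byte 0xE9 is a UTF-8 lead byte with
# no continuation bytes after it, so the string is not valid UTF-8.
latin1 = "café".encode("latin-1")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

valid = is_valid_utf8(latin1)

# A byte-semantics pattern does not care about the encoding and still
# matches, with "." consuming the single byte 0xE9.
m = re.search(rb"caf.", latin1)

print(valid, m.group())
```
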
My inclination is that if the lack of UTF-8 support in the current
regular expression engine is not much of a problem, then by default
regular expressions will be compiled using byte semantics,
independent of whether or not jed is running in UTF-8 mode.
Finally, I mentioned slang 3.0 but I have no idea when that will be
released. I am still working on slang-2.1, which will be released in
the very near future.
Thanks,
--John