jed-users mailing list

UTF-8 and Regular Expressions


Hi,

For those of you who use jed with UTF-8 encoded text, has the lack of
true UTF-8 support in the regular expression functions been much of an
impediment?  For example, the regular expression "." should match a
single character, which in the UTF-8 encoding may consist of several
bytes.  However, the slang regular expression code has no knowledge of
UTF-8, and as a result "." matches exactly one byte.
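
As a rough illustration of the difference between byte and character
semantics, here is a minimal sketch.  It uses Python's re module
rather than slang (an assumption made purely for demonstration; the
slang and PCRE APIs are not shown), matching "." once against decoded
text and once against the raw UTF-8 bytes:

```python
import re

text = "é"                      # one character, U+00E9
encoded = text.encode("utf-8")  # two bytes in UTF-8: 0xC3 0xA9

# Character semantics: "." consumes the whole character.
char_match = re.match(".", text)

# Byte semantics: "." consumes only the first byte of the sequence.
byte_match = re.match(b".", encoded)

print(len(encoded))         # number of bytes in the encoded character
print(char_match.group())   # the full character
print(byte_match.group())   # only the lead byte
```

With byte semantics the match ends in the middle of the encoded
character, which is the behavior described above.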

The reason I ask is that slang 3.0 will use PCRE as its regular
expression library.  In anticipation of this, I plan to integrate PCRE
with jed in the near future.  In fact, I already have a version that
uses PCRE.  In testing, one problem has come up: When used in UTF-8
mode, PCRE cannot tolerate malformed text.  This can be a problem
when jed is running in UTF-8 mode, but one is editing text in some
other encoding, e.g., ISO-Latin-1.
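
The mismatch can be sketched as follows, again in Python rather than
slang or PCRE.  The helpers is_valid_utf8 and search are hypothetical
names introduced only for this example, and the fallback to byte
semantics is one possible policy, not a description of what jed or
PCRE actually does:

```python
import re

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def search(pattern: str, data: bytes):
    """Hypothetical helper: match with character semantics when the
    buffer is valid UTF-8, otherwise fall back to byte semantics."""
    if is_valid_utf8(data):
        m = re.search(pattern, data.decode("utf-8"))
        return m.group().encode("utf-8") if m else None
    # Assumes an ASCII-only pattern, so encoding it is unambiguous.
    m = re.search(pattern.encode("ascii"), data)
    return m.group() if m else None

latin1 = "café".encode("latin-1")  # b'caf\xe9': malformed as UTF-8
utf8 = "café".encode("utf-8")      # b'caf\xc3\xa9': well-formed

print(is_valid_utf8(latin1), is_valid_utf8(utf8))
print(search("caf.", latin1))
print(search("caf.", utf8))
```

Here the ISO-Latin-1 buffer is rejected as UTF-8, so the search only
succeeds because it drops back to treating the text as plain bytes.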

My inclination is this: if the lack of UTF-8 support in the current
regular expression engine is not much of a problem, then by default
regular expressions will be compiled using byte semantics, regardless
of whether jed is running in UTF-8 mode.

Finally, I mentioned slang 3.0, but I have no idea when it will be
released.  I am still working on slang 2.1, which will be released in
the very near future.

Thanks,
--John
