> -----Original Message-----
> From: G. Milde
> Sent: Wednesday, 18 April 2007 9:20
> Subject: Re: UTF-8 and Regular Expressions

> Hi,
>
> > In testing, one problem has come up: When used in UTF-8 mode, PCRE
> > cannot tolerate malformed text. This can be a problem when jed is
> > running in UTF-8 mode, but one is editing text in some other encoding,
> > e.g., ISO-Latin-1.
>
> However, when editing text in Jed-U, "the right thing" would be to
> convert it transparently to UTF-8 in a find_file_hook and re-convert back
> when saving (analog to compress.sl).

I not only agree, I also have a stronger feeling: I think that internally
JED should always work in UTF-8 (dropping support for 8-bit characters),
and convert from/to the local encoding as needed (using locale()
information, or some sort of -*- encoding: -*- marker in the file if
present).

Clearly, to avoid big regressions, this should be done in reverse order:
first we need robust support for on-the-fly encoding conversion; only
after that can we drop support for the 8-bit internal encoding.

> Conversion could be done by `iconv`, `recode` or (from|to latin-1) a
> poor-mans converter in SLang. Jörg did post (part of) such a solution
> some time ago to the list.
>

Well, I don't know if Jörg wrote something for this, but I did: I have an
iconv module for SLang. This is the message to this list announcing
iconv_module:

http://www.ruptured-duck.com/jed-users/msg00721.html

(Please don't use the attachment to that message: it has changed a lot
since then. See the file attached to this mail instead.)

Here is a mail Marko Mahnic wrote some time ago. I think it gives a good
description of what is needed to support multiple charsets:

http://ruptured-duck.com/jed-users/msg00515.html

Another interesting thread about UTF-8 and charsets:

http://www.ruptured-duck.com/jed-users-2003/msg00373.html

In one of the threads linked above, John said he prefers to add a native
interface for charset conversions to SLang, instead of a module. This way
we can write some "poor man's" version for systems without iconv. For
modern Linux systems (anything with glibc) this is not a problem, as iconv
is integrated in glibc, and for Windows, well, my JED installer already
ships iconv.dll :-)

> > My inclination is that if the lack of UTF-8 support by the current
> > regular expression engine is not much of a problem, then I think that
> > by default, regular expressions will be compiled using byte-semantics,
> > independent of whether or not jed is running in UTF-8 mode.
>
> I do have the impression, that it would be quite surprising
> if re_search_forward did in UTF-8 mode pattern "f..r" did match "för"
> (with ö == U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) and a pattern
> "f[oö]r" were invalid (or not matching "för").
>

Yes, I think it is a bit strange... Anyway, I use regular expressions a
lot, and I don't remember ever needing UTF-8 support or having a problem
because it was missing. But I'm probably not a good test case: in Italian
we have very few characters outside the ASCII (7-bit) set.

Thanks,
Dino

Attachment:
jed-charset.zip
Description: Binary data
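
The "f..r" versus "för" point quoted in the message above is perhaps easier to see with a small experiment. What follows is only an illustrative sketch and not part of the original mail: it uses Python's `re` module and built-in codecs as stand-ins, so it says nothing definite about how the SLang regular expression engine, PCRE, or the attached iconv module behave. All names in it (`word`, `is_valid_utf8`, and so on) are made up for the example.

```python
import re

word = "för"                     # ö is U+00F6
utf8 = word.encode("utf-8")      # 4 bytes: ö becomes 0xC3 0xB6
latin1 = word.encode("latin-1")  # 3 bytes: ö becomes 0xF6


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin-1 text is, in general, malformed when read as UTF-8.
latin1_is_valid_utf8 = is_valid_utf8(latin1)

# Byte semantics on UTF-8 data: every "." consumes one byte,
# so the two-byte ö needs two dots.
byte_two_dots = re.fullmatch(rb"f..r", utf8) is not None
byte_one_dot = re.fullmatch(rb"f.r", utf8) is not None

# A byte-wise character class holding the UTF-8 bytes of ö matches
# only ONE of those bytes, so the class pattern does not match "för".
byte_class = re.fullmatch("f[oö]r".encode("utf-8"), utf8) is not None

# Character semantics: ö is a single character.
char_two_dots = re.fullmatch(r"f..r", word) is not None
char_one_dot = re.fullmatch(r"f.r", word) is not None
char_class = re.fullmatch(r"f[oö]r", word) is not None

# Convert on load, convert back on save (codecs standing in for iconv).
loaded = latin1.decode("latin-1").encode("utf-8")
saved = loaded.decode("utf-8").encode("latin-1")

print(byte_two_dots, byte_one_dot, byte_class)
print(char_two_dots, char_one_dot, char_class)
print(latin1_is_valid_utf8, loaded == utf8, saved == latin1)
```

With byte semantics the two-dot pattern matches and the character class does not, which is the surprise described in the quoted text. Converting to UTF-8 on load and back on save leaves the Latin-1 file unchanged on disk in this small case.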