STR #2158: In Fl_Text_Display, word wrapping does not work when mixing with unicode data.

Application:	FLTK Library
Status:	1 - Closed w/Resolution
Priority:	5 - Critical, e.g. nothing working at all
Scope:	3 - Applies to all machines and operating systems
Subsystem:	Core Library
Summary:	In Fl_Text_Display, word wrapping does not work when mixing with unicode data.
Version:	1.3-current
Created By:	sparkaround
Assigned To:	matt
Fix Version:	1.3.0 (SVN: v7812)

Trouble Report Files:

#1	engelsman 13:24 Apr 04, 2009

wrap_text.cxx
3k

#2	sparkaround 02:27 Mar 24, 2010

textdisplay.cxx
0k

#3	sparkaround 02:27 Mar 24, 2010

textdisplay.png
2k

#4	sparkaround 03:57 Mar 24, 2010

textdisplay2.png
2k

#5	engelsman 13:40 Mar 29, 2010

wrap_text2.cxx
3k

#6	engelsman 13:35 Mar 30, 2010

str2158.zip
60k

#7	sparkaround 22:57 Mar 31, 2010

texteditor.png
4k

#8	engelsman 13:57 Apr 01, 2010

wrap_mode10.cxx
2k

#9	engelsman 13:57 Apr 01, 2010

wrap_mode10.png
11k

#10	engelsman 13:30 Apr 11, 2010

str2158b.zip
58k

#11	sparkaround 20:12 Apr 20, 2010

7551wrap10.png
6k

#12	sparkaround 07:51 Apr 21, 2010

wrap_mode10_cjk.png
9k

#13	sparkaround 07:52 Apr 21, 2010

wrap_mode10_cjk.cxx
1k

#14	sparkaround 07:57 Apr 21, 2010

wrap_mode10_cjk_v2.cxx
1k

#15	sparkaround 07:57 Apr 21, 2010

wrap_mode10_cjk_v2.png
8k

#16	sparkaround 08:11 Apr 21, 2010

wrap_mode10_decreased.png
8k

#17	engelsman 09:48 May 17, 2010

unidecode.cxx
7k

Trouble Report Comments:

#1	sparkaround 14:37 Feb 18, 2009

The Fl_Text_Display widget didn't wrap words at the same column after I enabled word wrapping with code like this: 'textdisplay->wrap_mode(1,80);' It seemed that the unicode data was counted by bytes in 'UTF-8' encoding. But: 1. Unicode chars in utf8 encoding have different sizes in bytes. 2. In a unicode environment, ascii is probably rendered with a different font from the others, or at least with glyphs of a different width in pixels.

#2	engelsman 13:23 Apr 04, 2009

I've produced a test program that seems to have the behaviour that you describe, where the string is wrapped at the byte number rather than the character number. [There is a bug in the test code where I could not enter the correct three byte UTF-8 value, but the code does show the basic principle.] I noticed that in the code for calculating the display width of UTF-8 strings there is an optimisation when using fixed width fonts, hence the Helvetica and Courier buttons on the test program, but the font change does not make any difference to the wrap position. This program was written on Linux, using fltk-1.3.x-r6741.

#3	AlbrechtS 17:41 Apr 04, 2009

U+0080 is _not_ the Euro sign (correct is U+20AC). Try this instead:
<pre>
char utfTwoByte[] = "....\xc2\xb5.... "; // U+00b5 (Micro sign µ)
char utfThreeByte[] = "....\xe2\x82\xac.... "; // U+20ac (Euro sign €)
</pre>
or maybe:
<pre>
char utfTwoByte[] = "....\xc2\xa4.... "; // U+00a4 (Currency ¤)
</pre>
The latter is interesting, because ISO-8859-15 defines x'a4' to be the Euro sign.
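For anyone who wants to double-check byte sequences like the ones above, here is a minimal sketch of a UTF-8 encoder in plain C++. The name `encode_utf8` is made up for this note and is not an FLTK function; the sketch assumes a valid code point and does no checking for surrogates or values above U+10FFFF.

```cpp
#include <string>

// Hypothetical helper (not part of FLTK): encode one Unicode code point
// as UTF-8. Assumes cp is a valid scalar value; no validation is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, `encode_utf8(0x20AC)` gives the three bytes `\xe2\x82\xac` and `encode_utf8(0x00B5)` gives `\xc2\xb5`, matching the literals suggested above, while U+0080 encodes to `\xc2\x80`, which is a control character and not the Euro sign.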

#4	engelsman 00:13 Apr 05, 2009

As you know, I'm no expert on non-ASCII character encodings. I just got the values from Markus Kuhn's Unicode FAQ: http://www.cl.cam.ac.uk/~mgk25/unicode.html or related pages, but as I've had problems even getting the character codes into the program, I've probably messed it up. I also wonder/doubt whether I have the correct fonts loaded to be able to display the characters properly anyway... D.

#5	greg.ercolano 00:56 Apr 05, 2009

If your browser can display it, you probably have the right fonts at least somewhere on your system. Under linux, you probably have to tweak around with Fl::set_font() at the head of your app to get FLTK to show the utf8 strings correctly, since by default it seems to select fonts with limited glyphs. I always eventually get things to work by noodling around with xfontsel and doing 'locate iso-8859' and sniffing through the 'man -k iso-8859' man pages, and by having samples of the font copy/pasted from a website or customer having the problem, and trying different fonts from xfontsel until it works. Usually it's a matter of tracking down the right font to stick in Fl::set_font().. each rev of linux seems a bit different. With XFT, I've been able to get e.g. japanese to work with: Fl::set_font(FL_COURIER, "Kochi Gothic"); Without XFT, I usually have to mess with xfontsel. For European fonts, see eg. 'man iso-8859-1' which lists what the last digit means.

#6	ianmacarthur 03:56 Aug 28, 2009

I seem to be very late to this particular party, which is bad because it is entirely possible that some of the issues flagged are my fault... Anyway, trawling through the STR, there are a few oddities and misconceptions, so in the hope that I can add some clarity I offer the following thoughts...
The Euro symbol: The sample code uses U+0080 for this, which is wrong. Sort of. Here comes the mess... The Unicode spec defines the range 0080 to 009F as non-printing control characters, but it seems like no one ever uses them. Now... lots of web pages that claim to be UTF8 are actually in some MS code page, derived from CP125x, which *does* use the range 0080 to 009F for characters, and in CP125x 0080 *is* the Euro symbol. So there is a sort of unofficial convention to assume that anything in the 0080 to 009F range is actually CP125x and map accordingly. Even fltk has some support for this, see the file fl_utf.c and search for the define ERRORS_TO_CP1252, which does exactly that. However, I do not think we have ERRORS_TO_CP1252 enabled by default. So in fltk, 0080 is *not* a Euro sign, although it possibly could be... As Albrecht observes later in the thread, a more appropriate choice for the Euro sign is U+20AC, but just to add further confusion, the code point U+20A0 is defined as "EURO-CURRENCY SIGN", which is *not* the Euro symbol itself, but a standard symbol for currency... Confused yet?
Anyway... The odd wrapping kind of looks as if we are just wrapping a wee bit early on the lines with non-ASCII glyphs? It sort of mostly works, but is just a bit premature? Is that what others are seeing? For the record, I'm actually trying this on a WinXP box right now, so XFT behaviour might well be different...!
The OP doesn't say what platform they are on, but does mention that the non-ASCII chars seem to be rendered in a different font from everything else. This possibly implies the OP is on Vista or OSX, either of which can do automatic substitution of fonts to handle "missing" glyphs. Other platforms (XFT, Windows prior to Vista, etc.) don't do this auto-font-substitution but if we really want to, we probably can make them do it (albeit with a bit of extra work in fltk...)
Now, the OP speculates that the wrap code is counting the physical bytes rather than the glyphs, and this is quite possibly true. The file fl_utf8.cxx provides the functions fl_utf8len() and fl_utf_nb_char() that we probably ought to be using, and I have a vague recollection that the text-editor stuff possibly has its own private, and slightly different, versions of these functions too. No, I don't know why.
Lastly, Greg has some comments about how to find a font that can render his text. For what it is worth, I have (somewhere) a program that I wrote for XFT+fltk, that will tell you whether a font has the glyphs needed to render your text... If I can find it, you are welcome to a copy. In principle, it is "trivial" to extend that code to search all the installed fonts to find the one that has the fewest missing glyphs... Also, in principle, it is possible with XFT to construct a "super font" as a "union" of several actual fonts, to obtain the extra glyph coverage. Fltk does not currently support that, but there are patches posted (by Timothy Lee IIRC) that take us in that direction... Maybe we need to look at those again?

#7	engelsman 00:49 Dec 06, 2009

I think that I might have isolated the problem in this particular case, but I'm not sure whether any issues might pop up elsewhere in Fl_Text_Display and Fl_Text_Buffer handling.
<pre>
int Fl_Text_Buffer::character_width(char c, int indent, int tabDist, char nullSubsChar) {
  /* Note, this code must parallel that in Fl_Text_Buffer::ExpandCharacter */
  if (c == '\t')
    return tabDist - (indent % tabDist);
  else if (((unsigned char) c) <= 31)
    return strlen(ControlCodeTable[ (unsigned char) c ]) + 2;
  else if (c == 127)
    return 5;
  else if (c == nullSubsChar)
    return 5;
  else if ((c & 0x80) && !(c & 0x40))
    return 0;
  else if (c & 0x80) {
    return fl_utf8len(c);
  }
  return 1;
}
</pre>
I think that the final call to fl_utf8len() is wrong in this case. A Unicode character represented as a UTF-8 byte sequence consists of a header byte that indicates how many bytes are in the sequence, and one or more non-header bytes. fl_utf8len() returns the number of bytes in the sequence if passed the header byte, and zero for the non-header bytes. The code above appears to correctly handle tabs, ASCII control and "nul" characters, and the UTF-8 non-header byte. However, instead of treating the UTF-8 header byte as indicating that the UTF-8 byte sequence expands to ONE character, it expands it to the number of bytes in the sequence. Changing "fl_utf8len(c)" to "1" in the above solves this problem.
The Fl_Text_Display and Fl_Text_Buffer code is hard to understand, and it may be that there are other areas where similar logic needs to be investigated. D.
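To make the effect of the one-line change concrete, here is a small standalone sketch that does not use FLTK. The functions `width_as_quoted`, `width_one_per_char` and `total_columns` are invented for illustration; they model only the UTF-8 branches of the routine quoted above and leave out tabs, control codes and nullSubsChar.

```cpp
#include <string>

// Stand-in for the UTF-8 branches as quoted: a header byte counts as the
// number of bytes in its sequence, a non-header byte counts as 0.
int width_as_quoted(unsigned char c) {
    if ((c & 0xC0) == 0x80) return 0;   // non-header (continuation) byte
    if ((c & 0xE0) == 0xC0) return 2;   // header of a 2-byte sequence
    if ((c & 0xF0) == 0xE0) return 3;   // header of a 3-byte sequence
    if ((c & 0xF8) == 0xF0) return 4;   // header of a 4-byte sequence
    return 1;                           // plain ASCII
}

// Stand-in for the proposed change: every character counts as one column.
int width_one_per_char(unsigned char c) {
    return ((c & 0xC0) == 0x80) ? 0 : 1;
}

// Sum the per-byte widths over a whole UTF-8 string.
int total_columns(const std::string& s, int (*width)(unsigned char)) {
    int cols = 0;
    for (char ch : s) cols += width(static_cast<unsigned char>(ch));
    return cols;
}
```

For the string "x", Euro sign, "y" (bytes `78 e2 82 ac 79`), the quoted logic adds up to 5 columns, which is the byte count, while the one-per-character version adds up to 3. That difference is why lines with multi-byte characters reach the wrap column early.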

#8	engelsman 09:29 Dec 13, 2009

Oops! I wrote: "fl_utf8len() returns the number of bytes in the sequence if passed the header byte, and zero for the non-header bytes." but in fact it returns -1 for a non-header byte. However, in the Fl_Text_Buffer::character_width() code shown, the non-header byte was already handled in the "if ((c & 0x80) && !(c & 0x40))" line. As far as I can tell, the other uses of fl_utf8len() appear correct, but it might make it more obvious if there were functions (macros?) such as:
fl_utf8Byte(c) returns (c & 0x80)
fl_utf8FirstByte(c) returns ((c & 0x80) && (c & 0x40))
fl_utf8OtherByte(c) returns ((c & 0x80) && !(c & 0x40))
or whatever better naming scheme someone can come up with. fl_draw.cxx already has C_IN(c) and C_UTF8(c) macros for some of it.
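As a rough illustration of the helpers suggested above, they could be written as small inline functions instead of macros. The names below are hypothetical and are not part of the FLTK API; the bit tests are the same ones given in the comment.

```cpp
// Hypothetical helpers, not FLTK functions: classify a single byte of UTF-8.

// True for any byte that belongs to a multi-byte sequence (high bit set).
inline bool utf8_is_multibyte(unsigned char c) { return (c & 0x80) != 0; }

// True for the first (header) byte of a multi-byte sequence: 11xxxxxx.
inline bool utf8_is_first_byte(unsigned char c) { return (c & 0xC0) == 0xC0; }

// True for a following (non-header) byte: 10xxxxxx.
inline bool utf8_is_other_byte(unsigned char c) { return (c & 0xC0) == 0x80; }
```

Taking `unsigned char` avoids surprises on platforms where plain `char` is signed. For the Euro sign bytes `e2 82 ac`, only `e2` is a first byte and the other two are non-header bytes.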

#9	sparkaround 02:39 Mar 24, 2010

This problem still exists in fltk-1.3.x-r7302. I just uploaded test code "textdisplay.cxx" together with a screenshot ("textdisplay.png"). I expected wide chars to be counted by the width of the char, not by the bytes in utf8 encoding. And I found a new problem: a line with colorful characters is wrapped earlier than a plain line with the same font size, as if the line-wrap margin were shortened.

#10	engelsman 03:44 Mar 24, 2010

I proposed a solution on the 6th December, which is a one-line fix, but I don't think that this has been changed in svn. Can you please modify Fl_Text_Buffer::character_width() in your local sources as described above [if looking at the STR view, not fltk.bugs] and confirm whether this solves your problem or not? Cheers D.

#11	sparkaround 04:08 Mar 24, 2010

Thank you. The result really changed when applying the fix. But the problem isn't solved completely. I have uploaded the result as "textdisplay2.png". In my example, the width of the selected wide character is 2. After I return 2 instead of 1, the example code works correctly:
<pre>
else if (c & 0x80) {
  //return fl_utf8len(c);
  //return 1;
  return 2;
}
</pre>
But this will only work for wide characters of width 2. Is there any method to fix this generally? Thanks again.

#12	engelsman 05:21 Mar 24, 2010

It's 4 months since I looked at the code, and I didn't really understand it properly then because it's all recursive and poorly refactored, but if I remember correctly, there are different paths through the code for constant width fonts and for proportional width fonts. So it looks like my fix catches the problem with calculating the correct number of characters (rather than bytes) for rendering in a fixed width font. Now we need to look at handling the pixel width of the characters when rendered in a proportional font. I'm afraid there won't be a quick fix because:
- I don't have any spare time at the moment to look into it;
- I don't have the same locale and Chinese(?) fonts as you;
- I've forgotten how the code works;
- most important of all, this is probably the most obscure and poorly structured code that I've ever worked with...
D.

#13	engelsman 12:08 Mar 27, 2010

I've just started to look at the code again, and I could be wrong, but it seems that the different code paths for proportional and fixed width fonts have more to do with calculating maximum display width for use with scrollbars. Everywhere else seems to look at wrapping at a given column number, so it could be that your second problem is just a feature of how the code works and not a bug. But there's still a lot of code to check yet...

#14	sparkaround 23:15 Mar 27, 2010

Thank you! I also found that when "wrap_mode" is called with margin=0 (e.g. wrap_mode(1,0)), the text is wrapped based on the pixel width of the current widget. The style (font size) is considered, too! But it is a pity that wrap_mode(1,0) is even worse for wide characters than the original wrap_mode(1,20): a line composed of wide characters will never be wrapped! It seems that wrap_mode(1,0) uses a method like the one that decides whether the scrollbar should be visible (duplicated code rather than a shared call?). This is done with the help of the font API, not with utf8len or a character-width count. Yes, this is another wrap method. But for the original wrap method (wrap_mode(1,N), N>0), I still think it is a bug. There the width of a line is counted in bytes of the utf8 encoding, not in the real number of characters (as with your one-line fix) or in the column width of wide characters (with something like the "wcwidth()" API).

#15	AlbrechtS 01:54 Mar 28, 2010

FWIW, just a few notes to make clear what I think I see from the example pictures. In FLTK 1.1, wrapping (wrap_mode(1,N), N>0) works correctly only with fixed fonts, because it only counts bytes (columns) of text.
Duncan, from what I can see, I think that your patch is almost correct, but only for LGC characters, where each glyph occupies one column position. Sparkaround showed an example with CJK characters, where each character occupies two "normal" (LGC) character positions, and hence his patch version with 2 columns instead of 1 column for each character works better for his example.
Conclusion(?): There is no such thing as a "fixed font width" in UTF-8, because LGC and CJK characters have _different_ widths.
Possible solution: to avoid calling fl_measure() or some other expensive method, we could maybe calculate the Unicode code point and use a table to decide whether that particular character has a width of 1 or 2 columns.
Sparkaround: Looking back at your post, I think that I just described what wcwidth() is intended to do. And for the record: http://www.opengroup.org/onlinepubs/009695399/functions/wcwidth.html But note in the above text: "This function was removed from the final ISO/IEC 9899:1990/Amendment 1:1995 (E), and the return value for a non-printable wide character is not specified." There is also Markus Kuhn's free implementation: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
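The table idea described above might look roughly like the following sketch. It is not Markus Kuhn's code and not what FLTK ships: the function name `column_width` is invented, the range table is a deliberately small subset of wide (CJK and Hangul) ranges for illustration only, and zero-width combining characters are ignored.

```cpp
// A [first, last] range of code points, inclusive.
struct Range { char32_t first, last; };

// Illustrative subset of double-width ranges; a real table is longer.
static const Range kWideRanges[] = {
    {0x1100, 0x115F},  // Hangul Jamo initial consonants
    {0x2E80, 0x303E},  // CJK radicals and punctuation (stops before U+303F)
    {0x3040, 0xA4CF},  // Hiragana through Yi, incl. CJK Unified Ideographs
    {0xAC00, 0xD7A3},  // Hangul syllables
    {0xF900, 0xFAFF},  // CJK compatibility ideographs
    {0xFF00, 0xFF60},  // fullwidth forms
};

// Hypothetical helper: 0 for NUL, -1 for control characters,
// 2 for code points in the table above, 1 for everything else.
int column_width(char32_t cp) {
    if (cp == 0) return 0;
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0)) return -1;
    for (const Range& r : kWideRanges) {
        if (cp >= r.first && cp <= r.last) return 2;
    }
    return 1;
}
```

Under these assumptions an ASCII letter and the Currency sign U+00A4 both take one column, a CJK ideograph such as U+54C8 takes two, and U+0080 is reported as non-printable, which matches the point made earlier in the thread that U+0080 is a control character.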

#16	engelsman 04:13 Mar 28, 2010

If you poke around on the NEdit web pages, there's the comment somewhere, and I paraphrase here, that proportional fonts are not handled well because nedit is really a programmers' editor, designed for use with source code, with fixed width fonts and columns of characters. I think, therefore, that we might be trying to shoe-horn something into this code that is never going to fit properly. The code is not exactly clear and easily maintainable now, never mind adding special case code. But then, as the latest release of NEdit was 5.5 in 2004, it looks unlikely that there will be new releases, so we don't need to remain compatible.
Unless we provide a completely new widget, written from the ground up, we should probably be looking at making some clear statements in the documentation (for wrap_mode() perhaps?) that explain that:
- wrapping is based on glyph pixel width only (no columns)
- wrapping is based on 1 column per character for fixed width LGC fonts
- wrapping is based on 2 columns per character for fixed width CJK fonts
- mixing the last two may produce unexpected results.
Maybe we could extend wrap_mode() to have an additional parameter that specifies the number of columns per character, with 0 for mixed fonts. Just thinking aloud...

#17	engelsman 06:29 Mar 28, 2010

OK, been looking at this a bit further... If we add Markus Kuhn's wcwidth.c into src/xutf8 for testing, we still need to have a wrapper function that takes a char* pointing to the start of the UTF-8 sequence, converts it to UCS, and then calls mk_wcwidth(). Plus, we will need to rejig Fl_Text_Buffer because at the moment we have int Fl_Text_Buffer::expand_character(char c, ...) and int Fl_Text_Buffer::character_width(char c, ...) If we overload these to have versions that take a char* we might be able to do it, but more investigation is required.
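The "convert the UTF-8 sequence to UCS" step of such a wrapper can be sketched without FLTK as follows. The `Decoded` struct and `decode_utf8` function are hypothetical names for this note; the decoder is minimal, does not reject overlong encodings or surrogates, and returns U+FFFD with a length of 1 for malformed input. The caller is assumed to pass a `pos` inside the string.

```cpp
#include <cstddef>
#include <string>

// Result of decoding one character: the code point and the bytes consumed.
struct Decoded {
    char32_t cp;
    int len;
};

// Hypothetical helper: decode the UTF-8 sequence starting at s[pos].
// Precondition: pos < s.size().
Decoded decode_utf8(const std::string& s, std::size_t pos) {
    const unsigned char c = static_cast<unsigned char>(s[pos]);
    int len = 0;
    char32_t cp = 0;
    if (c < 0x80) return {c, 1};                          // ASCII
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
    else return {0xFFFD, 1};                              // stray non-header byte

    if (pos + static_cast<std::size_t>(len) > s.size()) return {0xFFFD, 1};
    for (int i = 1; i < len; ++i) {
        const unsigned char cc = static_cast<unsigned char>(s[pos + i]);
        if ((cc & 0xC0) != 0x80) return {0xFFFD, 1};      // truncated sequence
        cp = (cp << 6) | (cc & 0x3F);                     // append 6 payload bits
    }
    return {cp, len};
}
```

A caller would pass `cp` to whatever width function is chosen and advance the buffer position by `len`. For example, the bytes `e5 93 88` decode to U+54C8 with a length of 3.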

#18	engelsman 10:42 Mar 28, 2010

After a lot of dicking around trying to add wcwidth.c to libfltk_xutf8.a and failing, I finally just #include'd it directly in Fl_Text_Buffer.cxx and set up:
<pre>
int Fl_Text_Buffer::expand_character(const char* s, int indent, char *outStr, int tabDist, char nullSubsChar) {
  char c = *s;
  if ((c & 0x80) && (c & 0x40)) {
    int len = fl_utf8len(c);
    int ret = 0;
    unsigned int ucs = fl_utf8decode(s, s+len, &ret);
    int width = mk_wcwidth((wchar_t)ucs);
    fprintf(stderr, "mk_wcwidth(%x) -> %d\n", ucs, width);
    return width;
  }
  return expand_character(c, indent, outStr, tabDist, nullSubsChar);
}

int Fl_Text_Buffer::character_width(const char* s, int indent, int tabDist, char nullSubsChar) {
  char c = *s;
  if ((c & 0x80) && (c & 0x40)) {
    int len = fl_utf8len(c);
    int ret = 0;
    unsigned int ucs = fl_utf8decode(s, s+len, &ret);
    int width = mk_wcwidth((wchar_t)ucs);
    fprintf(stderr, "mk_wcwidth(%x) -> %d\n", ucs, width);
    return width;
  }
  return character_width(c, indent, tabDist, nullSubsChar);
}
</pre>
to call the existing versions, and started to substitute the calls for the originals with the new ones, but then you need to create a lookalike for Fl_Text_Buffer::character(pos) that returns a char* and so on. I got so far, but then the textdisplay example stopped displaying the chinese characters, even though I could see that the above were calling mk_wcwidth(0x54c8) and returning 2. Then I discovered that there's a whole load of calculations based on reverse mapping from column numbers on the display to pixel widths. At this point, I realised this is just one big can of worms, and it probably means that my other idea for adding a column indicator to wrap_mode() will probably be difficult to get working too :-(

#19	engelsman 12:22 Mar 28, 2010

Sorry guys, bit of brain-fade there earlier. I only needed the extra character_width() implementation, not the expand_character(). That gives me two rows of 10 'a' and 4 rows of 5 '[]' because I don't have the correct chinese locale and fonts. It still might not work correctly as part of Fl_Text_Editor with the reverse character to column mapping though. I haven't tested it beyond the 'textdisplay.cxx' at the moment, and I will need to tidy up the source before I paste it here for testing in sparkaround's real environment...

#20	engelsman 13:50 Mar 29, 2010

Just posted wrap_text2.cxx, a correction to wrap_text.cxx where the three utf8* string variables now contain dots, ucs, dots, space with:
U+0024 (\x24) dollar
U+00A9 (\xc2\xa9) copyright sign
U+20AC (\xe2\x82\xac) euro currency sign
after spending ages wondering why Markus Kuhn's wcwidth() returned -1 for the 2-byte literal euro symbol I had cut'n'pasted from somewhere. As Albrecht and Ian point out above, the euro wasn't what I thought.

#21	engelsman 13:59 Mar 30, 2010

I've just appended str2158.zip to the STR as a proof of concept. The zip file contains hacked versions of the following:
- FL/Fl_Text_Buffer.H with declarations for characterP() and character_width(const char*...)
- src/Fl_Text_Buffer.cxx with their implementations and calls to the new routines
- src/Fl_Text_Display.cxx with calls to the new routines
- src/xutf8/wcwidth.c Markus Kuhn's wcwidth() implementation as discussed above
Unfortunately, I've had these files hanging around for a while since I started the debugging in November 2009, so there are various other changes that show up if you just try to diff against the originals. However, if you search for the characterP() and character_width() calls, you can see that in most cases the following line is a comment containing the original line preceded by a drg: tag. As I say, this is a dirty hack to check proof of concept, and I don't know how complete it really is, or whether other problems will arise when tested against more text.
As mentioned yesterday, I found that the euro character I had cut'n'pasted from somewhere into wrap_text.cxx was not really the valid UTF-8 encoding of the correct UCS character, and comes from the fact that HTML browsers often infer that 0x80 must come from Windows-1252 rather than ISO-8859-1. Markus Kuhn's code correctly returns -1 because 0x80 should be the PAD character...
Unpack the zip file in a temp directory to be on the safe side, and make sure you save the original files before overwriting them. I've included wcwidth.c directly into Fl_Text_Buffer.cxx rather than add it to the libraries and link it in. This can be tidied up later. Does this solve your problem, sparkaround?

#22	sparkaround 19:28 Mar 30, 2010

Thank you so much! It works well for ascii characters and cjk words at least (wrap_mode(1,N); //N>0). By the way, when wrap_mode(1,0) (wrap by pixel) is called, characters are wrapped according to the pixel width of the characters displayed. That would be useful for styled text if the bug for "wrap_mode(1,0)" is fixed too. Should I open another topic for that bug? Thanks.

#23	engelsman 00:11 Mar 31, 2010

Have you tested the patch against more text than in textdisplay.cxx? In any case, it looks like we're slowly eliminating problems :-) As for the wrap_mode(1, 0) problem, I'll need to poke around in the code to see what's happening as I haven't looked at that so far. And, yes, you are right. There is a lot of almost-duplicate code in there, several loops with an extra if statement in the header, or tangled with the loop body, so it's hard to get a handle on all code paths. And did I already say that this code was highly recursive... :-)

#24	sparkaround 23:13 Mar 31, 2010

I tested the patch in an Fl_Text_Display window and an Fl_Text_Editor window, but only for CJK and alphabetic characters. Now I have begun to test with the characters from AlbrechtS's example mixed in. And I found that "¤" (\xc2\xa4) is not wrapped correctly in either Fl_Text_Display or Fl_Text_Editor. Please see "texteditor.png". It seems that wrap_mode(1,0) can also be fixed :). Thanks.

#25	engelsman 00:02 Apr 01, 2010

can you tell me what you think that wrap_mode(1,0) does, or should do? the comment above the routine is not clear, nor in step with the code, so it could be that what you want it to do is not what it really does.

#26	sparkaround 00:41 Apr 01, 2010

I wish the characters could be wrapped by their real pixel width on screen. The style, or anything else that can affect the width on screen when the characters are rendered, should be considered. A simple implementation is just to wrap the characters at the current margin of the widget, and the horizontal scrollbar can be visible always. And if the widget is resized by the user, the lines are rewrapped at once according to the new width of the widget. I think this kind of implementation is very useful and popular in other toolkits. Even more: could it accept an argument such as max_pix_width? That is to say, if the width of the widget is larger than max_pix_width, the text would be wrapped at max_pix_width instead. Thanks.

#27	engelsman 01:25 Apr 01, 2010

I'm sorry, maybe I wasn't clear enough. I wasn't asking for a feature request, I was asking what you thought wrap_mode() would do. Basically, what would you expect to see in the documentation for:
- wrap_mode(0,0)
- wrap_mode(0,N) where N > 0
- wrap_mode(1,0)
- wrap_mode(1,N) where N > 0
The current description is clearly inadequate and therefore confusing. There is no mention of characters, pixels, widths or margins: "If mode is not zero, this call enables automatic word wrapping at column pos."
There are two possible issues here. You may expect wrap_mode(1,0) to do one thing, but in fact it is designed to do something completely different, which is a documentation problem. On the other hand, you may expect wrap_mode(1,0) to do one thing, and it is designed to do that, but it doesn't do what it is supposed to do, which is a bug. I still have to examine the code properly to work out what it is doing.

#28	sparkaround 02:10 Apr 01, 2010

Here is my lame understanding:
- wrap_mode(0,0): no wrap
- wrap_mode(0,N) where N > 0: no wrap
- wrap_mode(1,0): wrap by pixel. In fact, the characters are wrapped by words first. If word wrap fails, the characters are wrapped by character.
- wrap_mode(1,N) where N > 0: wrap by character. Just as above, word wrap is tried first, then character wrap.
"you may expect wrap_mode(1,0) to do one thing, and it is designed to do that, but it doesn't do what it is supposed to do, which is a bug." Yes, it is designed to do pixel wrap, without enough documentation, and it doesn't always do what it is supposed to do, which is a bug. Have I answered your question?
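The "word wrap first, then character wrap" behaviour described above for wrap_mode(1,N) can be sketched in a few lines of standalone C++. This is not FLTK's algorithm, only an illustration: `cols` and `wrap_columns` are invented names, the text is assumed to be already decoded to code points, only the ASCII space counts as a break opportunity, and the width rule (2 columns for CJK Unified Ideographs U+4E00 to U+9FFF, 1 otherwise) is a simplification.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical width rule: 2 columns for CJK Unified Ideographs, else 1.
int cols(char32_t cp) { return (cp >= 0x4E00 && cp <= 0x9FFF) ? 2 : 1; }

// Wrap text so that no line exceeds max_cols columns. A line is broken
// after its last space when it has one, otherwise at the character.
std::vector<std::u32string> wrap_columns(const std::u32string& text, int max_cols) {
    std::vector<std::u32string> lines;
    std::u32string line;
    int used = 0;                                  // columns used on this line
    for (char32_t cp : text) {
        const int w = cols(cp);
        while (used + w > max_cols && !line.empty()) {
            std::u32string carry;                  // partial word moved down
            const std::size_t sp = line.rfind(U' ');
            if (sp != std::u32string::npos) {
                carry = line.substr(sp + 1);       // word wrap
                line.erase(sp + 1);
            }                                      // else: character wrap
            lines.push_back(line);
            line = carry;
            used = 0;
            for (char32_t k : line) used += cols(k);
        }
        line += cp;
        used += w;
    }
    if (!line.empty()) lines.push_back(line);
    return lines;
}
```

With a limit of 10 columns, twenty 'a' characters come out as two lines of ten, and twenty ideographs come out as four lines of five, which is the layout reported earlier in this thread for the patched textdisplay.cxx test.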

#29	engelsman 14:00 Apr 01, 2010

Added wrap_mode10.cxx so that the problem is easily reproducible, and wrap_mode10.png showing result. Standard ascii chars wrap, but UTF-8 encoded chars do not. I hope you see the same effect as you resize the window. I shall start looking at this tomorrow...

#30	engelsman 06:16 Apr 04, 2010

I started to look at the wrap_mode(1,0) problem, and I think it will entail a lot more changes along the lines of what I did before. I see from the FLTK2 code that they work through the buffer one byte at a time, handle that byte, and then adjust the offset if that byte is the start of a UTF-8 sequence. On the one hand that's cleaner, but on the other it still doesn't address the column width problem because that requires knowing which UCS character we are dealing with. Therefore I either need to provide a whole series of additional routines that take a char* rather than a char, as I did above, and let each level extract the full UTF-8 byte sequence as needed, or I check for the UTF-8 sequence at the top level and then pass the UCS value rather than char to a series of new routines. Still haven't made up my mind on this one.
On a related note, let's go back to wcwidth() and mk_wcwidth(). The Linux man page for wcwidth() says that it returns 0 for U+0000, the number of columns needed for printable wide characters, and -1 for non-printable characters. The behaviour also depends on LC_CTYPE. Markus Kuhn's implementation returns 0 for U+0000, only standard(?) control characters and DEL return -1, and all others return 0, 1 or 2. There is no reference to locale specifics like LC_CTYPE in the code. If I build Markus Kuhn's implementation into FLTK-1.3.x xutf8 code and run the following:
<pre>
#include <stdio.h>
#include <wchar.h>
#include <FL/Fl.H>
#include <FL/fl_utf8.h>

int main(int argc, char *argv[]) {
  for (wchar_t ucs = 0; ucs < 0xFFFF; ucs++) {
    int w1 = wcwidth(ucs);
    int w2 = mk_wcwidth(ucs);
    if (w1 != w2)
      printf("U+%04x: wcwidth()=%2d, mk_wcwidth()=%2d\n", (unsigned)ucs, w1, w2);
  }
  return 0;
}
</pre>
I can see that there are a lot of characters that return -1 for the standard wcwidth(), but do return 0, 1, and 2 for mk_wcwidth(). I don't know whether the first is due to the limited number of locales that I have installed on this box or not. Although I have the feeling that providing a platform and locale(?) neutral implementation has its advantages, I wonder whether it might cause problems where wcwidth() is already being used elsewhere in the system. For example, where a system editor and the FLTK app show two different views of the same file. Should we be worried?
And finally, as far as I can see, there's no reason why we can't just add Markus Kuhn's wcwidth.c to the xutf8 directory provided we keep the copyright etc. I don't think there's even a need to edit the file. Therefore, I have it all ready to be committed in the next few days. Any comments?

#31	engelsman 10:18 Apr 11, 2010

Although progress has been made in identifying the causes of the first two issues raised (number and width of UTF-8 glyphs when wrapping), the Fl_Text_Buffer and Fl_Text_Display code is currently being refactored to remove some old code, and to verify that the UTF-8 code is consistent and correct. These issues will be considered during the refactoring. For the wrap_mode(1,0) problem, see http://www.fltk.org/str.php?L2343 D.

#32	engelsman 13:36 Apr 11, 2010

I've just attached str2158b.zip, which is another hacked exploration, this time based on Matt's partially refactored code in r7479. Unpack in a temporary directory so you don't zap current state. This appears to work for the wrap_text, textdisplay and wrap_mode10 test programs posted earlier, with the proviso that I have no CJK fonts on my box, so can't see the real output. There are still a lot of FIXME markers in the code. I also identified expandTabs() as one area that will never work with UTF-8. Possibly.

#33	engelsman 15:15 Apr 20, 2010

I've just committed revision 7551 which adds the call to fl_wcwidth() to the Fl_Text_Buffer::character_width() routine. I believe that all changes are in place to address the three issues:
- incorrect count of multi-byte UTF-8 characters, affecting margins;
- incorrect column widths for CJK characters;
- incorrect wrap_mode(1,0) handling of UTF-8 pixel wrapping.
Could you please verify that r7551 solves your issues, and confirm?

#34	sparkaround 20:22 Apr 20, 2010

Thank you! I have checked rev 7551. "wrap_mode(1,N) N>0" works well just as before except for "¤". But "wrap_mode(1,0)" sometimes does not work well. When I input more CJK characters, the margin of the last line is shortened and part of the characters in that line are moved to the current line. It seems that word wrap happens unexpectedly here. Please check 7551wrap10.png.

#35	engelsman 22:54 Apr 20, 2010

> "wrap_mode(1,N) N>0" works well just as before except for "¤". Is that one specific character? or a series of characters that display like that. Can you give the(ir) UCS value(s)? Are these non-spacing, compose or combining characters? Or just single CJK characters?

#36	sparkaround 23:30 Apr 20, 2010

The ucs value of "¤" is "\xa4\x00". The utf8 value of "¤" is "\xc2\xa4", just as AlbrechtS said. "¤" is the only specific character. Is this character wrapped correctly on your box? I guess a CJK font is not needed to display this character.
>Are these non-spacing, compose or combining characters? Or just single CJK characters?
Do you mean wrap_mode(1,0)? They are all single CJK characters. I guess some of the CJK characters were treated as spacing. In the image, the first utf8 byte of the CJK character just after the shortened line is '\xe5' or '\xe6'. The Fl_Text_Editor in rev 7551 or snapshot 7513 seems unstable. There are many new bugs related to editing, so I can't edit the CJK characters freely to find more characteristics.

#37	engelsman 01:07 Apr 21, 2010

As far as I remember from the code, the Unicode U+00A4 Currency Sign should not require any special handling. As far as the CJK wrapping is concerned, there are 3 reasons I can see:
1. the wcwidth() implementation has rules for certain characters, such as "Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)" and returns 0 for the column width instead of 2. For more details, see: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
2. the Fl_Text_* line-breaking algorithm is too simple, and looks only at "latin" space characters as a line-break, and hence wrapping point. Are you expecting the lines to wrap at a particular point based on your knowledge of CJK? Unfortunately, the Fl_Text_* line breaking and wrapping is based on a limited number of ascii whitespace characters. FLTK does not handle text formatting and layout for non-ascii characters, and it is unlikely to be added in the near future.
3. It is a bug.
One final observation: until now I have only looked at Fl_Text_Buffer and Fl_Text_Display. If you are running in Fl_Text_Editor, it may be that there are extra things to take into account that I have not yet looked at.

#38	sparkaround 02:36 Apr 21, 2010

1. wrap_mode(1,N) N>0

The Currency Sign will not be used frequently. If CJK-specific spacing characters can't be considered, I can try wrap_mode(1,0). For wrap_mode(1,N) N>0, I think the fix works quite well for me. Thanks.

2. wrap_mode(1,0)

> FLTK does not handle text formatting and layout for non-ascii characters,
> and it is unlikely to be added in the near future.

As far as I know, formatted ASCII text is counted well for wrap_mode(1,0). It is a pity if formatted CJK text can't be counted according to the style. For wrap_mode(1,0), CJK text will also be wrapped at non-spacing positions in Fl_Text_Display.

Another bug for wrap_mode(1,0): if the size of the Fl_Text_Display window is decreased in both directions, the lines do not change and are not rewrapped according to the new margin.

#39	engelsman 03:20 Apr 21, 2010

> It is a pity if formatted CJK text can't be counted according to
> the style.

Yes, it is, but the original FLTK was probably not designed with extensive non-Latin capabilities in mind. In fltk.development, the consensus over the past couple of days is that FLTK should only try to display Unicode characters. The problem is that most of the FLTK developers have no or limited exposure to working with non-ASCII or non-Latin text. But as we have seen in this STR, just displaying a complex character is not enough: additional knowledge is required about how to join those characters together using specific rules. It has been very useful that people like yourself, with experience in using these characters, could provide such invaluable feedback.

However, implementing general capabilities for character composition, "hyphenation" and "line breaking", sorting, and bi-directional or right-to-left text layout is a huge task, beyond the "fast light" goal of FLTK, even if we had the manpower, resources and expertise for it. Users who need such specialised functionality are probably already working with libraries such as icu4c or pango, or even something more localised developed specifically for that country or language/script.

> For wrap_mode(1,0), CJK text will also be wrapped at non-spacing
> position in Fl_Text_Display.

This needs further investigation.

> Another bug for wrap_mode(1,0):
> If the size of Fl_Text_Display window was decreased in both
> direction, the line will not changed and will not be wrapped
> according to the new margin.

When I tested against [a modified version of] the wrap_mode10.cxx file attached to this STR, I saw that the text flowed back and forth as the window was resized. Are CJK characters handled differently? Could you please provide a modified version of wrap_mode10.cxx that is populated with CJK characters, and report your findings?

#40	sparkaround 08:17 Apr 21, 2010

Thanks for the explanation. Please see "wrap_mode10_cjk_v2.png" for the wrapping at non-spacing positions and "wrap_mode10_cjk_v2.cxx" for the source code. "wrap_mode10_cjk_v2.cxx" is a modified version of "wrap_mode10.cxx". Please ignore "wrap_mode10_cjk.cxx" and "wrap_mode10_cjk.png".

The resize problem is not CJK specific. Please see "wrap_mode10_decreased.png". It is possible that redraw was not called when the size of the window was decreased in both the x-direction and the y-direction. There is no problem when the width or height of the window increases.

#41	engelsman 12:41 Apr 21, 2010

In the "Unicode character display page" thread on fltk.development http://www.fltk.org/newsgroups.php?gfltk.development+v:10221 Albrecht points out that Fl_Text_Editor still has problems on Windows. Are you also running on Windows? [It's not written in this STR]

#42	engelsman 13:36 Apr 21, 2010

I don't have the fonts installed to display wrap_mode10_cjk_v2.cxx properly, but I viewed characters U+6653 and above using the site: http://www.decodeunicode.org/en/u+6653/properties It doesn't report any obvious differences in character properties, so they should all be handled in the same way. But if I compile and run your example, I see [] for the characters, and there are a couple of "line-breaks" that remain in force, even as the window is enlarged or shrunk. More investigation required...

#43	engelsman 22:24 Apr 21, 2010

The small test program below [runs on Linux, don't know about Win/Mac] gives the results that I expected, namely the system wcwidth() returns -1 because I don't have a CJK-aware locale, and both the fl_wcwidth_() and fl_wcwidth() return 2. So it looks like there's another display bug triggered by wrap_mode(1,0) for at least some CJK characters.

<pre>
#include <stdio.h>
#include <wchar.h>
#include <FL/Fl.H>
#include <FL/fl_utf8.h>

int main(int argc, char* argv[])
{
    char utf8[10];
    for (unsigned int ucs = 0x6653; ucs < 0x66a3; ucs++) {
        int len = fl_utf8encode(ucs, utf8);
        printf("ucs: \\x%x", ucs);
        printf(" ");
        printf("utf-8: ");
        for (int i = 0; i < len; i++)
            printf("\\x%hhx", utf8[i]);
        printf(" ");
        printf("wcwidth: %2d", wcwidth((wchar_t)ucs));
        printf(" ");
        printf("mkwidth: %2d", fl_wcwidth_(ucs));
        printf(" ");
        printf("flwidth: %2d", fl_wcwidth(utf8));
        printf(" ");
        printf("\n");
    }
    return 0;
}
</pre>

#44	sparkaround 06:08 Apr 27, 2010

> It doesn't report any obvious differences in character properties, so they should all be handled in the same way. But if I compile and run your example, I see [] for the characters, and there are a couple of "line-breaks" that remain in force, even as the window is enlarged or shrunk. More investigation required...

Yes, some CJK characters were treated as "line-breaks" when only wrap_mode(1,0) is enabled.

It seems that I was wrong about the shrink problem: I also found many shrink problems in the demo code. There is a patch for the shrink problem (http://www.fltk.org/str.php?L2039). This patch works well on Fvwm at least.

#45	sparkaround 06:13 Apr 27, 2010

I have posted in that thread on fltk.development. The problem with Fl_Text_Editor is even worse on Windows.

#46	sparkaround 06:38 Apr 27, 2010

I tested the CJK characters near the first "line-break" with a modified version of your small program. Both wcwidth and fl_wcwidth give "2". The UCS-2 values of the characters tested are (little endian): "\xdb\x56" "\x27\x59" "\x86\x76" "\x7a\x7a".
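For readers who want to look these characters up, the little-endian byte pairs can be turned back into code points with a small sketch like the following. The helper ucs2le_decode() is a hypothetical name, not part of FLTK.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: combine two little-endian bytes into a BMP code point.
uint32_t ucs2le_decode(const std::string& bytes) {
    assert(bytes.size() == 2);
    uint32_t low  = static_cast<unsigned char>(bytes[0]);  // first byte is the low byte
    uint32_t high = static_cast<unsigned char>(bytes[1]);  // second byte is the high byte
    return low | (high << 8);
}
```

The four pairs above decode to U+56DB, U+5927, U+7686 and U+7A7A, all of which lie in the CJK Unified Ideographs block (U+4E00-U+9FFF), so a column width of 2 is the expected result.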

#47	AlbrechtS 07:43 Apr 27, 2010

WRT wrapping CJK characters: please see also STR #2162 [1]. There's a patch [2] from Timothy Lee that includes a new function:

    // Returns non-zero value if character allows line-break after it
    int fl_is_linebreak(unsigned int ucs) { ... }

I don't know if this is correct or if it would help wrapping at the correct character positions, but I thought it would be worth mentioning here. At first glance, this allows wrapping after all CJK characters and maybe more.

[1] http://www.fltk.org/str.php?L2162
[2] http://www.fltk.org/strfiles/2162/fltk-1.3-cjk_wrap.patch
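To show the general idea of such a predicate, here is a minimal sketch. It is NOT the body of fl_is_linebreak() from the patch, which is elided above; allows_break_after() is a hypothetical name, and the ranges are an assumption chosen only to cover ASCII whitespace, kana and the main CJK ideograph block.

```cpp
#include <cstdint>

// Hypothetical predicate, not the code from the STR #2162 patch.
// Returns true if a line may be broken after the given code point.
constexpr bool allows_break_after(uint32_t ucs) {
    return ucs == ' ' || ucs == '\t' || ucs == '\n'   // ASCII whitespace
        || (ucs >= 0x3040 && ucs <= 0x30FF)           // Hiragana and Katakana
        || (ucs >= 0x4E00 && ucs <= 0x9FFF);          // CJK Unified Ideographs
}

// Compile-time sanity checks of the sketch.
static_assert(allows_break_after(0x4E00), "ideographs allow a break");
static_assert(!allows_break_after('a'), "Latin letters do not");
```

A wrapping loop could consult a predicate like this instead of testing only for ASCII whitespace. A complete solution would also need rules for characters that must not start or end a line, such as closing and opening punctuation, which this sketch ignores.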

#48	engelsman 09:53 May 17, 2010

Attached unidecode.cxx, which is both a utility for displaying various info about a UCS character, and a testbed for the wrap_mode() problems. Writing this uncovered STR 2349. See http://www.fltk.org/str.php?L2349

#49	matt 00:37 Nov 08, 2010

This bug is getting very close to being solved for good. When using a fixed-width font, wrapping works well with all Unicode characters. A solution for variable-width fonts is close, though!

#50	matt 15:33 Nov 09, 2010

Fixed in Subversion repository.