STR #2158

Application: FLTK Library
Status: 1 - Closed w/Resolution
Priority: 5 - Critical, e.g. nothing working at all
Scope: 3 - Applies to all machines and operating systems
Subsystem: Core Library
Summary: In Fl_Text_Display, word wrapping does not work when the text is mixed with Unicode data.
Version: 1.3-current
Created By: sparkaround
Assigned To: matt
Fix Version: 1.3.0 (SVN: v7812)

Trouble Report Files:


#    Name         Time/Date            Filename                   Size
#1   engelsman    13:24 Apr 04, 2009   wrap_text.cxx              3k
#2   sparkaround  02:27 Mar 24, 2010   textdisplay.cxx            0k
#3   sparkaround  02:27 Mar 24, 2010   textdisplay.png            2k
#4   sparkaround  03:57 Mar 24, 2010   textdisplay2.png           2k
#5   engelsman    13:40 Mar 29, 2010   wrap_text2.cxx             3k
#6   engelsman    13:35 Mar 30, 2010   str2158.zip                60k
#7   sparkaround  22:57 Mar 31, 2010   texteditor.png             4k
#8   engelsman    13:57 Apr 01, 2010   wrap_mode10.cxx            2k
#9   engelsman    13:57 Apr 01, 2010   wrap_mode10.png            11k
#10  engelsman    13:30 Apr 11, 2010   str2158b.zip               58k
#11  sparkaround  20:12 Apr 20, 2010   7551wrap10.png             6k
#12  sparkaround  07:51 Apr 21, 2010   wrap_mode10_cjk.png        9k
#13  sparkaround  07:52 Apr 21, 2010   wrap_mode10_cjk.cxx        1k
#14  sparkaround  07:57 Apr 21, 2010   wrap_mode10_cjk_v2.cxx     1k
#15  sparkaround  07:57 Apr 21, 2010   wrap_mode10_cjk_v2.png     8k
#16  sparkaround  08:11 Apr 21, 2010   wrap_mode10_decreased.png  8k
#17  engelsman    09:48 May 17, 2010   unidecode.cxx              7k

Trouble Report Comments:


Name/Time/Date Text  
 
#1 sparkaround
14:37 Feb 18, 2009
After I enabled word wrapping with code like 'textdisplay->wrap_mode(1,80);', the Fl_Text_Display widget did not wrap lines at the same column.

It seems that the Unicode data is counted by bytes in its UTF-8 encoding.
But:
1. Unicode characters have different sizes in bytes when encoded as UTF-8.
2. In a Unicode environment, ASCII is probably rendered with a different font from the other characters, or at least with glyphs of a different pixel width.
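
To make the byte-versus-character distinction concrete, here is a minimal standalone sketch. It does not use FLTK; the function name utf8_char_count is made up for illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Hypothetical helper, not an FLTK function.
// Assumes the input is valid UTF-8.
std::size_t utf8_char_count(const std::string &s) {
  std::size_t n = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++n;  // ASCII byte or lead byte of a sequence
  }
  return n;
}
```

For a string such as "ab\xe5\x93\x88" (two ASCII letters followed by one three-byte character), size() reports 5 bytes while utf8_char_count() reports 3 characters, so a wrap column computed from bytes is reached too early.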
 
 
#2 engelsman
13:23 Apr 04, 2009
I've produced a test program that seems to have the behaviour that
you describe, where the string is wrapped at the byte number rather
than the character number. [There is a bug in the test code where I
could not enter the correct three byte UTF-8 value, but the code does
show the basic principle.]

I noticed that in the code for calculating the display width of
UTF-8 strings there is an optimisation when using fixed width fonts,
hence the Helvetica and Courier buttons on the test program, but the
font change does not make any difference to the wrap position.

This program was written on Linux, using fltk-1.3.x-r6741.
 
 
#3 AlbrechtS
17:41 Apr 04, 2009
U+0080 is _not_ the Euro sign (correct is U+20AC).

Try this instead:

char utfTwoByte[]   = "....\xc2\xb5.... ";      // U+00b5 (Micro sign µ)
char utfThreeByte[] = "....\xe2\x82\xac.... ";  // U+20ac (Euro sign €)

or maybe:

char utfTwoByte[]   = "....\xc2\xa4.... ";      // U+00a4 (Currency ¤)

The latter is interesting, because ISO-8859-15 defines x'a4' to be the
Euro sign.
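
The byte sequences above can be derived rather than looked up. The following is a minimal standalone sketch of UTF-8 encoding for code points up to U+FFFF; the name utf8_encode_bmp is an assumption for this example and is not an FLTK call.

```cpp
#include <string>

// Encode one code point as UTF-8. Hypothetical helper, limited to
// cp <= 0xFFFF for brevity; surrogate values are not rejected.
std::string utf8_encode_bmp(unsigned int cp) {
  std::string out;
  if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                               // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

With this, U+00B5 encodes to the bytes c2 b5, U+00A4 to c2 a4, and U+20AC to e2 82 ac, matching the literals above.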
 
 
#4 engelsman
00:13 Apr 05, 2009
As you know, I'm no expert on non-ASCII character encodings.
I just got the values from Markus Kuhn's Unicode FAQ:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
or related pages, but as I've had problems even getting the
character codes into the program, I've probably messed it up.

I also wonder/doubt whether I have the correct fonts loaded
to be able to display the characters properly anyway...

D.
 
 
#5 greg.ercolano
00:56 Apr 05, 2009
If your browser can display it, you probably have the right fonts at least somewhere on your system.

Under Linux, you probably have to tweak around with Fl::set_font() at the head of your app to get FLTK to show the UTF-8 strings correctly, since by default it seems to select fonts with limited glyphs.

I always eventually get things to work by noodling around with xfontsel and doing 'locate iso-8859' and sniffing through the 'man -k iso-8859' man pages, and by having samples of the font copy/pasted from a website or customer having the problem, and trying different fonts from xfontsel until it works.

Usually it's a matter of tracking down the right font to stick in Fl::set_font(); each rev of Linux seems a bit different. With XFT, I've been able to get e.g. Japanese to work with:

        Fl::set_font(FL_COURIER, "Kochi Gothic");

Without XFT, I usually have to mess with xfontsel. For European fonts, see eg. 'man iso-8859-1' which lists what the last digit means.
 
 
#6 ianmacarthur
03:56 Aug 28, 2009
I seem to be very late to this particular party, which is bad because it is entirely possible that some of the issues flagged are my fault...

Anyway, trawling through the STR, there are a few oddities and misconceptions, so in the hope that I can add some clarity I offer the following thoughts...

The Euro symbol:

The sample code uses U+0080 for this, which is wrong. Sort of. Here comes the mess...
The Unicode spec defines the range 0080 to 009F as non-printing control characters, but it seems like no one ever uses them.
Now... lots of web pages that claim to be UTF8 are actually in some MS code page, derived from CP125x, which *does* use the range 0080 to 009F for characters, and in CP125x 0080 *is* the Euro symbol.

So there is a sort of unofficial convention to assume that anything in the 0080 to 009F range is actually CP125x and map accordingly. Even fltk has some support for this, see the file fl_utf.c and search for the define ERRORS_TO_CP1252, which does exactly that.

However, I do not think we have ERRORS_TO_CP1252 enabled by default. So in fltk, 0080 is *not* a Euro sign, although it possibly could be...
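
As an illustration of that unofficial convention, here is a minimal standalone sketch that maps values in the 0x80 to 0x9F range to the code points Windows-1252 assigns to them. This is not the code from fl_utf.c, and the function name cp1252_fallback is made up for the example; the ERRORS_TO_CP1252 path in FLTK may differ in detail.

```cpp
// Map a value in 0x80..0x9F to the code point Windows-1252 assigns it.
// The five positions CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
// and all values outside the range are returned unchanged.
// Hypothetical helper for illustration only.
unsigned int cp1252_fallback(unsigned int c) {
  static const unsigned short table[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
  };
  if (c >= 0x80 && c <= 0x9F) return table[c - 0x80];
  return c;
}
```

Under this mapping 0x80 becomes U+20AC, which is why a page that is really CP1252 appears to use 0x80 for the Euro symbol.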

As Albrecht observes later in the thread, a more appropriate choice for the Euro sign is U+20AC, but just to add further confusion, the code point U+20A0 is defined as "EURO-CURRENCY SIGN", which is *not* the Euro symbol itself, but a standard symbol for currency... Confused yet?

Anyway... The odd wrapping kind of looks as if we are just wrapping a wee bit early on the lines with non-ASCII glyphs? It sort of mostly works, but is just a bit premature?
Is that what others are seeing?

For the record, I'm actually trying this on a WinXP box right now, so XFT behaviour might well be different...!

The OP doesn't say what platform they are on, but does mention that the non-ASCII chars seem to be rendered in a different font from everything else.
This possibly implies the OP is on Vista or OSX, either of which can do automatic substitution of fonts to handle "missing" glyphs.

Other platforms (XFT, Windows prior to Vista, etc.) don't do this auto-font-substitution but if we really want to, we probably can make them do it (albeit with a bit of extra work in fltk...)

Now, the OP speculates that the wrap code is counting the physical bytes rather than the glyphs, and this is quite possibly true.

The file fl_utf8.cxx provides the functions fl_utf8len() and fl_utf_nb_char() that we probably ought to be using, and I have a vague recollection that the text-editor stuff possibly has its own private, and slightly different, versions of these functions too.
No, I don’t know why.



Lastly, Greg has some comments about how to find a font that can render his text; For what it is worth, I have (somewhere) a program that I wrote for XFT+fltk, that will tell you whether a font has the glyphs needed to render your text... If I can find it, you are welcome to a copy.
In principle, it is "trivial" to extend that code to search all the installed fonts to find the one that has the fewest missing glyphs...

Also, in principle, it is possible with XFT to construct a "super font" as a "union" of several actual fonts, to obtain the extra glyph coverage. Fltk does not currently support that, but there are patches posted (by Timothy Lee IIRC) that take us in that direction... Maybe we need to look at those again?
 
 
#7 engelsman
00:49 Dec 06, 2009
I think that I might have isolated the problem in this particular
case, but I'm not sure whether any issues might pop up elsewhere
in Fl_Text_Display and Fl_Text_Buffer handling.

<pre>
int Fl_Text_Buffer::character_width(char c, int indent, int tabDist, char nullSubsChar) {
  /* Note, this code must parallel that in Fl_Text_Buffer::ExpandCharacter */
  if (c == '\t')
    return tabDist - (indent % tabDist);
  else if (((unsigned char) c) <= 31)
    return strlen(ControlCodeTable[ (unsigned char) c ]) + 2;
  else if (c == 127)
    return 5;
  else if (c == nullSubsChar)
    return 5;
  else if ((c & 0x80) && !(c & 0x40))
    return 0;
  else if (c & 0x80) {
    return fl_utf8len(c);
  }
  return 1;
}
</pre>

I think that the final call to fl_utf8len() is wrong in this case.
A Unicode character represented as a UTF-8 byte sequence consists
of a header byte that indicates how many bytes are in the sequence,
and one or more non-header bytes. fl_utf8len() returns the number
of bytes in the sequence if passed the header byte, and zero for
the non-header bytes.

The code above appears to correctly handle tabs, ASCII control and
"nul" characters, and the UTF-8 non-header byte. However, instead
of treating the UTF-8 header byte as indicating that the UTF-8 byte
sequence expands to ONE character, it expands it to the number of
bytes in the sequence.

Changing "fl_utf8len(c)" to "1" in the above solves this problem.

The Fl_Text_Display and Fl_Text_Buffer code is hard to understand,
and it may be that there are other areas where similar logic needs
to be investigated.

D.
 
 
#8 engelsman
09:29 Dec 13, 2009
Oops! I wrote:
    "fl_utf8len() returns the number of bytes in the sequence if
    passed the header byte, and zero for the non-header bytes."

but in fact it returns -1 for a non-header byte. However, in the
Fl_Text_Buffer::character_width() code shown, the non-header byte
was already handled in the "if ((c & 0x80) && !(c & 0x40))" line.

As far as I can tell, the other uses of fl_utf8len() appear correct,
but it might make it more obvious if there were functions (macros?)
such as:

fl_utf8Byte(c)      returns (c & 0x80)
fl_utf8FirstByte(c) returns ((c & 0x80) && (c & 0x40))
fl_utf8OtherByte(c) returns ((c & 0x80) && !(c & 0x40))

or whatever better naming scheme someone can come up with.

fl_draw.cxx already has C_IN(c) and C_UTF8(c) macros for some of it.
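
The three tests suggested above could be written as small inline functions. This is only a sketch: the names are hypothetical, none of this is FLTK API, and columns_for_byte() covers just the UTF-8 part of the logic, leaving out the tab and control-character cases that character_width() also handles.

```cpp
// Hypothetical helper names, standing in for the macros proposed above.
inline bool utf8_is_multibyte(char c)    { return (c & 0x80) != 0; }     // any byte of a multi-byte sequence
inline bool utf8_is_lead(char c)         { return (c & 0xC0) == 0xC0; }  // first byte: 11xxxxxx
inline bool utf8_is_continuation(char c) { return (c & 0xC0) == 0x80; }  // other bytes: 10xxxxxx

// Columns contributed by one byte when every character is one column wide:
// an ASCII byte or a lead byte counts 1, a continuation byte counts 0.
inline int columns_for_byte(char c) {
  return utf8_is_continuation(c) ? 0 : 1;
}
```

Summing columns_for_byte() over the three bytes e5 93 88 gives 1, which is the effect of replacing "fl_utf8len(c)" with "1" in the lead-byte branch.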
 
 
#9 sparkaround
02:39 Mar 24, 2010
This problem still exists in fltk-1.3.x-r7302.
I just uploaded test code "textdisplay.cxx" together with a screenshot ("textdisplay.png"). I expected the wide characters to be counted by the width of each character, not by the number of bytes in its UTF-8 encoding.

And I found a new problem: a line with colored characters is wrapped earlier than a plain line with the same font size, as if the line-wrap margin were shortened.
 
 
#10 engelsman
03:44 Mar 24, 2010
I proposed a solution on the 6th December, which is a one-line fix,
but I don't think that this has been changed in svn.

Can you please modify Fl_Text_Buffer::character_width() in your local
sources as described above [if looking at the STR view, not fltk.bugs]
and confirm whether this solves your problem or not?

Cheers
D.
 
 
#11 sparkaround
04:08 Mar 24, 2010
Thank you. The result really changed when I applied the fix. But the problem isn't solved completely.
I have uploaded the result as "textdisplay2.png".

In my example, the width of the selected wide character is 2. After I replaced the value 1 with 2, the example code works correctly:

  else if (c & 0x80) {
    //return fl_utf8len(c);
    //return 1;
    return 2;
  }

But this will only work for wide characters of width 2. Is there any way to fix this generally?
Thanks again.
 
 
#12 engelsman
05:21 Mar 24, 2010
It's 4 months since I looked at the code, and I didn't really understand
it properly then because it's all recursive and poorly refactored, but
if I remember correctly, there are different paths through the code for
constant width fonts and for proportional width fonts.

So it looks like my fix catches the problem with calculating the correct
number of characters (rather than bytes) for rendering in a fixed width
font. Now we need to look at handling the pixel width of the characters
when rendered in a proportional font.

I'm afraid there won't be a quick fix because:
- I don't have any spare time at the moment to look into it;
- I don't have the same locale and Chinese(?) fonts as you;
- I've forgotten how the code works;
- most important of all, this is probably the most obscure and poorly
  structured code that I've ever worked with...

D.
 
 
#13 engelsman
12:08 Mar 27, 2010
I've just started to look at the code again, and I could be wrong,
but it seems that the different code paths for proportional and
fixed width fonts have more to do with calculating maximum display
width for use with scrollbars. Everywhere else seems to look at
wrapping at a given column number, so it could be that your second
problem is just a feature of how the code works and not a bug.

But there's still a lot of code to check yet...
 
 
#14 sparkaround
23:15 Mar 27, 2010
Thank you!

And I also found that when wrap_mode is called with margin=0 (e.g. wrap_mode(1,0)), the text is wrapped based on the pixel width of the current widget. The style (font size) is taken into account, too!
But it is a pity that wrap_mode(1,0) is even worse for wide characters than the original wrap_mode(1,20): a line composed of wide characters is never wrapped!

It seems that wrap_mode(1,0) uses a method like the one that decides whether the scrollbar should be visible (duplicated code rather than a shared call?). This is done with the help of the font API, not utf8len/character-width length. Yes, this is another wrap method.

But for the original wrap method (wrap_mode(1,N), N>0), I still think it is a bug. There the width of a line is counted in bytes of the UTF-8 encoding, not in the real number of characters (as in your one-line fix) or in the column width of wide characters (using something like the wcwidth() API).
 
 
#15 AlbrechtS
01:54 Mar 28, 2010
FWIW, just a few notes to make clear what I think to see from the example pictures.

In FLTK 1.1 wrapping (wrap_mode(1,N), N>0) works only with fixed fonts correctly, because it only counts bytes (columns) of text.

Duncan, from what I can see, I think that your patch is almost correct, but only for LGC characters, where each glyph occupies one column position. Sparkaround showed an example with CJK characters, where each character occupies two "normal" (LGC) character positions, and hence his patch version with 2 columns instead of 1 column for each character works better for his example.

Conclusion(?): There is no such thing as a "fixed font width" in UTF-8, because LGC and CJK characters have _different_ widths.

Possible solution: to avoid calling fl_measure() or some other expensive method, we could maybe calculate the Unicode code point and use a table to decide whether that particular character has a width of 1 or 2 columns.

Sparkaround: Looking back at your post, I think that I just described what wcwidth() is intended to do. And for the record:

http://www.opengroup.org/onlinepubs/009695399/functions/wcwidth.html

But note in the above text: "This function was removed from the final ISO/IEC 9899:1990/Amendment 1:1995 (E), and the return value for a non-printable wide character is not specified."

There is also Markus Kuhn's free implementation:

http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
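
To show the shape of the table-lookup idea, here is a deliberately simplified standalone sketch. It is not Markus Kuhn's implementation: the function name simple_column_width is made up, the range list is an incomplete subset picked for illustration, and zero-width (combining) and control characters are not handled at all.

```cpp
// Simplified column-width lookup: 2 columns for a few East Asian wide
// ranges, 1 column for everything else. Hypothetical helper; the table is
// an incomplete subset and ignores combining and control characters.
int simple_column_width(unsigned int ucs) {
  struct Range { unsigned int first, last; };
  static const Range wide[] = {
    {0x1100, 0x115F},  // Hangul Jamo initial consonants
    {0x2E80, 0xA4CF},  // CJK radicals through Yi (coarse; a few narrow exceptions ignored)
    {0xAC00, 0xD7A3},  // Hangul syllables
    {0xF900, 0xFAFF},  // CJK compatibility ideographs
    {0xFF00, 0xFF60},  // fullwidth forms
  };
  for (const Range &r : wide) {
    if (ucs >= r.first && ucs <= r.last) return 2;
  }
  return 1;
}
```

A lookup like this avoids measuring glyphs: U+0061 and U+00A4 come out as 1 column, while a CJK ideograph such as U+54C8 comes out as 2.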
 
 
#16 engelsman
04:13 Mar 28, 2010
If you poke around on the NEdit web pages, there's the comment somewhere,
and I paraphrase here, that proportional fonts are not handled well
because nedit is really a programmers' editor, designed for use with
source code, with fixed width fonts and columns of characters. I think,
therefore, that we might be trying to shoe-horn something into this code
that is never going to fit properly. The code is not exactly clear and
easily maintainable now, never mind adding special case code. But then,
the latest release of NEdit was 5.5 in 2004, so it looks unlikely
that there will be new releases. So we don't need to remain compatible.

Unless we provide a completely new widget, written from the ground up,
we should probably be looking at making some clear statements in the
documentation (for wrap_mode() perhaps?) that explain that:

- wrapping is based on glyph pixel width only (no columns)
- wrapping is based on 1 column per character for fixed width LGC fonts
- wrapping is based on 2 column per character for fixed width CJK fonts
- mixing the last two may produce unexpected results.

Maybe we could extend wrap_mode() to have an additional parameter that
specifies the number of columns per character, with 0 for mixed fonts.
Just thinking aloud...
 
 
#17 engelsman
06:29 Mar 28, 2010
OK, been looking at this a bit further...

If we add Markus Kuhn's wcwidth.c into src/xutf8 for testing, we still
need to have a wrapper function that takes a char* pointing to the start
of the UTF-8 sequence, converts it to UCS, and then calls mk_wcwidth().

Plus, we will need to rejig Fl_Text_Buffer because at the moment we have
  int Fl_Text_Buffer::expand_character(char c, ...) and
  int Fl_Text_Buffer::character_width(char c, ...)
If we overload these to have versions that take a char* we might be able
to do it, but more investigation required.
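
The "convert to UCS" step of such a wrapper might look like the standalone sketch below. It is not fl_utf8decode(); the name utf8_decode and its error convention (return the first byte with a length of 1) are assumptions made for this example, and overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>

// Decode the UTF-8 sequence starting at s, with n bytes available.
// On success stores the sequence length in *len and returns the code point.
// On malformed or truncated input returns the first byte and sets *len to 1.
// Hypothetical helper; overlong encodings and surrogates are not checked.
unsigned int utf8_decode(const char *s, std::size_t n, int *len) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
  if (n == 0) { *len = 0; return 0; }
  *len = 1;
  unsigned int c = p[0];
  unsigned int cp;
  int need;  // number of continuation bytes expected
  if (c < 0x80) return c;                                    // plain ASCII
  else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; }  // 110xxxxx
  else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; }  // 1110xxxx
  else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; }  // 11110xxx
  else return c;                     // stray continuation or invalid lead byte
  if (n < static_cast<std::size_t>(need) + 1) return c;      // truncated
  for (int i = 1; i <= need; ++i) {
    if ((p[i] & 0xC0) != 0x80) return c;                     // not 10xxxxxx
    cp = (cp << 6) | (p[i] & 0x3F);
  }
  *len = need + 1;
  return cp;
}
```

The returned code point is what a column-width function would take as input, and *len tells the caller how many bytes to skip before the next character.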
 
 
#18 engelsman
10:42 Mar 28, 2010
After a lot of dicking around trying to add wcwidth.c to libfltk_xutf8.a
and failing, I finally just #include'd it directly in Fl_Text_Buffer.cxx
and set up:
<pre>
int Fl_Text_Buffer::expand_character(const char* s, int indent, char *outStr, int tabDist,
                                      char nullSubsChar) {
  char c = *s;
  if ((c & 0x80) && (c & 0x40)) {
    int len = fl_utf8len(c);
    int ret = 0;
    unsigned int ucs = fl_utf8decode(s, s+len, &ret);
    int width = mk_wcwidth((wchar_t)ucs);
    fprintf(stderr, "mk_wcwidth(%x) -> %d\n", ucs, width);
    return width;
  }
  return expand_character(c, indent, outStr, tabDist, nullSubsChar);
}

int Fl_Text_Buffer::character_width(const char* s, int indent, int tabDist, char nullSubsChar) {
  char c = *s;
  if ((c & 0x80) && (c & 0x40)) {
    int len = fl_utf8len(c);
    int ret = 0;
    unsigned int ucs = fl_utf8decode(s, s+len, &ret);
    int width = mk_wcwidth((wchar_t)ucs);
    fprintf(stderr, "mk_wcwidth(%x) -> %d\n", ucs, width);
    return width;
  }
  return character_width(c, indent, tabDist, nullSubsChar);
}
</pre>
to call the existing versions, and started to substitute the calls
for the originals with the new ones, but then you need to create
a lookalike for Fl_Text_Buffer::character(pos) that returns a char*
and so on. I got so far, but then the textdisplay example stopped
displaying the Chinese characters, even though I could see that the
above were calling mk_wcwidth(0x54c8) and returning 2.

Then I discovered that there's a whole load of calculations based on
reverse mapping from column numbers on the display to pixel widths.
At this point, I realised this is just one big can of worms, and it
probably means that my other idea for adding a column indicator to
wrap_mode() will probably be difficult to get working too :-(
 
 
#19 engelsman
12:22 Mar 28, 2010
Sorry guys, bit of brain-fade there earlier. I only needed the extra
character_width() implementation, not the expand_character().

That gives me two rows of 10 'a' and 4 rows of 5 '[]' because I don't
have the correct Chinese locale and fonts.

It still might not work correctly as part of Fl_Text_Editor with the
reverse character to column mapping though.

I haven't tested it beyond the 'textdisplay.cxx' at the moment, and I
will need to tidy up the source before I paste it here for testing in
sparkaround's real environment...
 
 
#20 engelsman
13:50 Mar 29, 2010
Just posted wrap_text2.cxx, a correction to wrap_text.cxx where the
three utf8* string variables now contain dots, ucs, dots, space with:
U+0024 (\x24)         dollar
U+00A9 (\xc2\xa9)     copyright sign
U+20AC (\xe2\x82\xac) euro sign
after spending ages wondering why Markus Kuhn's wcwidth() returned -1
for the 2-byte literal euro symbol I had cut'n'pasted from somewhere.
As Albrecht and Ian point out above, the euro wasn't what I thought.
 
 
#21 engelsman
13:59 Mar 30, 2010
I've just appended str2158.zip to the STR as a proof of concept.
The zip file contains hacked versions of the following:

- FL/Fl_Text_Buffer.H
  with declarations for characterP() and character_width(const char*...)

- src/Fl_Text_Buffer.cxx
  with their implementations and calls to the new routines

- src/Fl_Text_Display.cxx
  with calls to the new routines

- src/xutf8/wcwidth.c
  Markus Kuhn's wcwidth() implementation as discussed above

Unfortunately, I've had these files hanging around for a while since
I started the debugging in November 2009, so there are various other
changes that show up if you just try to diff against the originals.
However, if you search for the characterP() and character_width()
calls, you can see that in most cases the following line is a
comment containing the original line preceded by a drg: tag.

As I say, this is a dirty hack to check proof of concept, and I don't
know how complete it really is, or whether other problems will arise
when tested against more text. As mentioned yesterday, I found that
the euro character I had cut'n'pasted from somewhere into wrap_text.cxx
was not really the valid UTF-8 encoding of the correct UCS character,
and comes from the fact that HTML browsers often infer that 0x80 must
come from Windows-1252 rather than ISO-8859-1. Markus Kuhn's code
correctly returns -1 because 0x80 should be the PAD character...

Unpack the zip file in a temp directory to be on the safe side, and
make sure you save the original files before overwriting them. I've
included wcwidth.c directly into Fl_Text_Buffer.cxx rather than add
it to the libraries and link it in. This can be tidied up later.

Does this solve your problem, sparkaround?
 
 
#22 sparkaround
19:28 Mar 30, 2010
Thank you so much! It works well for ASCII characters and CJK words at least (wrap_mode(1,N); //N>0).

By the way, when wrap_mode(1,0) (wrap by pixel) is called, characters are wrapped according to the pixel width of the characters displayed. That would be useful for styled text if the bug in "wrap_mode(1,0)" were fixed too. Should I open another topic for that bug? Thanks.
 
 
#23 engelsman
00:11 Mar 31, 2010
Have you tested the patch against more text than in textdisplay.cxx?
In any case, it looks like we're slowly eliminating problems :-)

As for the wrap_mode(1, 0) problem, I'll need to poke around in the
code to see what's happening as I haven't looked at that so far.

And, yes, you are right. There is a lot of almost-duplicate code in
there, several loops with an extra if statement in the header, or
tangled with the loop body, so it's hard to get a handle on all code
paths. And did I already say that this code was highly recursive... :-)
 
 
#24 sparkaround
23:13 Mar 31, 2010
I tested the patch in an Fl_Text_Display window and an Fl_Text_Editor window, but only for CJK and alphabetic characters.

Now I have begun to test them mixed with the characters from AlbrechtS's example. I found that "¤" (\xc2\xa4) was not wrapped correctly in either Fl_Text_Display or Fl_Text_Editor. Please see "texteditor.png".

It seems that wrap_mode(1,0) can also be fixed :). Thanks.
 
 
#25 engelsman
00:02 Apr 01, 2010
Can you tell me what you think that wrap_mode(1,0) does, or should do?
The comment above the routine is not clear, nor in step with the code,
so it could be that what you want it to do is not what it really does.
 
 
#26 sparkaround
00:41 Apr 01, 2010
I wish the characters could be wrapped by their real pixel width on screen.
The style, or anything else that affects the width on screen when the characters are rendered, should be considered.

A simple implementation is just to wrap the characters at the current margin of the widget, and the horizontal scrollbar can always be visible. If the widget is resized by the user, the lines are rewrapped at once according to the new width of the widget. I think this kind of implementation is very useful and popular in other toolkits.

Even more: could it accept an argument such as max_pix_width? That is to say, if the width of the widget is larger than max_pix_width, the text would be wrapped at max_pix_width instead.

Thanks.
 
 
#27 engelsman
01:25 Apr 01, 2010
I'm sorry, maybe I wasn't clear enough. I wasn't asking for a feature
request, I was asking what you thought wrap_mode() would do. Basically,
what would you expect to see in the documentation for:

- wrap_mode(0,0)
- wrap_mode(0,N) where N > 0
- wrap_mode(1,0)
- wrap_mode(1,N) where N > 0

The current description is clearly inadequate and therefore confusing.
There is no mention of characters, pixels, widths or margins:

  "If mode is not zero, this call enables automatic word wrapping at
   column pos."

There are two possible issues here. You may expect wrap_mode(1,0) to
do one thing, but in fact it is designed to do something completely
different, which is a documentation problem. On the other hand, you
may expect wrap_mode(1,0) to do one thing, and it is designed to do
that, but it doesn't do what it is supposed to do, which is a bug.

I still have to examine the code properly to work out what it is doing.
 
 
#28 sparkaround
02:10 Apr 01, 2010
Here is my lame understanding:

- wrap_mode(0,0)
  no wrap
- wrap_mode(0,N) where N > 0
  no wrap
- wrap_mode(1,0)
  wrap by pixel.
  In fact, the characters are wrapped by word first. If word wrap fails,
  the characters are wrapped by character.
- wrap_mode(1,N) where N > 0
  wrap by character
  Just as above, word wrap is tried first, then character wrap.

"you
may expect wrap_mode(1,0) to do one thing, and it is designed to do
that, but it doesn't do what it is supposed to do, which is a bug."

Yes, it is designed to do pixel wrap, without enough documentation, and it doesn't always do what it is supposed to do, which is a bug.

Have I answered your question?
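
The "word wrap first, then character wrap" behaviour described for wrap_mode(1,N) can be sketched as follows. This is a standalone illustration for plain ASCII text, not the Fl_Text_Display algorithm; the function name wrap_columns is made up, only the space character is treated as a break point, and pixel wrapping (N = 0) is out of scope.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one ASCII paragraph into lines of at most `cols` characters.
// Prefer breaking at the last space that fits; if a word is longer than
// the line, break in the middle of it. Hypothetical helper.
std::vector<std::string> wrap_columns(const std::string &text, std::size_t cols) {
  std::vector<std::string> lines;
  if (cols == 0) {               // pixel wrapping is not modelled here
    lines.push_back(text);
    return lines;
  }
  std::size_t start = 0;
  while (text.size() - start > cols) {
    // Last space at or before the column limit, if any.
    std::size_t brk = text.rfind(' ', start + cols);
    if (brk != std::string::npos && brk > start) {
      lines.push_back(text.substr(start, brk - start));  // word wrap
      start = brk + 1;                                    // drop the space
    } else {
      lines.push_back(text.substr(start, cols));          // character wrap
      start += cols;
    }
  }
  lines.push_back(text.substr(start));
  return lines;
}
```

Because this sketch counts one column per byte, it shows the same weakness the thread is about: feeding it UTF-8 text would break lines by byte count, and could even split a multi-byte character.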
 
 
#29 engelsman
14:00 Apr 01, 2010
Added wrap_mode10.cxx so that the problem is easily reproducible,
and wrap_mode10.png showing result. Standard ascii chars wrap,
but UTF-8 encoded chars do not. I hope you see the same effect as
you resize the window. I shall start looking at this tomorrow...
 
 
#30 engelsman
06:16 Apr 04, 2010
I started to look at the wrap_mode(1,0) problem, and I think it will
entail a lot more changes along the lines of what I did before. I see
from the FLTK2 code that they work through the buffer one byte at a
time, handle that byte, and then adjust the offset if that byte is the
start of a UTF-8 sequence. On the one hand that's cleaner, but on the
other it still doesn't address the column width problem because that
requires knowing which UCS character we are dealing with.

Therefore I either need to provide a whole series of additional routines
that take a char* rather than a char, as I did above, and let each level
extract the full UTF-8 byte sequence as needed, or I check for the UTF-8
sequence at the top level and then pass the UCS value rather than char
to a series of new routines. Still haven't made up my mind on this one.

On a related note, let's go back to wcwidth() and mk_wcwidth().

The Linux man page for wcwidth() says that it returns 0 for U+0000,
the number of columns needed for printable wide characters, and -1
for non-printable characters. The behaviour also depends on LC_CTYPE.

Markus Kuhn's implementation returns 0 for U+0000, only standard(?)
control characters and DEL return -1, and all other return 0, 1, 2.
There is no reference to locale specifics like LC_CTYPE in the code.

If I build Markus Kuhn's implementation into FLTK-1.3.x xutf8 code
and run the following:

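#include <stdio.h>   // printf()
#include <wchar.h>   // wcwidth()
#include <FL/Fl.H>
#include <FL/fl_utf8.h>

// mk_wcwidth() comes from Markus Kuhn's wcwidth.c built into the library.
int main(int argc, char *argv[]) {
  for (wchar_t ucs = 0; ucs < 0xFFFF; ucs++) {
    int w1 = wcwidth(ucs);      // system implementation, locale dependent
    int w2 = mk_wcwidth(ucs);   // locale-independent implementation
    if (w1 != w2)
      printf("U+%04x: wcwidth()=%2d, mk_wcwidth()=%2d\n", (unsigned int)ucs, w1, w2);
  }
  return 0;
}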

I can see that there are a lot of characters that return -1 for the
standard wcwidth(), but do return 0, 1, and 2 for mk_wcwidth(). I don't
know whether the first is due to the limited number of locales that I
have installed on this box or not.

Although I have the feeling that providing a platform and locale(?)
neutral implementation has its advantages, I wonder whether it might
cause problems where wcwidth() is already being used elsewhere in the
system. For example, where a system editor and the FLTK app show two
different views of the same file. Should we be worried?

And finally, as far as I can see, there's no reason why we can't just
add Markus Kuhn's wcwidth.c to the xutf8 directory provided we keep
the copyright etc. I don't think there's even a need to edit the file.
Therefore, I have it all ready to be committed in the next few days.

Any comments?
 
 
#31 engelsman
10:18 Apr 11, 2010
Although progress has been made in identifying the causes of the first
two issues raised (number and width of UTF-8 glyphs when wrapping), the
Fl_Text_Buffer and Fl_Text_Display code is currently being refactored
to remove some old code, and to verify that the UTF-8 code is consistent
and correct. These issues will be considered during the refactoring.

For the wrap_mode(1,0) problem, see http://www.fltk.org/str.php?L2343

D.
 
 
#32 engelsman
13:36 Apr 11, 2010
I've just attached str2158b.zip, which is another hacked exploration,
this time based on Matt's partially refactored code in r7479.
Unpack in a temporary directory so you don't zap current state.

This appears to work for the wrap_text, textdisplay and wrap_mode10
test programs posted earlier, with the proviso that I have no CJK
fonts on my box, so can't see the real output.

There's still a lot of FIXME markers in the code. I also identified
expandTabs() as one area that will never work with UTF-8. Possibly.
 
 
#33 engelsman
15:15 Apr 20, 2010
I've just committed revision 7551 which adds the call to fl_wcwidth()
to the Fl_Text_Buffer::character_width() routine.

I believe that all changes are in place to address the three issues:
- incorrect count of multi-byte UTF-8 characters, affecting margins;
- incorrect column widths for CJK characters
- incorrect wrap_mode(1,0) handling of UTF-8 pixel wrapping

Could you please verify that r7551 solves your issues, and confirm?
 
 
#34 sparkaround
20:22 Apr 20, 2010
Thank you!
I have checked rev 7551. "wrap_mode(1,N) N>0" works well just as before, except for "¤". But "wrap_mode(1,0)" sometimes does not work well. When I input more CJK characters, the margin of the last line is shortened and some of the characters in that line are moved to the current line. It seems that word wrap happens unexpectedly here. Please check 7551wrap10.png.
 
 
#35 engelsman
22:54 Apr 20, 2010
> "wrap_mode(1,N) N>0" works well just as before except for "¤".

Is that one specific character? or a series of characters that
display like that. Can you give the(ir) UCS value(s)?

Are these non-spacing, compose or combining characters?
Or just single CJK characters?
 
 
#36 sparkaround
23:30 Apr 20, 2010
The UCS value of "¤" is "\xa4\x00".
The UTF-8 value of "¤" is "\xc2\xa4", just as AlbrechtS said.
"¤" is the only such character. Was this character wrapped correctly on your box? I guess a CJK font is not needed to display it.

>Are these non-spacing, compose or combining characters?
>Or just single CJK characters?

Do you mean for wrap_mode(1,0)? They are all single CJK characters. I guess some of the CJK characters were treated as spacing. In the image, the first UTF-8 byte of the CJK character just after the shortened line is '\xe5' or '\xe6'. The Fl_Text_Editor in rev 7551 or snapshot 7513 seems unstable. There are many new bugs related to editing, so I can't edit the CJK characters freely to find more characteristics.
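
The byte values quoted in this comment follow directly from the UTF-8 encoding rules: U+00A4 is below U+0800 and so takes two bytes, while CJK ideographs in the U+5000 to U+6FFF range take three bytes starting with \xe5 or \xe6. A minimal sketch of the encoding step, assuming a hypothetical utf8_encode() helper (FLTK's own routine is fl_utf8encode()); surrogates and range checks are left out.

```cpp
#include <string>

// Hypothetical helper: encode one code point (up to U+10FFFF) as UTF-8.
// No validation of surrogates or out-of-range values.
std::string utf8_encode(unsigned int ucs) {
  std::string out;
  if (ucs < 0x80) {
    out += static_cast<char>(ucs);
  } else if (ucs < 0x800) {
    out += static_cast<char>(0xC0 | (ucs >> 6));
    out += static_cast<char>(0x80 | (ucs & 0x3F));
  } else if (ucs < 0x10000) {
    out += static_cast<char>(0xE0 | (ucs >> 12));
    out += static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (ucs & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (ucs >> 18));
    out += static_cast<char>(0x80 | ((ucs >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (ucs & 0x3F));
  }
  return out;
}
```

With this sketch, utf8_encode(0xA4) yields the two bytes \xc2 \xa4, matching the value given for "¤".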
 
 
#37 engelsman
01:07 Apr 21, 2010
As far as I remember from the code, the Unicode U+00A4 Currency Sign
should not require any special handling.

As far as the CJK wrapping is concerned, there are 3 reasons I can see:

1. the wcwidth() implementation has rules for certain characters, such
   as "Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)"
   and returns 0 for the column width instead of 2. For more details,
   see: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

2. the Fl_Text_* line-breaking algorithm is too simple, and looks only
   at "latin" space characters as a line-break, and hence wrapping
   point. Are you expecting the lines to wrap at a particular point
   based on your knowledge of CJK?

   Unfortunately, the Fl_Text_* line breaking and wrapping is based
   on a limited number of ascii whitespace characters. FLTK does not
   handle text formatting and layout for non-ascii characters, and it
   is unlikely to be added in the near future.

3. It is a bug.
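
Reason 1 can be illustrated with a much reduced width function. This is a sketch only, not Markus Kuhn's wcwidth.c and not fl_wcwidth(): approx_width() is a hypothetical name and only a handful of ranges are covered.

```cpp
// Hypothetical, heavily simplified column-width lookup. The real
// wcwidth.c uses full tables of combining and wide characters.
int approx_width(unsigned int ucs) {
  if (ucs == 0) return 0;
  if (ucs < 0x20 || (ucs >= 0x7F && ucs < 0xA0)) return -1;  // control chars
  if (ucs >= 0x1160 && ucs <= 0x11FF) return 0;  // Hangul Jamo medial/final
  if (ucs >= 0x0300 && ucs <= 0x036F) return 0;  // combining diacriticals
  if ((ucs >= 0x1100 && ucs <= 0x115F) ||        // Hangul Jamo initial
      (ucs >= 0x4E00 && ucs <= 0x9FFF) ||        // CJK Unified Ideographs
      (ucs >= 0xAC00 && ucs <= 0xD7A3))          // Hangul syllables
    return 2;
  return 1;
}
```

A caller that sums these widths to find the wrap column sees nothing added for a zero-width character, which is how text containing such characters can end up wrapping at a different point than expected.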

One final observation: until now I have only looked at Fl_Text_Buffer
and Fl_Text_Display. If you are running in Fl_Text_Editor, it may be
that there are extra things to take into account that I have not yet
looked at.
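
The whitespace-only breaking described in point 2 can be sketched as follows. This is an illustration under assumed names (wrap_columns() is hypothetical), not the Fl_Text_Display algorithm: it counts one column per byte and breaks only at an ASCII space, so a run of text with no spaces is only ever split by the hard-break fallback.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: greedy wrap at ASCII spaces only, one column per
// byte. A run longer than `cols` with no space is cut at `cols`.
std::vector<std::string> wrap_columns(const std::string& text, std::size_t cols) {
  std::vector<std::string> lines;
  if (cols == 0) {  // nothing sensible to do; return the text unwrapped
    lines.push_back(text);
    return lines;
  }
  std::size_t start = 0;
  while (text.size() - start > cols) {
    // Look for the last space at or before column `cols` of this line.
    std::size_t brk = text.rfind(' ', start + cols);
    if (brk == std::string::npos || brk < start) {
      lines.push_back(text.substr(start, cols));  // hard break mid-run
      start += cols;
    } else {
      lines.push_back(text.substr(start, brk - start));
      start = brk + 1;  // skip the space that caused the break
    }
  }
  lines.push_back(text.substr(start));
  return lines;
}
```

Note that the byte-based hard break in this sketch can land inside a multi-byte UTF-8 sequence, which is one reason byte counts and column counts need to be kept apart.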
 
 
#38 sparkaround
02:36 Apr 21, 2010
1. wrap_mode(1,N) N>0
Currency Sign will not be used frequently.
If CJK's specific spacing characters can't be considered, I can try wrap_mode(1,0).

For wrap_mode(1,N) N>0, I think the fix works quite well for me. Thanks.

2. wrap_mode(1,0)
>FLTK does not handle text formatting and layout for non-ascii characters,
>and it is unlikely to be added in the near future.
As far as I know, formatted ASCII text is counted correctly for wrap_mode(1,0). It is a pity if formatted CJK text can't be counted according to the style.

For wrap_mode(1,0), CJK text will also be wrapped at non-spacing position in Fl_Text_Display.

Another bug for wrap_mode(1,0):
If the size of the Fl_Text_Display window is decreased in both directions, the lines will not change and will not be rewrapped according to the new margin.
 
 
#39 engelsman
03:20 Apr 21, 2010
> It is a pity if formatted CJK text can't be counted according to
> the style.

Yes, it is, but the original FLTK was probably not designed with
extensive non-Latin capabilities in mind. In fltk.development, the
consensus over the past couple of days is that FLTK should only try
to display Unicode characters. The problem is that most of the FLTK
developers have no or limited exposure to working with non-ascii or
non-latin text. But as we have seen in this STR, just to display a
complex character is not enough: additional knowledge is required
about how to join those characters together using specific rules.
It has been very useful that people like yourself, with experience
in using these characters, could provide such invaluable feedback.

However, to implement general capabilities for character composition,
"hyphenation" and "line breaking", sorting, bi-directional or right
to left text layout is a huge task, and beyond the "fast light" goal
of FLTK. Even if we had the manpower, resources and expertise for it.
Users who need such specialised functionality are probably already
working with libraries such as icu4c or pango, or even something more
localised developed specifically for that country or language/script.

> For wrap_mode(1,0), CJK text will also be wrapped at non-spacing
> position in Fl_Text_Display.

This needs further investigation.

> Another bug for wrap_mode(1,0):
> If the size of the Fl_Text_Display window is decreased in both
> directions, the lines will not change and will not be rewrapped
> according to the new margin.

When I tested against [a modified version of] the wrap_mode10.cxx
file attached to this STR, I saw that the text flowed back and forth
as the window was resized. Are CJK characters handled differently?
Could you please provide a modified version of wrap_mode10.cxx that
is populated with CJK characters, and report your findings?
 
 
#40 sparkaround
08:17 Apr 21, 2010
Thanks for the explanation.

Please see "wrap_mode10_cjk_v2.png" for the wrapping at non-spacing and "wrap_mode10_cjk_v2.cxx" for the source code.

"wrap_mode10_cjk_v2.cxx" is a modified version of "wrap_mode10.cxx".
Please ignore "wrap_mode10_cjk.cxx" and "wrap_mode10_cjk.png".

The resized problem is not CJK specific. Please see "wrap_mode10_decreased.png".
It is possible that redraw was not called when the window size was decreased in both the x-direction and the y-direction. There is no problem when the width or height of the window increases.
 
 
#41 engelsman
12:41 Apr 21, 2010
In the "Unicode character display page" thread on fltk.development
http://www.fltk.org/newsgroups.php?gfltk.development+v:10221

Albrecht points out that Fl_Text_Editor still has problems on Windows.

Are you also running on Windows? [It's not written in this STR]
 
 
#42 engelsman
13:36 Apr 21, 2010
I don't have the fonts installed to display wrap_mode10_cjk_v2.cxx
properly, but I viewed characters U+6653 and above using the site:
http://www.decodeunicode.org/en/u+6653/properties

It doesn't report any obvious differences in character properties,
so they should all be handled in the same way. But if I compile and
run your example, I see [] for the characters, and there are a couple
of "line-breaks" that remain in force, even as the window is enlarged
or shrunk. More investigation required...
 
 
#43 engelsman
22:24 Apr 21, 2010
The small test program below [runs on Linux, don't know about Win/Mac]
gives the results that I expected, namely the system wcwidth() returns
-1 because I don't have a CJK-aware locale, and both the fl_wcwidth_()
and fl_wcwidth() return 2. So it looks like there's another display bug
triggered by wrap_mode(1,0) for at least some CJK characters.

<pre>
#include <stdio.h>
#include <wchar.h>

#include <FL/Fl.H>
#include <FL/fl_utf8.h>

// For each code point in a small CJK range, print its UTF-8 bytes and
// the column width reported by the system and by the two FLTK routines.
int main(int argc, char* argv[]) {
  char utf8[10];
  for (unsigned int ucs = 0x6653; ucs < 0x66a3; ucs++) {
    int len = fl_utf8encode(ucs, utf8);
    printf("ucs: \\x%x", ucs);
    printf("  ");
    printf("utf-8: ");
    for (int i = 0; i < len; i++)
      printf("\\x%hhx", (unsigned char)utf8[i]);
    printf("  ");
    // No setlocale() call, so wcwidth() runs in the default "C" locale.
    printf("wcwidth: %2d", wcwidth((wchar_t)ucs));
    printf("  ");
    printf("mkwidth: %2d", fl_wcwidth_(ucs));
    printf("  ");
    printf("flwidth: %2d", fl_wcwidth(utf8));
    printf("  ");
    printf("\n");
  }
  return 0;
}
</pre>
 
 
#44 sparkaround
06:08 Apr 27, 2010
>It doesn't report any obvious differences in character properties,
>so they should all be handled in the same way. But if I compile and
>run your example, I see [] for the characters, and there are a couple
>of "line-breaks" that remain in force, even as the window is enlarged
>or shrunk. More investigation required...

Yes, some CJK characters are treated as "line-breaks", but only when wrap_mode(1,0) is enabled.

It seems that I was wrong about the shrinking problem. I also found many shrinking problems in the demo code.
There is a patch for the shrinking problem (http://www.fltk.org/str.php?L2039). This patch works well on Fvwm at least.
 
 
#45 sparkaround
06:13 Apr 27, 2010
I have posted in that thread on fltk.development.
The problem with Fl_Text_Editor is even worse on Windows.
 
 
#46 sparkaround
06:38 Apr 27, 2010
I tested the CJK characters near the first "line-break" with a modified version of your small program. Both wcwidth and fl_wcwidth give "2".

The UCS-2 values of the characters tested are (little endian):
"\xdb\x56" "\x27\x59" "\x86\x76" "\x7a\x7a".
 
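Read as little-endian byte pairs, the four values above are the code points U+56DB, U+5927, U+7686 and U+7A7A. A minimal sketch of that conversion, with ucs2le_to_codepoint() as a hypothetical helper name:

```cpp
// Hypothetical helper: combine two UCS-2 little-endian bytes into a
// code point (low byte first, high byte second).
unsigned int ucs2le_to_codepoint(unsigned char lo, unsigned char hi) {
  return static_cast<unsigned int>(lo) |
         (static_cast<unsigned int>(hi) << 8);
}
```

For example, ucs2le_to_codepoint(0xdb, 0x56) gives 0x56DB, which can then be passed to a width routine that takes a UCS value.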
 
#47 AlbrechtS
07:43 Apr 27, 2010
WRT wrapping CJK characters: please see also STR #2162 [1]. There's a patch [2] from Timothy Lee that includes a new function:

// Returns non-zero value if character allows line-break after it
int fl_is_linebreak(unsigned int ucs) { ... }

I don't know if this is correct or if it would help wrapping at the correct character positions, but I thought it was worth mentioning here. At first glance, it appears to allow wrapping after all CJK characters and maybe more.

[1] http://www.fltk.org/str.php?L2162
[2] http://www.fltk.org/strfiles/2162/fltk-1.3-cjk_wrap.patch
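
The body of fl_is_linebreak() is elided above and is not reproduced here. As a rough illustration of the idea only, a predicate of this shape might allow a break after ASCII whitespace and after CJK ideographs. may_break_after() is a hypothetical name and the ranges are assumptions, not those of Timothy Lee's patch.

```cpp
// Hypothetical sketch, not the patch from STR #2162: returns non-zero
// if a line break may follow this character.
int may_break_after(unsigned int ucs) {
  if (ucs == ' ' || ucs == '\t') return 1;       // ASCII whitespace
  if (ucs >= 0x3040 && ucs <= 0x30FF) return 1;  // Hiragana, Katakana
  if (ucs >= 0x4E00 && ucs <= 0x9FFF) return 1;  // CJK Unified Ideographs
  return 0;
}
```

Real CJK line breaking also forbids breaks before certain punctuation, which this sketch ignores.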
 
 
#48 engelsman
09:53 May 17, 2010
Attached unidecode.cxx, which is both a utility for displaying various
info about a UCS character, and a testbed for the wrap_mode() problems.
Writing this uncovered STR 2349. See http://www.fltk.org/str.php?L2349
 
 
#49 matt
00:37 Nov 08, 2010
This bug is getting very close to being solved for good. When using a fixed font, wrapping works well with all Unicode characters. A solution for variable-width fonts is close, though!
 
#50 matt
15:33 Nov 09, 2010
Fixed in Subversion repository.  
     


 
 

Comments are owned by the poster. All other content is copyright 1998-2026 by Bill Spitzak and others. This project is hosted by The FLTK Team. Please report site problems to 'erco@seriss.com'.