STR #2822: Fl_Input UTF-8 handling - Fast Light Toolkit (FLTK)

STR #2822

Application:	FLTK Library
Status:	5 - New
Priority:	1 - Request for Enhancement, e.g. asking for a feature
Scope:	3 - Applies to all machines and operating systems
Subsystem:	Unicode support
Summary:	Fl_Input UTF-8 handling
Version:	1.4-feature
Created By:	chris
Assigned To:	Unassigned
Fix Version:	Unassigned

Trouble Report Files:

No files

Trouble Report Comments:

#1	chris 10:37 Apr 12, 2012

Using Fl_Input::insert() with real UTF-8 strings, I found the documentation and/or the functionality of the positional parameters not clear/sufficient for my purpose.

Citing my email to fltk.general from 2012/04/12:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I have a question regarding the use of the Fl_Input::insert() method in combination with UTF-8 strings.

If I want to insert a text after the second UTF-8 character, it would seem natural to use:

---- snip ----
#include <FL/Fl_Input.H>
#include <stdio.h>

int main()
{
    Fl_Input t(0,0,0,0);
    t.value( "ДБИЯ" );
    t.position( 2 );
    printf( "t.value(): '%s' size=%d\n", t.value(), t.size());
    t.insert( "Ж" );
    printf( "t.value(): '%s' size=%d\n", t.value(), t.size());
    printf( "t.position(): %d\n", t.position());
}
---- snip ----

But it looks like position(2) sets the position to the byte offset 2 and not after the second *UTF-8 character*, as the outcome is:

---- snip ----
t.value(): 'ДБИЯ' size=8
t.position(): 2
t.value(): 'ДЖБИЯ' size=10
t.position(): 4
---- snip ----

How is it supposed to be done and is this the desired behaviour?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I don't know whether changing the behaviour, so that position means 'UTF-8 character position' and NOT 'byte offset', would break the API, so this might be an issue for 1.4 or 3.0.

I also imagine that other methods of Fl_Input could suffer from the same issue (size(), ...).
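For readers hitting the same problem, here is a minimal sketch of how a character index could be turned into the byte offset that position() expects. The helper name utf8_byte_offset() is hypothetical and not part of FLTK; the sketch assumes the string is valid UTF-8 and simply skips continuation bytes (those of the form 10xxxxxx).

```cpp
#include <string>

// Hypothetical helper, not an FLTK function: returns the byte offset at which
// the character with index char_index starts in a valid UTF-8 string.
// Indices past the end yield s.size(); negative indices yield 0.
int utf8_byte_offset(const std::string& s, int char_index) {
    const int n = static_cast<int>(s.size());
    int i = 0;
    while (i < n && char_index > 0) {
        ++i;  // step over the lead byte of the current character
        // step over any continuation bytes (bit pattern 10xxxxxx)
        while (i < n && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
        --char_index;
    }
    return i;
}
```

With such a helper, the call in the report could presumably be written as t.position( utf8_byte_offset( t.value(), 2 ) ), which would place the cursor at byte offset 4, i.e. after the second Cyrillic character.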

#2	chris 04:38 Apr 14, 2012

Regarding documentation: Forget my complaint. I had looked at the current document on the FLTK website, which is from 9/2011. Since then the in-source documentation of Fl_Input_ has improved considerably in this respect and now makes clear how the positional parameters are to be used.

So what remains is an RFE to extend the API to make it possible to use UTF-8 character positions as arguments to Fl_Input::replace() instead of byte offsets. Fl_Text_Buffer::replace() would then be another candidate for such an improvement.

#3	ianmacarthur 16:32 Mar 11, 2014

Though that change might be ABI breaking, so not for 1.3...? Thoughts? Do we close this, move it to 1.4, or other?

#4	ianmacarthur 15:59 Sep 04, 2014

Ping: Is this STR still "active", or can we consider it for closure as "won't fix", or move it to 1.4 or something...?

#5	chris 22:37 Sep 04, 2014

... Pong! I would say move it to 1.4. From my view it is still a MUST for a toolkit that operates with UTF-8 to hide the byte positions from the user code and go for character positions in all its API. But if you think I am wrong with this view, you may also close it - I have my wrappers in place...

#6	ianmacarthur 03:47 Sep 05, 2014

I'll move it to 1.4 for now. Our desire to maintain the same API as 1.1 makes changing to glyph-based rather than byte-based positions tricky, but it would be a better option in a future variant!

#7	AlbrechtS 12:12 Sep 18, 2014

Changed priority to RFE, since this is what it is (now).

Summary:
(1) the documentation improvements are satisfied
(2) the RFE is for an _additional_ API with characters instead of bytes

I don't believe that we would want to break the API, even in a new major version. Please correct me if I'm wrong.

#8	matt 07:12 Jan 20, 2023

You can convert between number of bytes and number of characters with:

int fl_utf_nb_char(const unsigned char *buf, int len);

and

int fl_utf8strlen(const char *text, int len)
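To illustrate the bytes-to-characters direction without depending on FLTK, here is a rough stand-in. The function utf8_char_count() is a hypothetical name, not an FLTK function; it assumes valid UTF-8 and counts every byte that is not a continuation byte. The FLTK functions named above may treat invalid sequences differently.

```cpp
// Hypothetical stand-in, not an FLTK function: counts the UTF-8 characters
// that start within the first len bytes of text (valid UTF-8 assumed).
int utf8_char_count(const char* text, int len) {
    int count = 0;
    for (int i = 0; i < len; ++i) {
        // A character starts at every byte that is not of the form 10xxxxxx.
        if ((static_cast<unsigned char>(text[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string from the original report, this would map the byte size 8 to 4 characters, and a byte offset of 4 to 2 characters.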

#9	AlbrechtS 07:51 Jan 20, 2023

@Matt: I don't think that the info you posted helps in any way to solve the issue. What we'd need is a bunch of new methods that let the user input a character index rather than a byte count (offset).

We have, for instance:

int Fl_Input_::replace(int b, int e, const char *text, int ilen = 0);

Docs say: "Deletes text from b to e and inserts the new string text. ..."

In this method `b' and `e' are byte offsets in the string buffer. IMHO we can't change this because it would break all programs that use it.

What we'd need is an additional method, like (maybe):

int Fl_Input_::replace_char(int b, int e, const char *text, int ilen = 0);

where `b' and `e' are *character* indices (or offsets) in the string, such that you could replace text in a specific *column* (or columns) of an Fl_Input widget (provided you use a fixed font for displaying in "columns").

This is only one example of many, and the postfix "_char" could also be "_utf8" or "_uc" (for unicode) or anything else. The problem is that such methods would need UTF-8 character counting from the beginning of the buffer to determine the byte positions.

And last but not least: this would need a **bunch** of new methods that do character counting to find the correct byte offsets, not only in Fl_Input but also in many other widgets. At least if we wanted to be "complete" by any means.

I'm not sure if this is feasible, particularly since UTF-8 handling in FLTK is by now pretty old and well established. OTOH it could simplify user code if we had it, at the price of a lot more methods and character counting in many widgets.
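As a rough illustration of what such a character-indexed variant would have to do, here is a sketch that works on a plain std::string rather than on a widget. Both function names are hypothetical and nothing here is FLTK API; valid UTF-8 is assumed, and swapping `b' and `e' when they are out of order is a choice made for this sketch only. Note the linear scan from the start of the buffer, which is the cost mentioned above.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: byte offset of the character with index char_index,
// found by scanning from the start of the buffer (valid UTF-8 assumed).
static int char_to_byte(const std::string& s, int char_index) {
    const int n = static_cast<int>(s.size());
    int i = 0;
    while (i < n && char_index > 0) {
        ++i;  // lead byte
        while (i < n && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;  // continuation bytes
        }
        --char_index;
    }
    return i;
}

// Hypothetical free-function analogue of the proposed replace_char():
// b and e are *character* indices; the characters in [b, e) are replaced
// by text and the resulting string is returned.
std::string replace_char(const std::string& s, int b, int e,
                         const std::string& text) {
    if (b > e) std::swap(b, e);  // sketch-only convenience
    const int byte_b = char_to_byte(s, b);
    const int byte_e = char_to_byte(s, e);
    std::string out = s;
    out.replace(byte_b, byte_e - byte_b, text);
    return out;
}
```

Called as replace_char("ДБИЯ", 2, 2, "Ж"), this sketch would return "ДБЖИЯ", which is the result the original report expected from position(2) followed by insert().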