FLTK logo

STR #3436

FLTK matrix user chat room
(using Element browser app)   FLTK gitter user chat room   GitHub FLTK Project   FLTK News RSS Feed  
  FLTK Library      Forums      Links      Apps     Login 
 Home  |  Articles & FAQs  |  Bugs & Features  |  Documentation  |  Download  |  Screenshots  ]
 

Return to Bugs & Features | Post Text | Post File | SVN ⇄ GIT | Prev | Next ]

STR #3436

Application:FLTK Library
Status:4 - Pending
Priority:3 - Moderate, e.g. unable to compile the software
Scope:3 - Applies to all machines and operating systems
Subsystem:Core Library
Summary:use of isspace(), ispunct(), and others must correctly test unicode characters
Version:1.4-current
Created By:AlbrechtS
Assigned To:AlbrechtS
Fix Version:Unassigned
Update Notification:

Receive EMails Don't Receive EMails

Trouble Report Files:

Post File ]
Name/Time/Date Filename/Size  
 
#1 AlbrechtS
08:31 Nov 17, 2023
check_isalpha
0k
 
 
#2 AlbrechtS
07:38 May 22, 2024
Тест.fl
0k
 
     

Trouble Report Comments:

Post Text ]
Name/Time/Date Text  
 
#1 AlbrechtS
08:14 Nov 17, 2017
This STR is a placeholder for the use of all functions like isspace() and ispunct(). These functions are not unicode aware and most of them, if not all, are defined for int's in the range -1, 0, .., 255, where -1 stands for EOF (end of file). Using these functions with int's outside this range yields "undefined" results.

This has two issues:

(1) moderate: the result is wrong (all platforms).

(2) severe: "Debug" builds of Visual Studio run into an 'assert' failure and the program is terminated. The outcome on other platforms is at least "undefined" (i.e. it may crash as well).

Note: Visual Studio "Release" builds don't fail but return wrong results.

See fltk.coredev, thread "editor fails on cyrillic symbols":
https://groups.google.com/d/msg/fltkcoredev/Yo3LN8jPe0A/TJBj-NzzDAAJ

"start test/editord.exe, copy 'oй' and paste into editor, press Ctrl+Left (previous word, etc) and the application fails!"

The above test scenario seems to be fixed now in FLTK 1.4 (but not in FLTK 1.3), but see Nikita's enumeration of other crashes in:
https://groups.google.com/d/msg/fltkcoredev/Yo3LN8jPe0A/tn8HE9cKDQAJ


Nikita's comment cited here in case the link above doesn't work:

1) Double click at any cyrillic word makes crash in Fl_Text_Editor, you
can test it with my previos example, even after Albrecht's patch.

2) The same action in Fl_Input gives the same result (inputd.exe helps you).

3) Trying to put cyrillic symbol after first @ in label makes crash too
(I used Fluid).

4) Trying to generate code in Fluid when fl file in in russian. E.g.
'Тест.fl'

In other words, I found places where isspace() (isalpha(), etc) is used
without mask ( & 255) and checked them. They are very suspicious ones.

- End of Citation -
 
 
#2 AlbrechtS
08:23 Nov 17, 2017
More detailed information: POSIX defines the following functions, some of them may be used in FLTK (or not). Reference man7.org, POSIX man pages:
http://man7.org/linux/man-pages/man3/isalpha.3p.html

isalpha, isalnum, isblank, iscntrl, isdigit, isgraph,
islower, isprint, ispunct, isspace, isupper, isxdigit

From the man page referenced above (isalpha):

"The c argument is an int, the value of which the application shall ensure is representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined."

Note: the Macro EOF is usually equivalent to the value (-1), but implementation defined.
 
 
#3 AlbrechtS
08:38 Nov 17, 2023
FWIW, I posted a shell script 'check_isalpha' that can be used to find all occurrences of the mentioned functions that are used in current FLTK (Git master, commit 44bb080c0ff81b16d48dccd8d15809f058cc68ea).

This needs more investigation:

1. Check if every usage of these functions is on the correct parameter types and that the return value is properly tested.

2. Functions like 'isupper()' should verify not to test parts (single bytes) of UTF-8 sequences.
 
 
#4 manolo
06:05 May 15, 2024
This was done :
1. Check if every usage of these functions is on the correct parameter types and that the return value is properly tested.

This was also done even if I believe it's not necessary because
UTF8 makes that every byte of the encoding of a non-ascii character is non-ascii :
2. Functions like 'isupper()' should verify not to test parts (single bytes) of UTF-8 sequences.

Only 2 changes were found necessary and committed at 9df6dc2.
 
 
#5 AlbrechtS
07:38 May 22, 2024
Reopening because the underlying problems are not resolved so it doesn't get forgotten.

Git commit 9df6dc2aeb37682a2ef1906e04b9e0436e63a6ba may have changed some minor issues but the overall problem still exists.

Addition to the original post: two more functions need to be considered:
 - toupper()
 - tolower()

The current status - tested with VS 2019 - *seems* to be that Visual Studio in Debug mode does no longer raise an assert error. Thus I lowered the priority to 3 (Moderate). I can't test with older or other VS versions though; there might be assert errors with older VS versions.

Recap and summary of the remaining problem:

1) All the mentioned functions can only called with ASCII characters or with 'unsigned char' values that are propagated by the call to 'int' in the range 0 ... 255.

2) Wherever we call one of the mentioned functions with a 'char' as argument, all non-ASCII (UTF-8) characters have the high bit set in each of their two to four bytes. Thus, propagation to 'int' (done by the compiler) calls the function(s) with invalid (negative) values. This results in undefined behavior - although it appears to be benign in many cases.

3) The "naïve fix" of (2) would be to mask each byte with '&255' or '&0xff' before calling any of these functions. An alternative would be to cast to 'unsigned [char]' This fixes the "negative value" issue but not the problem: it still results in unintended results. Example: the Euro character U+20ac is three bytes (hex: e2, 82, ac) in UTF-8 encoding. It is obvious that calling any of the functions either for the first or for all three bytes consecutively can't answer whether '€' is uppercase or lowercase or any of the other "questions".

3) Thus every occurrence of any of these functions must be considered in its context and - if necessary or useful - we'd have to call Unicode-capable replacement functions per *character* and not per byte.

Anyway, this should be kept open and fixed in the future. It should not block the release of 1.4.x though since it doesn't seem to be high priority anymore.

My take on this: let's fix it in 1.5.0.

An example of unexpected results can be seen in fluid if the filename contains non-ASCII characters, e.g. 'test.fl' in Russian (cyrillic alphabet) would be 'Тест.fl' (see uploaded file). The generated header 'Тест.h' looks like:
```
// Minimal fluid example
// generated by Fast Light User Interface Designer (fluid) version 1.0400

#ifndef __________h
#define __________h
#include <FL/Fl.H>
#include <FL/Fl_Box.H>
#endif
```

The '#ifndef' and '#define' lines are obviously not what we want, otherwise all header files with 4 (cyrillic) letters would output the same header guards.

This is *only* an example (the relevant code is in 'fluid/code.cxx').
 
 
#6 AlbrechtS
07:44 May 22, 2024
I didn't mention:

(1a) Calling any of these functions with "non-ASCII characters", i.e. any byte in the range 0x80 - 0xff depends on the current locale (aka codepage). This is no longer supported in FLTK, we require a UTF-8 capable locale and all text *must* be properly encoded as UTF-8.
 
 
#7 AlbrechtS
06:02 Jul 06, 2025
For further progress please view and edit GitHub Issue #1276:
https://github.com/fltk/fltk/issues/1276
 
     

Return to Bugs & Features | Post Text | Post File ]

 
 

Comments are owned by the poster. All other content is copyright 1998-2026 by Bill Spitzak and others. This project is hosted by The FLTK Team. Please report site problems to 'erco@seriss.com'.