STR #3436: use of isspace(), ispunct(), and others must correctly test unicode characters

Application:	FLTK Library
Status:	5 - New
Priority:	4 - High, e.g. key functionality not working
Scope:	3 - Applies to all machines and operating systems
Subsystem:	Core Library
Summary:	use of isspace(), ispunct(), and others must correctly test unicode characters
Version:	1.4-current
Created By:	AlbrechtS
Assigned To:	AlbrechtS
Fix Version:	Unassigned

Trouble Report Files:

#1	AlbrechtS 08:31 Nov 17, 2023

check_isalpha
0k

Trouble Report Comments:

#1	AlbrechtS 08:14 Nov 17, 2017

This STR is a placeholder for the use of all functions like isspace() and ispunct(). These functions are not Unicode-aware, and most of them, if not all, are defined only for ints in the range -1, 0, ..., 255, where -1 stands for EOF (end of file). Using these functions with ints outside this range yields "undefined" results.

This causes two issues:

(1) moderate: the result is wrong (all platforms).
(2) severe: "Debug" builds of Visual Studio run into an 'assert' failure and the program is terminated. The outcome on other platforms is "undefined" as well (i.e. it may also crash).

Note: Visual Studio "Release" builds don't fail but return wrong results.

See fltk.coredev, thread "editor fails on cyrillic symbols":
https://groups.google.com/d/msg/fltkcoredev/Yo3LN8jPe0A/TJBj-NzzDAAJ

"start test/editord.exe, copy 'oй' and paste into editor, press Ctrl+Left (previous word, etc) and the application fails!"

The above test scenario seems to be fixed now in FLTK 1.4 (but not in FLTK 1.3), but see Nikita's enumeration of other crashes in:
https://groups.google.com/d/msg/fltkcoredev/Yo3LN8jPe0A/tn8HE9cKDQAJ

Nikita's comment, cited here in case the link above doesn't work:

1) Double-clicking any Cyrillic word crashes Fl_Text_Editor. You can test it with my previous example, even after Albrecht's patch.
2) The same action in Fl_Input gives the same result (inputd.exe helps you).
3) Trying to put a Cyrillic symbol after the first @ in a label crashes too (I used Fluid).
4) Trying to generate code in Fluid when the fl file is in Russian, e.g. 'Тест.fl'.

In other words, I found the places where isspace() (isalpha(), etc.) is used without a mask (& 255) and checked them. They are very suspicious ones.

- End of Citation -
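
The range restriction described above can be illustrated with a small guard. This is a minimal sketch in C, not FLTK code: safe_isspace() is a hypothetical helper name, and limiting the check to ASCII (0..127) is an assumption made here so that the result does not depend on the current locale.

```c
#include <ctype.h>

/* Hypothetical helper (not part of FLTK).
   Returns 1 only if c is an ASCII whitespace character.
   Values outside 0..127, such as Unicode code points or negative
   ints produced by signed chars, are rejected before isspace()
   is ever called, so the call itself is always well-defined. */
static int safe_isspace(int c) {
    if (c < 0 || c > 127) return 0;  /* assumption: ASCII-only classification */
    return isspace(c) != 0;
}
```

Unlike masking with '& 255', which maps a code point such as U+0439 ('й') onto the unrelated byte 0x39 ('9'), this guard reports "not a space" for anything outside the valid range.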

#2	AlbrechtS 08:23 Nov 17, 2017

More detailed information: POSIX defines the following functions; some of them may be used in FLTK (or not).

Reference: man7.org, POSIX man pages:
http://man7.org/linux/man-pages/man3/isalpha.3p.html

isalpha, isalnum, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit

From the man page referenced above (isalpha):

"The c argument is an int, the value of which the application shall ensure is representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined."

Note: the macro EOF usually has the value (-1), but its actual value is implementation-defined.
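
The quoted requirement matters most when the argument comes from a plain char, which may be signed, so that the bytes of a UTF-8 sequence turn into negative ints. A minimal sketch in C, using a hypothetical helper count_alpha_bytes() that is not part of FLTK, converts each byte to unsigned char before the call:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of FLTK).
   Counts the bytes of s that isalpha() classifies as alphabetic.
   The cast to unsigned char keeps every argument in 0..255, which
   is the range the POSIX text requires. */
static size_t count_alpha_bytes(const char *s) {
    size_t n = 0;
    for (; *s; ++s) {
        unsigned char uc = (unsigned char)*s;  /* never negative */
        if (isalpha(uc)) ++n;
    }
    return n;
}
```

The cast only makes the call well-defined. It does not make it Unicode-aware: each byte of a multi-byte UTF-8 character is still classified on its own.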

#3	AlbrechtS 08:38 Nov 17, 2023

FWIW, I posted a shell script 'check_isalpha' that can be used to find all occurrences of the mentioned functions that are used in current FLTK (Git master, commit 44bb080c0ff81b16d48dccd8d15809f058cc68ea).

This needs more investigation:

1. Check that every use of these functions passes the correct parameter type and that the return value is properly tested.
2. Verify that functions like 'isupper()' are not applied to parts (single bytes) of multi-byte UTF-8 sequences.
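
Point 2 can be sketched as follows. This is an illustration in C under stated assumptions, not FLTK code: utf8_seq_len() and count_ascii_upper() are hypothetical names, the input is assumed to be UTF-8, and only ASCII letters are classified. Multi-byte sequences are skipped as a whole instead of being tested byte by byte.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that
   starts with the given lead byte. Continuation or invalid bytes
   yield 1 so that the caller always makes progress. */
static int utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;            /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Hypothetical helper: counts ASCII uppercase letters in a UTF-8
   string. isupper() is called only on single-byte (ASCII)
   characters, never on a byte that belongs to a longer sequence. */
static size_t count_ascii_upper(const char *s) {
    size_t n = 0;
    while (*s) {
        unsigned char lead = (unsigned char)*s;
        int len = utf8_seq_len(lead);
        if (lead < 0x80 && isupper(lead)) ++n;
        /* Skip the whole sequence, but stop at the terminator
           if the string is truncated. */
        for (int i = 0; i < len && *s; ++i) ++s;
    }
    return n;
}
```

A non-ASCII uppercase letter such as 'Й' is skipped rather than counted here. Classifying such characters would require decoding the sequence to a code point and using a Unicode-aware test, which is outside the scope of this sketch.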