I have some trouble with a botched filename encoding involving the German letter ß (Unicode U+00DF).
Example:
> ls -1
TeÃ?t
Teßt
Te?t
> ls -1b
TeÃ\302\237t
Teßt
Te\302\237t
File 2 has the correct name, file 1 is the observed botched encoding, and file 3 is just there for test purposes.
If I run duperemove -hdr ... on a hierarchy including such a file, the terminal will freeze as soon as it is supposed to display one of the ‘?’ marked characters. I am able to reproduce this somewhat with find . -print0 and with ls -1 | cat.
The files with the botched encoding have been on my system for some time; the problem surfaced only with the update to Leap 15.6, since invoking duperemove from the terminal was a regular exercise on 15.4 and 15.5.
Does anyone have an idea whether this might be a configuration issue, or a glitch due to some new underlying library?
Terminal sessions have become quite resilient over the past decades against accidental control character emission from displaying, e.g., binary files, so this comes as a bit of a surprise to me. Hence my suspicion that some library update might be involved…
Hoping for some ideas here, before I file a bug report.
You forgot to tell what terminal emulator you are using, but yes, I can reproduce this effect in GNOME Terminal, more precisely this one
(I need ls -1N, because in my case it defaults to escaping non-printable names). The difference is that ls -1N still replaces invalid characters with ? when printing to a terminal:
So the terminal emulator gets an invalid UTF-8 character. That is kind of “garbage in - garbage out”.
That is wrong in any case. You cannot send a binary zero to a terminal and expect sane results. But yes, here it is the same effect - the terminal gets raw file names, including the invalid UTF-8 character.
And no, nothing “stops”. Part of the output text is lost, that’s all.
Since you mention it, I checked ‘ls -1 | cat’ on the virtual text console, in xterm, and in a byobu terminal - there the invalid characters are simply omitted and the rest of the text is displayed. byobu used TERM=screen-256color and xterm used TERM=xterm, but setting either of these in Konsole does not help.
I wrote that I could reproduce the issue “somewhat”: the freezing behaviour I mentioned happens when running duperemove, which does not terminate for a long time. The screen acts as if ^S had been pressed, and no further output appears until, e.g., I enter ^Z followed by “fg”. If “ls” did not terminate, the experience would likely be the same, in that all following output would be lost.
That seems to narrow the issue to Konsole, I guess.
It is possible. Every now and then I have had zypper appear “frozen” when running under Konsole, while in reality only the output in the terminal window was stuck and zypper itself may have completed long ago.
I take it back. The UTF-8 sequence $'\302\237' is Unicode U+009F, which is actually a C1 control character (APC, Application Program Command). It is often used to start an inline command sequence sent to a terminal. So a plausible explanation is that Konsole is waiting for a valid command sequence, including its final terminator(s).
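For anyone who wants to verify this, here is a minimal Python sketch that decodes the two bytes shown by ls -1b and asks the Unicode database about the result. The variable names are just for illustration.

```python
import unicodedata

raw = b"\xc2\x9f"          # the bytes that `ls -1b` prints as \302\237
ch = raw.decode("utf-8")   # well-formed UTF-8, decodes to one code point

print(f"U+{ord(ch):04X}")          # U+009F
print(unicodedata.category(ch))    # Cc (control character)
print(0x80 <= ord(ch) <= 0x9F)     # True: inside the C1 control range
```

So the byte sequence is valid UTF-8; it just happens to encode a control character rather than a printable one.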
Again - you cannot just send an arbitrary binary string to a terminal (emulator) and expect a useful result.
It is not me sending the file names to the terminal, but duperemove.
Of course there are control characters which cause a terminal to react. I had already noticed that the Unicode code point is defined as “control”; however, I would never have guessed that someone would use these for terminal control. One would think that the usual suspects in ASCII are sufficient. Thank you for pointing that out.
Basically this is another upgrade (15.5->15.6) which breaks in an unexpected manner (as expected). One major reason (for me) not to use something like Tumbleweed.
The files in question had been unpacked from a ZIP file, as far as I recall. My preferred solution would be a ban on certain characters or character sequences in filenames, i.e. a file naming policy. Is there a way to accomplish this? So far I have not found a mechanism for it.
I would just like to rule out that such a mechanism exists before I conjure up a script for file name checking.
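In case it helps, a checking script along those lines could look roughly like the following. This is only a sketch under my own assumptions: the policy (reject names that are not valid UTF-8 or that contain any character of Unicode category Cc) and the helper names name_problems and scan are made up for illustration, not taken from any existing tool.

```python
import os
import unicodedata


def name_problems(name: bytes) -> list:
    """Return the reasons why a raw file name violates the (assumed) policy."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return ["not valid UTF-8"]
    # Category "Cc" covers both the ASCII (C0) and the C1 control characters.
    return [
        f"control character U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cc"
    ]


def scan(root: bytes):
    """Yield (path, problems) for every entry below root with a suspicious name."""
    # Passing bytes to os.walk makes it return raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            problems = name_problems(name)
            if problems:
                yield os.path.join(dirpath, name), problems
```

To use it, loop over scan(b".") and print repr(path) together with the problems. Printing the repr of the bytes path keeps the report itself free of raw control characters, so the check does not trigger the very problem it is looking for.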
My best guess is that the file names were stored in an 8-bit character set, such as a DOS/Windows code page, where both 0x9f and 0xc2 are valid characters. I am not sure whether the ZIP format stores the file name encoding.
No. File names are an arbitrary sequence of bytes. That ship sailed long ago.
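One way such a name can arise, offered here as an assumption rather than something established for these particular files, is double encoding: the UTF-8 bytes of ß get misread as ISO-8859-1 and the result is written out as UTF-8 again. The Python sketch below reproduces exactly the bytes that ls -1b shows for the first file.

```python
good = "Teßt"
utf8 = good.encode("utf-8")            # b'Te\xc3\x9ft'

# Misread the UTF-8 bytes as ISO-8859-1: every byte becomes one character,
# so 0xC3 turns into 'Ã' and 0x9F turns into the control character U+009F.
misread = utf8.decode("latin-1")

# Store the misread text as UTF-8 again.
botched = misread.encode("utf-8")
print(botched)                         # b'Te\xc3\x83\xc2\x9ft'

# Reversing the two steps recovers the intended name.
repaired = botched.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == good)                # True
```

If that is what happened, the same reverse transformation could be used to rename the affected files, but only after checking each candidate name by hand.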