Midnight Commander (mc) issue with special characters (≥ 0x80)

I’m not sure if this is the right place, but lately, when I hit Enter on an image file with a special character in its name or path (in the sense of ≥ U+0080, for example German umlauts (äöü) or special punctuation (’, “, ”, …)), nothing happens. Pressing F3 doesn’t invoke the display command to print out the image file properties (resolution, image type, EXIF data); only the hex view of the file is shown. Special characters below 0x80, i.e. ', ", *, etc., on the other hand are no problem.

My (and the system’s) mc.ext is the stock one, without alterations.

If I rename the file to add .mp4, mplayer starts and tries to play it when I hit Enter, which means the video file detection accepts it but the image file detection does not. If I rename the file to something without the suspicious character, it works again.
The problematic character can be anywhere in the path, so /home/alex/töst.jpg will not work, and /home/alex/töst/file.jpg won’t work either.

My locale is en_US.UTF-8. If I change the locale to “C”, mc shows the file name as /home/alex/t??st.jpg, but still no luck in opening the file.

It must be a recent thing, because I have images in countless directories where the path has some of these special characters in it, and they worked at least until a few weeks ago.

It is not a general mc issue, as on my work computer (running Ubuntu with the same mc version) it works.

The same is happening for me. MC won’t open or process files with German umlauts. This behaviour started about a week ago after “zypper dup” on my Tumbleweed PC. Not a major bug, but very annoying.


roninbee@suse:~> localectl  
   System Locale: LANG=en_US.UTF-8
       VC Keymap: de-latin1-nodeadkeys
      X11 Layout: de
       X11 Model: pc105
     X11 Variant: nodeadkeys
     X11 Options: terminate:ctrl_alt_bksp

While I’m glad I’m not the only one with this issue, on my system only image files are affected. I have various videos and their cover image, and while I can play the video by hitting enter, I cannot view the cover image by selecting that and pressing enter.
The way you wrote it seems like it is happening for all files…

I reported a bug in Midnight Commander’s bug tracker (http://midnight-commander.org/ticket/4377), and it appears the ‘file’ command is the culprit: it mangles the file name in its output.

commit f448f3e5c37de8c285ac14b032b2bdcea82fc08b
Author: Christos Zoulas <christos@zoulas.com>
Date:   Sat May 28 01:04:57 2022 +0000


    PR/351: CathyKMeow: octalify unprintable characters in filenames unless raw.

The commit first appears in file 5.42. The “raw” option has been available for 20 years, so it should be safe to use by default today; but it also applies to the result (not only to file names), so it may require additional changes in MC.

file just checks each single byte for being printable, which of course fails for UTF-8 multi-byte characters. You should really open a bug report against file; the change does not look right.
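
To illustrate the difference, here is a minimal Python sketch. It is not file's actual code; the function names `escape_bytewise` and `escape_charwise` are made up for this example, and the "printable" test is simplified to the printable ASCII range. It only shows why a per-byte check octal-escapes every UTF-8 multi-byte character, while a per-character check would leave the name alone.

```python
def escape_bytewise(name: str) -> str:
    """Octal-escape every byte outside printable ASCII, one byte at a time."""
    out = []
    for b in name.encode("utf-8"):
        if 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            # Each byte of a multi-byte UTF-8 sequence lands here.
            out.append(f"\\{b:03o}")
    return "".join(out)


def escape_charwise(name: str) -> str:
    """Escape only characters that are unprintable after decoding."""
    return "".join(c if c.isprintable() else f"\\{ord(c):03o}" for c in name)


print(escape_bytewise("/tmp/töst.jpg"))  # /tmp/t\303\266st.jpg
print(escape_charwise("/tmp/töst.jpg"))  # /tmp/töst.jpg
```

A program that compares the escaped name against the real one, as mc apparently does, no longer finds a match in the first case.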

P.S. Besides, it truncates the file name, which is a bug by itself (even if we accept the non-ASCII mangling):

/tmp/\320\260\320\270: empty
bor@tw:~> ls /tmp/аист | od -b
0000000 057 164 155 160 057 **320 260 320 270** 321 201 321 202 012
0000016
bor@tw:~> 

So half of the non-ASCII characters are lost. And they are lost even with the --raw option. file miscalculates the length of the output.

bor@tw:~> file -r /tmp/аист
/tmp/аи: empty
bor@tw:~> 

This works correctly in Leap 15.3. So is it a regression?

boven:/home/henk/test/unicode # l
total 12
drwxr-xr-x 3 henk wij 4096 May 24  2021 ./
drwxr-xr-x 8 henk wij 4096 May  9 15:50 ../
-rw-r--r-- 1 henk wij    0 May  5  2020 alles goed?
-rw-r--r-- 1 henk wij    0 Jan 11  2016 hello
drwxr-xr-x 2 henk wij 4096 Mar 31  2014 öäüßÖÄÜ/
-rw-r--r-- 1 henk wij    0 Feb 22  2016 Œé⁶
-rw-r--r-- 1 henk wij    0 Jan 11  2016 Χαίρετε
-rw-r--r-- 1 henk wij    0 Jan 11  2016 Здравствуйте
-rw-r--r-- 1 henk wij    0 Jun 20  2016 Лшадсщ
-rw-r--r-- 1 henk wij    0 Jan 11  2016 أهلا
-rw-r--r-- 1 henk wij    0 Jan 11  2016 नमस्ते
boven:/home/henk/test/unicode # file *
alles goed?:  empty
hello:        empty
öäüßÖÄÜ:      directory
Œé⁶:          empty
Χαίρετε:      empty
Здравствуйте: empty
Лшадсщ:       empty
أهلا:         empty
नमस्ते:         empty
boven:/home/henk/test/unicode # 

What version of ‘file’ does 15.3 have?

As @arvidjaar pointed out, this is a recent change in ‘file’ from May 28th.

And yes, I noticed the file name shortening, too. Apparently, with UTF-8 file names, it takes the number of characters and then truncates the string after that many bytes, not characters. Since UTF-8 encodes characters ≥ U+0080 with more than one byte, this cuts off the end of any file name that contains such characters.
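
The arithmetic can be checked with a few lines of Python. This is only a sketch of the suspected mix-up (character count used as a byte count), not a reproduction of file's code; it uses the /tmp/аист example from the earlier post.

```python
path = "/tmp/аист"
raw = path.encode("utf-8")

n_chars = len(path)  # 9 characters
n_bytes = len(raw)   # 13 bytes: 5 ASCII bytes + 4 Cyrillic letters * 2 bytes

# Suspected bug: the character count is applied to the byte string.
truncated = raw[:n_chars].decode("utf-8", errors="replace")

print(n_chars, n_bytes)  # 9 13
print(truncated)         # /tmp/аи
```

The result matches the shortened name that `file -r` printed above: the last two letters fall outside the first 9 bytes.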

And yes, the truncation is definitely an error (regression) in ‘file’. Dealing with the new output format of ‘file’ is something MC should handle. (They are also talking about using the -b option to drop the file name from the output altogether.)

Version 5.32-7.14.1

I think this should be reported as a bug.

Thank you all for your ideas and research. It seems like the package “file-5.42” is the culprit.

So I’ve downgraded it manually on my Tumbleweed machine, and everything is OK “for now”.


$ zypper search --details --match-exact file


S | Name | Type    | Version  | Arch   | Repository
--+------+---------+----------+--------+------------------------
i | file | package | 5.42-1.1 | x86_64 | openSUSE-Tumbleweed-Oss
v | file | package | 5.42-1.1 | i586   | openSUSE-Tumbleweed-Oss


$ wget https://download.opensuse.org/history/20220618/tumbleweed/repo/oss/x86_64/file-5.41-5.5.x86_64.rpm


$ sudo zypper install --oldpackage file-5.41-5.5.x86_64.rpm

Yeah, it was, I think. I was working with PDF, image and LibreOffice files; if any had a German umlaut in its name, MC just ignored it on Enter.

PS. I forgot to add how to lock the package against updates in my previous post, so here it is:


$ sudo zypper addlock file
$ zypper locks

After this gets sorted out, remove the lock:


$ sudo zypper removelock file
$ zypper locks

I have since learned that this all happened with files whose type was detected using the ‘file’ program, but not with files whose type was detected by their extension.
And I have only tried images (detected by ‘file’) and videos (detected by their extension)…

mc now has a bug open, and I have also filed a bug against ‘file’, because even if one uses raw mode for the file names, there is still an error that shortens the file name, which would also make correct detection impossible. Hopefully this gets sorted out soon.

File types are not detected “by extension”. File contents are always of some particular type (whether simply ASCII characters or random bytes), and people decided to let the names of files with the same content type end with the same suffix for ease of memorizing. Very often such a suffix consists of the . (dot) character with a few other characters behind it. This looks very much like (and is probably inspired by) the so-called extension of the MS-DOS file systems, but it is not the same. E.g. in MS-DOS the . (dot) is not part of the extension (or of the file name); it is just there to show where the one stops and the other begins. In MS-DOS it is also much more integrated into the operating system.

Unix/Linux itself has no concept of metadata of files describing contents.

Some application programs are like human users, they think that a certain suffix points to a certain type of contents. They may be right, but may be wrong also.

The ‘file’ tool tries to find out what the type of the contents is by using heuristics on the contents themselves, partly based on so-called “magic numbers” within the file. It has become pretty good at this task, and I would always trust it more than methods based on suffixes.
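
As a rough illustration of the “magic number” idea, here is a small Python sketch. It is not how ‘file’ is implemented (its magic database is far larger and the tests are more elaborate); the `MAGIC` table, `sniff` and `demo` are names made up for this example, and only three well-known signatures are checked.

```python
import os
import tempfile

# A few well-known leading byte signatures; purely illustrative.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
]


def sniff(path: str) -> str:
    """Guess the content type from the first bytes, ignoring the name."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    if not head:
        return "empty"
    for signature, label in MAGIC:
        if head.startswith(signature):
            return label
    return "unknown"


def demo() -> str:
    """Write PNG-looking bytes into a file with a misleading .mp4 suffix."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cover.mp4")
        with open(path, "wb") as fh:
            fh.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
        return sniff(path)


print(demo())  # PNG image
```

The suffix says “video”, the contents say “PNG”; a suffix-based check and a content-based check would disagree about this file.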

Really?

bor@bor-Latitude-E5450:~$ head -n 5 /usr/share/mime/globs
# This file was automatically generated by the
# update-mime-database command. DO NOT EDIT!
text/html:*.html
application/x-doom-wad:*.wad
application/x-cd-image:*.iso
bor@bor-Latitude-E5450:~$ 

https://specifications.freedesktop.org/shared-mime-info-spec/0.11/ar01s03.html#idm46395467815712

As I said above: for human beings and application programs. In this case, for the application suite called “desktop”, as you point to freedesktop.org.

MIME types are a bit different: they are defined for typing files on the Internet, independent of extensions and suffixes.

The table you show (part of) shows how the freedesktop community (or whatever it is called) sees the connection between MIME types on the Internet and suffixes that can be used as a substitute on local files, to make desktop programs “understand” what (hopefully) might be in the file that came from the Internet.

And in an Apache configuration, you will find it the other way around. There the web manager can define (and there are already suitable defaults) which suffix has to be sent off to a client with which MIME type.

All of this is independent of the operating system, and it is NOT something Unix/Linux bothers about.

Still, the current implementation of Midnight Commander detects images using the ‘file’ command (by analyzing the file), whereas if you give a file name the extension .mp4, it is happy and invokes mplayer (or whatever video player you have configured) without analyzing any further; it simply assumes it’s a video file.

From the bug report in the Midnight Commander bug tracker:
http://midnight-commander.org/ticket/4377#comment:9

> In default mc.ext videos are detected by filenames with “shell/regex” keys, not by “type”.
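
To make the difference between the two kinds of rules concrete, here is a simplified Python model. It is not mc's code: the `RULES` table, the patterns in it and `pick_handler` are hypothetical, and the way the “name: ” prefix is stripped from the ‘file’ output is an assumption about how such a caller might work.

```python
import re

# Hypothetical, simplified rules; not copied from the stock mc.ext.
RULES = [
    ("regex", r"\.mp4$", "video player"),           # matched against the name
    ("type", r"^(JPEG|PNG) image", "image viewer"),  # matched against 'file' output
]


def pick_handler(filename: str, file_output: str):
    """Return the first handler whose rule matches, or None."""
    # Assumption: the "name:" prefix is stripped only if it equals the real name.
    prefix = filename + ":"
    description = file_output
    if file_output.startswith(prefix):
        description = file_output[len(prefix):].lstrip()
    for kind, pattern, handler in RULES:
        if kind == "regex" and re.search(pattern, filename):
            return handler
        if kind == "type" and re.search(pattern, description):
            return handler
    return None


# Unmangled output: the type rule matches.
print(pick_handler("töst.jpg", "töst.jpg: JPEG image data"))
# Mangled name in the output: the prefix is not recognised, the type rule fails.
print(pick_handler("töst.jpg", "t\\303\\266st.jpg: JPEG image data"))
# Name-based rule: the 'file' output is never consulted.
print(pick_handler("töst.mp4", "t\\303\\266st.mp4: JPEG image data"))
```

Under these assumptions the three calls return the image viewer, nothing, and the video player, which is the behaviour described at the start of this thread.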

Do you have the slightest idea what freedesktop is and provides? They also have weird ideas about how to create desktop files, how to autostart programs, how to build menus and a lot more. Strange people …

> It is NOT something Unix/Linux bothers about.

It is something that is used by at least two major desktops on Unix/Linux (KDE and GNOME) to detect file types, not counting all the other applications that use the shared MIME specification.

I do not deny that, but it is still only a bunch of applications, not the Unix/Linux system. And for people who do not understand the difference between applications (including desktops that one may or may not have on a system) and the system, understanding Unix/Linux will stay difficult.