Problem with German umlaut

Hello!

I set up a new file server with openSUSE 11.1 and copied the files via a hard disk from the old one (SUSE 9.1).

Now the umlauts in the copied file names are not displayed correctly: not in KDE (Dolphin), not in Midnight Commander and, most importantly, not in Windows XP via Samba.

The old server used de_DE@euro.

First I used the configuration from the original installation (UTF-8).
Then I changed the language in sysconfig to de_DE@euro, as used on the old server (checked with locale). Effect: a different wrong character is displayed in mc, KDE and Samba.

Is it possible to find out the encoding of the old
file names? Perhaps they are not de_DE@euro.

What else can I do?

Thanks for any help

hbschulte

I do not have the ultimate answer to your problem, but as nobody has answered you so far, I will offer a few words.

There is a big difference between locale and encoding. I think your problem lies in the encoding. Linux nowadays uses UTF-8 to encode Unicode characters. It is possible that your ‘old’ file names are encoded in Latin-1 (either because the system they were created on used Latin-1 by default, or because the locale in use implied Latin-1).

In short, this means that your Ä and friends are in the 0080 - 00FF range in both Latin-1 and Unicode, and are encoded with 1 byte in Latin-1 and with 2 bytes in UTF-8. This causes problems when Latin-1 encoded text is interpreted as UTF-8.
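To make the byte counts concrete, here is a small Python sketch (Python is my choice for illustration, it is not something the original poster mentioned). It only shows how one character is stored in each encoding and what happens when the bytes are read with the wrong one.

```python
# Illustration only: the same character stored in two encodings.
ch = "\u00c4"  # the letter A with diaeresis, code point U+00C4

latin1_bytes = ch.encode("latin-1")  # one byte:  C4
utf8_bytes = ch.encode("utf-8")      # two bytes: C3 84

print(f"U+{ord(ch):04X}", latin1_bytes.hex(), utf8_bytes.hex())

# A lone C4 byte is an incomplete UTF-8 sequence, so a UTF-8 system
# cannot turn it back into a character and shows a placeholder instead.
try:
    latin1_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

# The reverse mistake: UTF-8 bytes read as Latin-1 yield two wrong characters.
misread = utf8_bytes.decode("latin-1")
```

This would also explain why switching the locale only changed which wrong character appears: the bytes on disk stay the same, only their interpretation changes.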

I do know at least one method of converting a Latin-1 encoded text file into a UTF-8 encoded one (using vi), but not how to change the interpretation of the file names on a filesystem (I hope you did not mix both types of file names on the same filesystem).

> What else can I do?

i can tell you what i do, for file names i never use those Danish
letters which do not appear in english…and, i never create a file
name with a space either…

so what might have been æ ø å becomes ae oe aa…

ok, so i KNOW it shouldn’t be that way…but, for now it is…you can
keep using special characters, or you can make it EASY on yourself by
avoiding them UNTIL Linux, Windows, Mac and all the rest decide to
work and play nicely together…if you can live that long.

ymmv


somebody_else

Once you have worked out a way of converting things to utf-8 you shouldn’t have a problem; I use a variety of accented characters in folder and file names which I only ever intend to use in Linux.

The only program I have come across which cannot handle spaces in path names is LaTeX (and therefore LyX); so I always use underscores instead of spaces in folder and file names. But I have never had a problem with file names containing spaces for files which other people have sent me or I have downloaded.

And may I add an example to what john_hudson says above. I am using 10.3 and KDE 3.5. I just used Konqueror (as file manager) to make a new directory with a name in Devanagari characters. It shows up perfectly in Konqueror. In a terminal:

henk@boven:~> ls -l
totaal 276
drwx------  2 henk wij   4096 mei 13 12:27 हिनढी
drwxr-xr-x  3 henk wij   4096 apr 10 09:37 Afbeeldingen
drwxr-xr-x  2 henk wij   4096 mrt 19 12:03 bin
drwx------  2 henk wij   4096 mrt 19 12:10 Desktop
drwxr-xr-x 32 henk wij   4096 mei  4 18:17 Documents
 ...

It also looks correct to me when I cut and paste it into this post. Not all of you may see the correct characters, however, because you may not have a font with Devanagari installed. So take my word for it that UTF-8 encoded Unicode is fully implemented in present-day Linux.

And my language:

henk@boven:~> echo $LANG
nl_NL.UTF-8
henk@boven:~>   

Some applications may be broken, however (mostly on white space). And I cannot say anything about Microsoft OSes because I have no knowledge of or experience with them.

But back to the OP. He has a conversion problem. Can anybody help him?

I think the problem is this:

You created the filenames while using Latin-1 as the charset. However now you are using an environment where the charset is UTF-8.

The actual bytes used to represent the name have not changed. However Latin-1 sequences will not display properly in UTF-8.
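To see which names are affected, one could first list every name that is not valid UTF-8. The following Python sketch is only an illustration; the function names are made up for this post. It works on raw bytes so that it never depends on the current locale. Note that it only tells you a name is not UTF-8; it cannot prove which legacy encoding was used.

```python
import os


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file name bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_names(root: bytes):
    """Yield the full path (as bytes) of every entry below root whose own
    name is not valid UTF-8. Passing root as bytes makes os.walk return
    bytes, so no decoding happens behind our back."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Something like `for p in find_non_utf8_names(b"/srv/data"): print(p)` would then print the suspicious paths (the directory here is just a placeholder). Plain ASCII names always pass the check, because ASCII is a subset of both encodings.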

What you need to do is write a program (shell script, Perl script, whatever) that will do the equivalent of this:

mv name-with-latin-1-characters name-with-utf8-equivalents

The system call “rename” has no problems dealing with arbitrary bytes in names, except for / and NUL of course. The interesting bit is going from the Latin-1 sequences to the equivalent UTF-8 sequences. If you only used a subset of the accented characters, it may be sufficient to work out the byte sequences for ä, ë, ö and ü in both encodings for your program.
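As a rough sketch of such a program, here is one way it might look in Python rather than shell or Perl. Everything in it is an assumption for illustration: the function names are invented, and the source encoding defaults to Latin-1. If the old names turn out to use a different charset (de_DE@euro normally implies ISO-8859-15, which has the same byte values for ä, ö, ü and ß as Latin-1), pass that instead. It defaults to a dry run that only prints what it would do.

```python
import os


def to_utf8_name(name: bytes, source_encoding: str = "latin-1") -> bytes:
    """Return the UTF-8 form of a raw file name.

    Names that are already valid UTF-8 (including plain ASCII) are
    returned unchanged, so running the conversion twice is harmless.
    """
    try:
        name.decode("utf-8")
        return name
    except UnicodeDecodeError:
        return name.decode(source_encoding).encode("utf-8")


def convert_tree(root: bytes, source_encoding: str = "latin-1",
                 dry_run: bool = True):
    """Rename every non-UTF-8 entry below root; return (old, new) pairs."""
    renamed = []
    # Walk bottom-up so entries are renamed before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new_name = to_utf8_name(name, source_encoding)
            if new_name == name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new_name)
            if os.path.lexists(new_path):
                print("skipping, target already exists:", new_path)
                continue
            print("rename", old_path, "->", new_path)
            if not dry_run:
                os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

I would try it on a copy first, for example `convert_tree(b"/path/to/copy")`, check the printed list, and only then call it again with `dry_run=False`. One known weakness: a Latin-1 name whose bytes happen to form valid UTF-8 would be left alone, although that is unlikely for ordinary German names.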

> But back to the OP. He has a conversion problem. Can anybody help him?

you are correct, and talking about how wonderfully Linux can handle
whatever language fonts makes no difference…the OP wants to move
files from one system to another, which is where the problems come…

i AVOID all those problems by using english letters and avoiding spaces…

problem solved…no scripts needed.

if you’d rather spend time building scripts, go ahead.


somebody_else

I do not quite understand how this helps the OP. You simply repeat what you said earlier: he should not have used those characters in the past. And that is your solution?

He has them now. Any positive help please.

Hello hbschulte,

Are you still with us? I can try to help write a script. I do know how to find the codes in Latin-1 and UTF-8. But if you say: I now understand the problem and can do the conversion myself, that is all right and I will not dig deeper into this. So please let us hear something from you.

> He has them now. Any positive help please.

easy: remove the non-english characters from those file names and the
problem is solved.


somebody_else

On Wed, 13 May 2009 11:06:01 GMT, ken yap
<ken_yap@no-mx.forums.opensuse.org> wrote:

>
>I think the problem is this:
>
>You created the filenames while using Latin-1 as the charset. However
>now you are using an environment where the charset is UTF-8.
>
>The actual bytes used to represent the name have not changed. However
>Latin-1 sequences will not display properly in UTF-8.
>
>What you need to do is write a program (shell script, Perl script,
>whatever) that will do the equivalent of this:
>
>mv name-with-latin-1-characters name-with-utf8-equivalents
>
>The system call “rename” has no problems dealing with arbitrary bytes
>in names, except for / and NUL of course. The interesting bit is going
>from the Latin-1 sequences to the equivalent UTF-8 sequences. If you
>only used a subset of the accented characters, it may be sufficient to
>work out the byte sequences for ä, ë, ö and ü in both encodings for your
>program.

There are means to designate a character set for each mount, though it can be a
bit time consuming. It can be really problematic where multiple
character sets have actually been used.