mysql Charset Problem after Upgrade 11.1 - 11.2

Hi,

i did an upgrade of opensyse 11.1 to 11.2. When i upgraded last times from 10.1 to 10.2 aso. i never had any Problems with my data.

This time there is an hugh trouble of charset violations.
This exactly means, dumped or upgraded data from mysql 5.0 aren’t usable in mysql 5.1.36 and i still didn’ find out how to convert them to get them run.

The problem occours with some php Webapplications.
The only thing i got fixed was to use the old data if i set manually character-set-result to latin1 or explicit tell the application to use utf8.
I’m only able to solve the problem by setting manualy:
global character_set_results=latin1;

I fear, this is a general issue or any idea how i could get it permanent working?

That’s the solution until you are ready to migrate your data to utf-8.

Very funny.

Could you explain how to migrate the data from an old standard latin1 environment to utf8?

I tried it several times to export the data via mysqldump
for example:
mysqldump --skip-set-charset --default-character-set=UTF8

replaced settings like:
ENGINE=MyISAM DEFAULT CHARSET=latin1;
by
ENGINE=MyISAM DEFAULT CHARSET=utf8;

The result of all.
Maybe the old data is show correctly. But all new data don’t work.
Maybe you have an simple idea how to.
The only point i got my old data / Applications running was:

By this Configuration

character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
| collation_connection | utf8_unicode_ci |
| collation_database | utf8_unicode_ci |
| collation_server | utf8_unicode_ci

and and standard export from following sourcedb 5.0

character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci

The only way it works currently is:
If i add Data with above uft8 configuration.
For example by vBB with option default-character-set=‘latin1’ character_set_results = latin1 it works but not with utf8.
The Tables are all utf8 instead of latin1 before.

thx

No, quite serious.

You have to convert all your string data from latin-1 to utf-8. Because there are byte sequences in latin-1 that are not legal utf-8 sequences. Otherwise you have to make the declaration of the charset match the data for the time being in all operations, including export and import.

It could be that when you imported the data, you had utf-8 in effect so there were lots of “illegal chars”. Normally the dump should contain a declaration of the charset for each table but maybe that was wrong or missed out.

So many things could have gone wrong, it’s hard to be very definite.

Ok, here an small example of my Data:

DROP TABLE IF EXISTS vbadminhelp;
SET @saved_cs_client = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE vbadminhelp (
adminhelpid int(10) unsigned NOT NULL auto_increment,
script varchar(50) NOT NULL default ‘’,
action varchar(25) NOT NULL default ‘’,
optionname varchar(100) NOT NULL default ‘’,
displayorder smallint(5) unsigned NOT NULL default ‘1’,
volatile smallint(5) unsigned NOT NULL default ‘0’,
product varchar(25) NOT NULL default ‘’,
PRIMARY KEY (adminhelpid),
UNIQUE KEY phraseunique (script,action,optionname)
) ENGINE=MyISAM AUTO_INCREMENT=27637 DEFAULT CHARSET=utf8;
SET character_set_client = @saved_cs_client;

And now the import:

INSERT INTO vbadminhelp VALUES (27442,‘usergroup’,‘add,edit’,‘canusecustomtitle’,200,1,‘vbulletin’),(27441,‘usergroup’,‘add,edit’,‘canjoingroups’,200,1,‘vbulletin’),(27440,‘usergroup’,‘add,edit’,‘canseeraters’,190,1,‘vbulletin’),(27439,‘usergroup’,‘add,edit’,‘showeditedby’,180,1,‘vbulletin’),(27438,‘usergroup’,‘add,edit’,‘caninvisible’,170,1,‘vbulletin’),(27437,‘usergroup’,‘add,edit’,‘canmodifyprofile’,160,1,‘vbulletin’)

DROP TABLE IF EXISTS vbword;SET @saved_cs_client = @@character_set_client;SET character_set_client = utf8;CREATE TABLE vbword ( wordid int(10) unsigned NOT NULL auto_increment, title char(50) NOT NULL default ‘’, PRIMARY KEY (wordid), UNIQUE KEY title (title)) ENGINE=MyISAM AUTO_INCREMENT=16229 DEFAULT CHARSET=utf8;SET character_set_client = @saved_cs_client;---- Dumping data for table vbword–LOCK TABLES vbword WRITE;/*!40000 ALTER TABLE vbword DISABLE KEYS */;INSERT INTO vbword VALUES (1,‘aprilscherz’),(2,‘heise’),(3,‘thema’),(4,‘prozessorkühlung’),(5,‘übertragen’),(6,‘wärmestrahlung’),(7,‘elektromagnetische’),(8,‘strahlung’)

[client]
#password = your_password
port = 3306
socket = /var/run/mysql/mysql.sock
default_character_set = utf8

And this is, what phpMyAdmin tells me about the field title:
char(50) utf8_general_ci

At the old System it was:
char(50) latin1_swedish_ci

On both Systems: LC_CTYPE=de_DE.UTF-8

I should be able to import this data by mysql client correctly or?

There should be some comments in front of each table declaring the charset. Check those. Perhaps you should not try to force the charset while exporting and let mysqldump do its thing. You have to realise that forcing the charset doesn’t do any byte conversions on the data. You may claim that the table is utf-8 but it will still contain latin-1 sequences.


– Table structure for table vbadminhelp

DROP TABLE IF EXISTS vbadminhelp;
SET @saved_cs_client = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE vbadminhelp (
adminhelpid int(10) unsigned NOT NULL auto_increment,
script varchar(50) NOT NULL default ‘’,
action varchar(25) NOT NULL default ‘’,
optionname varchar(100) NOT NULL default ‘’,
displayorder smallint(5) unsigned NOT NULL default ‘1’,
volatile smallint(5) unsigned NOT NULL default ‘0’,
product varchar(25) NOT NULL default ‘’,
PRIMARY KEY (adminhelpid),
UNIQUE KEY phraseunique (script,action,optionname)
) ENGINE=MyISAM AUTO_INCREMENT=27637 DEFAULT CHARSET=utf8;
SET character_set_client = @saved_cs_client;


– Dumping data for table vbadminhelp

Thats all.

i also tried ut8 encoding by perl.
The result was an nightmare.

You have a comment in front of the table claiming that the charset is utf-8, but in fact the chars in the strings has not undergone any conversion. So I think you should not specify any charset when dumping.

As I explained, the charset attached to a table is only metadata. It doesn’t ensure that the data is in that charset. You can put anything you like into a varchar or text type, it’s just a sequence of octets. But if your application relies on the metadata to specify the charset of the HTML page, then it had better match the data.

yk’s stuff: MySQL: CHARSET from latin1 to utf8

How about this example and voncerting the data / sqldump by iconv?

ok, last but not least.
If i install the application new. With all the utf8 Settings.
The default data of it’s application specially for öäü works correctly.

And as far as i know, for my apps better than or this example, we don’t use characters which aren’t utf8 konform.
But the same Problem with all the apps.

My own apps, just uses the defaults.

Yes, you have to convert the data separately. If you search you will find lots of tips for checking and converting the columns. Only after you have converted the data can you legitimately set the charset of the table to utf-8. That assumes that the app actually reads this metadata.

I have an app that always declares every HTML page to be utf-8 because I know that is how the data is stored, even though the mysql version is 5.0 and the default charset latin-1. And even if the columns and tables are declared wrong, it doesn’t matter, except for collation, which I don’t use in this app anyway.

ok, but the exported textfile / dump ist utf-8.

Conversion does not work:
iconv -f utf-8 -t latin1 <./forum_zz.sql | mysql -u root
iconv: illegal input sequence at position 788253
ERROR 1064 (42000) at line 780: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘’’ at line 1

How ever, i’m not able to translate every single character for any table manually.

The question is also why the defaults was changed with opensuse 11.2.

No, the dump is not utf-8. It only says utf-8 because you forced mysqldump to say that and it’s a lie. That’s what I’ve been trying to tell you for the last three posts. The data is still latin-1, the data is just a sequence of octets and that’s what mysqldump writes out. And that kind of conversion with iconv might not work anyway, because the data might be written out with escaped ASCII.

The default changed because mysql 5.1 uses utf-8 as the default charset now; it’s an international world.

A simple question at last:
You explained me, that the strings needn’t be utf-8 even i declared the table as utf-8.

What would happen if i do an update by mysql client on, for example vbword like:

update vbword set title =‘öäü’;
Would the app show it correct or would i get something like ‘???’ ?

with: global character-set-rsult=utf8;

How else how i would be able to convert the data in an easy way from latin1 to utf8?

There was a link to a oreilly article in the page you referenced, that’s a pretty good description. Whether it’s easy or not for you I can’t say.

You could also continue to work in latin-1. Just don’t force mysqldump to use utf-8 (I assume you are dumping on the old machine where it’s still latin-1) and it should dump it with latin-1 and this will be imported as latin-1 on the new mysql. Provided your web page also says that the output is in latin-1 (iso-8859 actually), it should be displayed correctly. Though not in a terminal if the terminal is displaying in utf-8.

That actually depends on the charset of the terminal session you are issuing the command from. If utf-8 then you will end up with a cell that is utf-8 while the rest of the data is latin-1.

Sorry, that was exactly the beginning of the trouble.

I did an dump on the old Environment. Configured the new environment with standard charset latin one.
Global Variables have been the same like on the old environment.
But the data in the new db wasn’t usable.

Maybe i should tell you, that I’m DBA and Charset conversion is one of my daily jobs.
A very interesting thing i found out.
A new table created by my own fingers in mysql console was created as utf8.
I inserted manualy some data.

Any php applications was only able to read it as latin1.
Who made latin1 of it?
What environment setting is wrong?
On the same PC on Oracle / sqlplus i do the same with an utf8 Oracle DB.

I get utf8 Data out.
Please explain me this point.

Ok, this would be an reason.
How i can find out what charset is set in my console?
On my Mac i definied utf-8 enconding.

On my Linux Host i get following env out:

LESSKEY=/etc/lesskey.bin
NNTPSERVER=news
MANPATH=/usr/share/man:/usr/local/man
HOSTNAME=r-sux
XKEYSYMDB=/usr/share/X11/XKeysymDB
HOST=r-sux
TERM=xterm-color
SHELL=/bin/bash
PROFILEREAD=true
HISTSIZE=1000
MORE=-sl
SSH_TTY=/dev/pts/2
USER=root
LS_COLORS=
XNLSPATH=/usr/share/X11/nls
ENV=/etc/bash.bashrc
HOSTTYPE=i386
FROM_HEADER=
PAGER=less
CSHEDIT=emacs
XDG_CONFIG_DIRS=/etc/xdg
MINICOM=-c on
MAIL=/var/mail/root
PATH=/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/sbin
CPU=i686
INPUTRC=/etc/inputrc
PWD=/install
LANG=POSIX
PYTHONSTARTUP=/etc/pythonstart
QT_SYSTEM_DIR=/usr/share/desktop-data
SHLVL=1
HOME=/root
LESS_ADVANCED_PREPROCESSOR=no
OSTYPE=linux
LS_OPTIONS=-A -N --color=none -T 0
XCURSOR_THEME=
LESS=-M -I
MACHTYPE=i686-suse-linux
LOGNAME=root
XDG_DATA_DIRS=/usr/share
LC_CTYPE=de_DE.UTF-8
LESSOPEN=lessopen.sh %s
LESSCLOSE=lessclose.sh %s %s
G_BROKEN_FILENAMES=1
COLORTERM=1
OLDPWD=/root
_=/usr/bin/env

I don’t know what terminal emulator you are using but for konsole, it’s under Settings > Edit current profile > Advanced > Encoding.