Dr. Strange-charset-conversion Or: How I Learned To Stop Worrying And Love Converting From Latin1 to UTF8

Recently, at work, I’d decided it was time to deal with my MySQL charset problems. In most places my server db was set to a latin1 charset. Unfortunately, someone who maintains content has been pasting unicode characters in. While my public websites display the information correctly, when I synced the content to my local development computer those unicode characters would display incorrectly. The image below shows the comparison.

Charset problem comparison

This was puzzling, and despite everything I tried I couldn’t get that same content to display correctly on my iMac. However, this was a localized problem on my iMac because my MacBook displayed everything correctly. I tried but I couldn’t figure out where the discrepancy was between the two machines so I decided to move on and fix it at the source.

There are tens of thousands of posts on the Internet explaining how to convert to UTF8 but nothing seemed to work until I discovered this post on the WordPress Codex. That post was the first one I could find that specifically addressed unicode content stored in latin1 columns and how to properly convert it.

To put it simply, you need to convert your text columns to their binary field type counterpart and then convert back to the text type but specifying the utf8 charset. Once I did this, the page displayed correctly on my iMac. Here’s what I did.

  • Backed up my database. While this wasn’t really needed since it’s just my local develoment machine, it’s still a good practice.
  • Opened up my terminal and ran mysql with the -s –skip-column-names flags that make it easier to copy and paste the output.
  • Ran the following script. NOTE: when doing this yourself, remember to replace “db_name” with the name of your own database.
USE information_schema;  
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', REPLACE(column_type, 'char', 'binary'), ';') FROM columns WHERE table_schema = 'db_name' and data_type = 'char';  
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', REPLACE(column_type, 'varchar', 'varbinary'), ';') FROM columns WHERE table_schema = 'db_name' and data_type = 'varchar';  
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', REPLACE(column_type, 'tinytext', 'tinybinary'), ';') FROM columns WHERE table_schema = 'db_name' and data_type = 'tinytext';  
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', REPLACE(column_type, 'text', 'blob'), ';') FROM columns WHERE table_schema = 'db_name' and data_type = 'text';  
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', REPLACE(column_type, 'mediumtext', 'mediumbinary'), ';') FROM columns WHERE table_schema = 'db_name' and data_type = 'mediumtext';  
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', REPLACE(column_type, 'longtext', 'longbinary'), ';') FROM columns WHERE table_schema = 'db_name' and data_type = 'longtext';  

SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'db_name' and data_type = 'char';  
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'db_name' and data_type = 'varchar';
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'db_name' and data_type = 'tinytext';
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'db_name' and data_type = 'text';
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'db_name' and data_type = 'mediumtext';
SELECT CONCAT('ALTER TABLE `', table_name, '` MODIFY `', column_name, '` ', column_type, ' CHARACTER SET utf8;') FROM columns WHERE table_schema = 'db_name' and data_type = 'longtext';

SELECT DISTINCT CONCAT('ALTER TABLE `', table_name, '` CONVERT TO CHARACTER SET utf8;') FROM columns WHERE table_schema = 'db_name';

The output, in my case, was like this…

ALTER TABLE `registrations` MODIFY `name` varbinary(50);
ALTER TABLE `registrations` MODIFY `name` varchar(50) CHARACTER SET utf8;
ALTER TABLE `registrations` CONVERT TO CHARACTER SET utf8;
ALTER DATABASE `db_name` CHARACTER SET utf8;

This is just a snippet of my code but you can see exactly what’s going on. First, the registrations.name column is being converted to varbinary(50). Binary columns don’t care about character sets. Then it’s being converted back to it’s original varchar(50) column but specifying “CHARACTER SET utf8”. Finally, we convert the table and database to utf8.

Now go forth and spread the P.O.E. … Purity Of Encodings. 🙂