Php – how to detect and fix character encoding in a thesql database via php

character encodingMySQLPHPspecial characters

I have received this database full of people names and data in French, which means, using characters such as é,è,ö,û, etc. Around 3000 entries.

Apparently, the data inside has been encoded sometimes using utf8_encode(), and sometimes not. This result in a messed up output: at some places the characters show up fine, at others they don't.

At first i tried to track down every place in the UI where those issues arise and use utf8_decode() where necessary, but it's really not a practicable solution.

I did some testing and there is no reason to use utf8_encode in the first place, so i'd rather remove all that and just work in UTF8 everywhere – at the browser, middleware and database levels. So i need to clean the database, converting all misencoded data by its cleaned up version.

Question : would it be possible to create a function in php that would check if a utf8 string is correctly encoded (without utf8_encode) or not (with utf8_encode), and, if it was, convert it back to its original state?

In other terms: i would like to know how i could detect utf8 content that has been utf8_encode() to utf8 content that has not been utf8_encode()d.

**UPDATE: EXAMPLE **

Here is a good example: you take a string full of special chars and take a copy of that string and utf8_encode() it. The function i'm dreaming of takes both strings, leaves the first one untouched and the second string is now the same as string one.

I tried this:

$loc_fr = setlocale(LC_ALL, 'fr_BE.UTF8','fr_BE@euro', 'fr_BE', 'fr', 'fra', 'fr_FR');
$str1= "éèöûêïà ";
$str2 = utf8_encode($str1);

function convert_charset($str) {
    $charset=  mb_detect_encoding($str);
    if( $charset=="UTF-8" ) {
        return utf8_decode($str);
    }
    else {
        return $str;
    }
}
function correctString($str) {
    echo "\nbefore: $str";
    $str= convert_charset($str);
    echo "\nafter: $str"; 
}

correctString($str1);
echo('<hr/>'."\n");
correctString($str2);

And that gives me:

before: éèöûêïà after: ������� 
before: Ã©Ã¨Ã¶Ã»ÃªÃ¯Ã  after: éèöûêïà

Thanks,

Alex

Best Answer

It's not completely clear from the question what character-encoding lens you're currently looking through (this depends on the defaults of your text editor, browser headers, database configuration, etc), and what character-encoding transformations the data has gone through. It may be that, for example, by tweaking a database configuration everything will be corrected, and that's a lot better than making piecemeal changes to data.

It looks like it might be a problem of utf8 double-encoding, and if that's the case, both the original and the corrupted data will be in utf8, so encoding detection won't give you the information you need. The approach in that case requires making assumptions about what characters can reasonably turn up in your data: as far as PHP and Mysql are concerned "Ã©" is perfectly legal utf8, so you have to make a judgement based on what you know about the data and its authors that it must be corrupted. These are risky assumptions to make if you're just a technician. Luckily, if you know the data is in French and there's only 3000 records, it's probably ok to make those kinds of assumptions.

Below is a script that you can adapt first of all to check your data, then to correct it, and finally to check it again. All it's doing is processing a string as utf8, breaking it into characters, and comparing the characters against a whitelist of expected French characters. It signals a problem if the string is either not in utf8 or contains characters that aren't normally expected in French, for example:

PROBABLY OK     Côte d'Azur
HAS NON-WHITELISTED CHAR        CÃ´te d'Azur    195,180 Ã´
NON-UTF8        C�e d'Azur

Here's the script, you'll need to download the dependent unicode functions from http://hsivonen.iki.fi/php-utf8/

<?php

// Download from http://hsivonen.iki.fi/php-utf8/
require "php-utf8/utf8.inc";

$my_french_whitelist = array_merge(
  range(0,127), // throw in all the lower ASCII chars
  array(
    0xE8, // small e-grave
    0xE9, // small e-acute
    0xF4, // small o-circumflex
    //... Will need to add other accented chars,
    // Euro sign, and whatever other chars
    // are normally expected in the data.
  )
);

// NB, whether this string literal is in utf8
// depends on the encoding of the text editor
// used to write the code
$str1 = "Côte d'Azur";
$test_data = array(
  $str1,
  utf8_encode($str1),
  utf8_decode($str1),
);

foreach($test_data as $str){
  $questionable_chars = non_whitelisted(
    $my_french_whitelist,
    $str
  );
  if($questionable_chars===true){
    p("NON-UTF8", $str);
  }else if ($questionable_chars){
    p(
      "HAS NON-WHITELISTED CHAR",
      $str,
      implode(",", $questionable_chars),
      unicodeToUtf8($questionable_chars)
    );
  }else{
    p("PROBABLY OK", $str);
  }
}

function non_whitelisted($whitelist, $utf8_str){
  $codepoints = utf8ToUnicode($utf8_str);
  if($codepoints===false){ // has non-utf8 char
    return true;
  }
  return array_diff(
    array_unique($codepoints),
    $whitelist
  );
}


function p(){
  $args = func_get_args();
  echo implode("\t", $args), "\n";
}

Connecting to MYSQL with Python 2 in three steps

1 - Setting

You must install a MySQL driver before doing anything. Unlike PHP, Only the SQLite driver is installed by default with Python. The most used package to do so is MySQLdb but it's hard to install it using easy_install. Please note MySQLdb only supports Python 2.

For Windows user, you can get an exe of MySQLdb.

For Linux, this is a casual package (python-mysqldb). (You can use sudo apt-get install python-mysqldb (for debian based distros), yum install MySQL-python (for rpm-based), or dnf install python-mysql (for modern fedora distro) in command line to download.)

For Mac, you can install MySQLdb using Macport.

2 - Usage

After installing, Reboot. This is not mandatory, But it will prevent me from answering 3 or 4 other questions in this post if something goes wrong. So please reboot.

Then it is just like using any other package :

#!/usr/bin/python
import MySQLdb

db = MySQLdb.connect(host="localhost",    # your host, usually localhost
                     user="john",         # your username
                     passwd="megajonhy",  # your password
                     db="jonhydb")        # name of the data base

# you must create a Cursor object. It will let
#  you execute all the queries you need
cur = db.cursor()

# Use all the SQL you like
cur.execute("SELECT * FROM YOUR_TABLE_NAME")

# print all the first cell of all the rows
for row in cur.fetchall():
    print row[0]

db.close()

Of course, there are thousand of possibilities and options; this is a very basic example. You will have to look at the documentation. A good starting point.

3 - More advanced usage

Once you know how it works, You may want to use an ORM to avoid writing SQL manually and manipulate your tables as they were Python objects. The most famous ORM in the Python community is SQLAlchemy.

I strongly advise you to use it: your life is going to be much easier.

I recently discovered another jewel in the Python world: peewee. It's a very lite ORM, really easy and fast to setup then use. It makes my day for small projects or stand alone apps, Where using big tools like SQLAlchemy or Django is overkill :

import peewee
from peewee import *

db = MySQLDatabase('jonhydb', user='john', passwd='megajonhy')

class Book(peewee.Model):
    author = peewee.CharField()
    title = peewee.TextField()

    class Meta:
        database = db

Book.create_table()
book = Book(author="me", title='Peewee is cool')
book.save()
for book in Book.filter(author="me"):
    print book.title

This example works out of the box. Nothing other than having peewee (pip install peewee) is required.

Best Answer

Related Solutions

Mysql – How to quickly rename a MySQL database (change schema name)

Python – How to connect to a MySQL Database in Python

Connecting to MYSQL with Python 2 in three steps

Related Topic