Friday, July 24, 2009

ISO-8859-1 to UTF-8 file conversion...

In an effort to get all the educational materials ready for the classes in El Salvador, I found myself battling character encoding problems on two fronts today. Fortunately, both problems turned out to be easy to solve, and both had the same cause: having ISO-8859-1 when what I wanted was UTF-8.

The first problem was that the Spanish translation of materials generated by Sphinx did not render properly on the Open Book Project site, even though the exact same materials rendered fine when I viewed them on another website.

I filed an ibiblio trouble ticket, and within an hour Donald Sizemore fixed the problem:
ibiblio's DefaultCharset is still iso-8859-1 - we tried to change it to UTF-8 during our last major upgrade, but it broke too much of our older content.

I've changed the DefaultCharset setting for openbookproject.net to UTF-8 and the page looks good to me. Check behind me? You may have to hold down Control- or Shift- and click Refresh in your browser to pick up the changes.
Cómo Pensar como un Informático now renders characters just fine on the OBP site.

A bit later in the day I began helping Gregorio Inda with his translation of the GASP Python Course. When I ran sphinx-build on the Launchpad checkout of his latest work, there were over 150 warnings from one file, one for each accented character and inverted question mark in the file.

I couldn't figure out what was going on, since sheet 1 looked the same as sheet 2 to me, yet sheet 1 didn't produce any warnings. A colleague suggested comparing the two files using the Unix file command, and sure enough that revealed the problem:

$ file *
1-intro.rst: UTF-8 Unicode Java program text
2-tablas.rst: ISO-8859 Java program text
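
The same check can be roughly approximated in Python for anyone without the file command handy. This is only a sketch: the helper names is_utf8 and report_encodings are made up for illustration, and it only tells you whether a file decodes cleanly as UTF-8. It cannot say which encoding a failing file actually uses, and a pure ASCII file will also pass.

```python
from pathlib import Path


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def report_encodings(directory: str, pattern: str = "*.rst") -> dict:
    """Map each matching file name to 'UTF-8' or 'not UTF-8'.

    Assumption: the files are small enough to read into memory at once.
    """
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        results[path.name] = "UTF-8" if is_utf8(path.read_bytes()) else "not UTF-8"
    return results


if __name__ == "__main__":
    # Check every .rst file in the current directory.
    for name, verdict in report_encodings(".").items():
        print(f"{name}: {verdict}")
```
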

ISO-8859 is haunting me today! Fortunately, this web page told me what I needed to do to fix the problem:

$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 2-tablas.rst > temp && mv temp 2-tablas.rst

Problem solved.
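
For the record, here is a rough Python equivalent of that iconv one-liner, in case iconv isn't available. It is a sketch rather than a drop-in replacement: the function name convert_to_utf8 is my own invention, and it assumes the file really is ISO-8859-1. Like the iconv command, running it on a file that is already UTF-8 will mangle the accented characters.

```python
import os
import tempfile


def convert_to_utf8(path: str, source_encoding: str = "iso-8859-1") -> None:
    """Rewrite the file at `path` as UTF-8 (hypothetical helper).

    Assumption: the file is currently encoded in `source_encoding`.
    """
    with open(path, "rb") as f:
        text = f.read().decode(source_encoding)

    # Write to a temporary file in the same directory, then move it over the
    # original, mirroring the "> temp && mv temp ..." step of the shell version.
    # Note: the replacement file gets the temp file's default permissions.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(text.encode("utf-8"))
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if writing or replacing failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    convert_to_utf8("2-tablas.rst")
```
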
