
January 27, 2006

Character Encoding and XML Parsers

Procrastination has its benefits. It's a good thing, since that's one of my core skill sets.

I've received several comments and suggestions from many sources about how to make ILENN better and more compatible (thanks!), and despite the dearth of updates here, development has been ongoing. In the interest of being geeky, I thought I'd discuss some of what I've been exploring to make things better. Geeky tech warning.

One of the core issues is how to display multilingual feeds using the proper character sets. As an example, ILENN currently pulls English, Japanese, and Hungarian feeds, each with its own charset. However, it's danged difficult to display them all with the appropriate glyphs (characters). Surprisingly, the holdup has been Hungarian... the Japanese issue has (I think) been solved.
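To make the charset problem concrete, here is a minimal sketch in Python (not the PHP that ILENN runs on). The sample titles and the charsets assigned to them (ASCII, ISO-8859-2, Shift_JIS) are assumptions for illustration; the real feeds may well use different encodings. It decodes each title's bytes with its declared charset and re-encodes them as UTF-8, and it also shows what a reader would see if the bytes were decoded with the wrong charset.

```python
# Hypothetical sample feed titles; the charsets are assumed, not taken from ILENN.
SAMPLES = {
    "english": ("Hello, world", "ascii"),
    "hungarian": ("árvíztűrő tükörfúrógép", "iso-8859-2"),
    "japanese": ("こんにちは", "shift_jis"),
}


def to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Decode bytes using the feed's declared charset, then re-encode as UTF-8."""
    return raw.decode(declared_charset).encode("utf-8")


def misread(text: str, real_charset: str, wrong_charset: str) -> str:
    """Return what a reader sees when bytes are decoded with the wrong charset."""
    return text.encode(real_charset).decode(wrong_charset)


def normalize_samples() -> dict:
    """Round-trip every sample through its own charset into UTF-8 bytes."""
    return {
        lang: to_utf8(title.encode(charset), charset)
        for lang, (title, charset) in SAMPLES.items()
    }
```

Calling `misread("ő ű", "iso-8859-2", "iso-8859-1")` returns `"õ û"`: the Hungarian double-acute letters turn into a tilde and a circumflex when Latin-2 bytes are read as Latin-1. Whether that is the specific Hungarian problem ILENN ran into is a guess; the post doesn't say.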

This all comes back to how I'm retrieving the feeds. Currently, I'm using a PHP RSS parser called Magpie, which does a very good job of reading in lotsa different types of feeds. But it's also a bit limited in some ways. So I started looking for a different parser, and stumbled onto one called SimplePie, which is also PHP-based. It just appeared this month. SimplePie handles a bunch of flavors of RSS and Atom feeds, and it copes with multiple languages. It's also really straightforward to use, which is great. It's still in beta, and has a few kinks yet to be worked out, but it's getting there rapidly.

So the last few days have been spent combing the 'net for information on character encoding, Unicode, RSS parsers, and all sorts of other things to make life better for my non-English-speaking friends. I'm pretty confident that SimplePie will become the core ILENN parser, as soon as I do some more testing. I also need to make sure I'm storing the data in the DB correctly, and that it can be pulled back out and displayed in the appropriate character set.
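The storage check can be sketched as a simple round trip: write a few titles in different languages, read them back, and confirm nothing changed. This Python example uses an in-memory SQLite database purely as a stand-in, since the post doesn't say which database ILENN uses; the table and column names (`feed_items`, `lang`, `title`) are made up for illustration.

```python
import sqlite3


def store_and_fetch(items):
    """Insert (lang, title) pairs into a throwaway table and read them back."""
    conn = sqlite3.connect(":memory:")  # stand-in database, not ILENN's real one
    try:
        conn.execute("CREATE TABLE feed_items (lang TEXT, title TEXT)")
        # Parameter binding hands the text to the driver as Unicode strings.
        conn.executemany(
            "INSERT INTO feed_items (lang, title) VALUES (?, ?)", items
        )
        return conn.execute(
            "SELECT lang, title FROM feed_items ORDER BY rowid"
        ).fetchall()
    finally:
        conn.close()


def survives_round_trip(items) -> bool:
    """True if every title comes back from the database exactly as stored."""
    return store_and_fetch(items) == list(items)
```

With a real database server, the same test is worth running against the actual connection, because the connection's own charset setting can mangle text even when the table is defined correctly.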

So now the biggest challenge is seeing if I can display different languages together on one page. If not, then we'll need as many language-specific views as possible, each one setting the character encoding for its language. That means a Hungarian reader would click the Hungarian link and see Hungarian feeds using the proper character set, but other characters (like Japanese, Korean, Chinese, Hebrew, etc.) won't show up properly on that page.

But since a person reading something in one language won't necessarily be able to read feeds from other languages, it's probably not a big deal, and maybe I'm making more of this than it needs to be...
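For what it's worth, one common way to show several languages on the same page is to serve the whole page in a single Unicode encoding such as UTF-8, instead of one charset per language. The Python sketch below is only an illustration of that idea, not ILENN's code; the `render_page` function and its markup are hypothetical.

```python
from html import escape


def render_page(items):
    """Build one HTML page, encoded as UTF-8, from (lang, title) pairs."""
    rows = "\n".join(
        f'<li lang="{escape(lang)}">{escape(title)}</li>' for lang, title in items
    )
    page = (
        "<html><head>\n"
        # Declare the encoding so the browser doesn't have to guess.
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "</head><body><ul>\n"
        f"{rows}\n"
        "</ul></body></html>\n"
    )
    # Every title, whatever its language, ends up in the same byte encoding.
    return page.encode("utf-8")
```

Two caveats: the HTTP `Content-Type` header the server sends has to agree with the declaration in the page, since the header wins when they differ, and the reader still needs fonts installed that cover each script.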

Posted by Kelly McKiernan at January 27, 2006 09:46 AM
