xiven.com stating the blatantly obvious since 2002

PHP: Invalid UTF-8 characters in XML, revisited

Back in 2008, I wrote a blog post with a function to clean up UTF-8 characters in PHP that were not valid in XML. Some lines of that function no longer work in newer PHP versions due to my use of a blacklist rather than a whitelist, and PHP still doesn't seem to have a proper built-in function for this.

Various proposed solutions to this problem can be found on the net (1, 2, 3, 4), but none of those I found actually do it right (some are in fact quite badly wrong).

Here's one that I believe should handle all cases correctly, and it's a fair bit cleaner than my original one.


<?php
function sanitize_for_xml($v) {
  // Strip invalid UTF-8 byte sequences by round-tripping through UTF-16.
  // This part may not be strictly necessary and could be split out into another function.
  $v = mb_convert_encoding(mb_convert_encoding($v, 'UTF-16', 'UTF-8'), 'UTF-8', 'UTF-16');

  // Replace every character outside the XML 1.0 "Char" production (a whitelist of allowed ranges)
  $v = preg_replace('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', '�', $v);

  return $v;
}
?>
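
For anyone who only wants to detect a problem rather than fix it, here's a minimal sketch of a check built on the same whitelist. Note that is_valid_for_xml is a name I've made up for illustration; it isn't a PHP built-in. It leans on the fact that preg_match() with the /u modifier returns false when the subject isn't valid UTF-8.

```php
<?php
// Hypothetical helper (illustration only): true if $v is valid UTF-8 and
// contains nothing outside the whitelist used by sanitize_for_xml().
function is_valid_for_xml($v) {
  // preg_match() returns 1 if a disallowed character is found, 0 if none is,
  // and false if $v is not valid UTF-8 (a consequence of the /u modifier).
  return preg_match('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', $v) === 0;
}

var_dump(is_valid_for_xml("plain text\n"));      // bool(true): newline is allowed
var_dump(is_valid_for_xml("bell\x07character")); // bool(false): U+0007 is not allowed
var_dump(is_valid_for_xml("broken \xC3 byte"));  // bool(false): invalid UTF-8 sequence
```

The strict comparison with 0 matters here: a loose comparison would treat the false returned for broken UTF-8 as "no match" and report the string as fine.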

Posted: 2013-08-30 15:37:40 UTC by Xiven

It's About Time

Our awesome sysadmin team did some serious overtime over the weekend, thanks to a fun little leap second bug. It took down a scary number of servers, though fortunately our most important external public services escaped largely unscathed (mostly thanks to a high level of redundancy). I too lost a server to this bug and had to spend a little while dealing with the fallout.

Things like this serve as an important reminder of the sometimes startling effects of invalid assumptions when applied to computers, e.g. the assumption that there are always 60 seconds in a minute (though in this particular case the actual kernel bug was far more complicated than that).
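
This isn't the kernel bug itself, but as a small illustration of the 60-second assumption: PHP's timestamp functions follow POSIX time, which has no slot for a leap second. A rough sketch, using the leap second inserted at the end of June 2012:

```php
<?php
// Ask for 23:59:60 UTC on 30 June 2012 (a real leap second) and for the
// midnight that follows it. gmmktime() normalises out-of-range seconds,
// so the leap second simply rolls over into the next minute.
$leap = gmmktime(23, 59, 60, 6, 30, 2012);
$next = gmmktime(0, 0, 0, 7, 1, 2012);

var_dump($leap === $next);               // bool(true)
echo gmdate('Y-m-d H:i:s', $leap), "\n"; // 2012-07-01 00:00:00
```

Two distinct moments in UTC end up with the same timestamp, which is exactly the kind of quiet assumption that can bite later.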

Addendum: Bron Gondwana of our FastMail team has now published an excellent write-up of the leap-second incident.

Posted: 2012-07-02 01:16:50 UTC by Xiven

H₂SO₄

A monumental achievement in web browser technology: Opera Software has now publicly released a development build that finally passes the Acid test!

No, not this Acid test.

Not this one either.

This one!

Posted: 2012-02-28 17:19:10 UTC by Xiven

Pingback back

On a whim, decided to fix the pingback implementation on this weblog (made it use PHP's XML-RPC module instead of 3rd-party classes). Incidentally, the source code of this website makes me want to cry. Who wrote this junk?…

Posted: 2012-01-08 14:28:33 UTC by Xiven

Wire-free

Wow, free (working) wi-fi on the Oxford - Heathrow bus. Amazing.

Posted: 2011-12-27 15:46:56 UTC by Xiven