xiven.com stating the blatantly obvious since 2002

PHP: Invalid UTF-8 characters in XML, revisited

Back in 2008, I wrote a blog post with a PHP function to clean up UTF-8 characters that are not valid in XML. Some lines of that function no longer work in newer PHP versions because I used a blacklist rather than a whitelist, and PHP still doesn't seem to have a proper built-in function for this.

Various proposed solutions to this problem can be found on the net (1, 2, 3, 4), but none of those I found actually do it right (some are in fact quite badly wrong).

Here's one that I believe should handle all cases correctly, and it's a fair bit cleaner than my original one.


<?php
function sanitize_for_xml($v) {
  // Clean up invalid UTF-8 byte sequences first (mbstring swaps in its substitute character, '?' by default);
  // without this, preg_replace() with /u returns NULL on malformed input. Could be separated into another function.
  $v = mb_convert_encoding(mb_convert_encoding($v, 'UTF-16', 'UTF-8'), 'UTF-8', 'UTF-16');

  // Replace every character not allowed in XML 1.0 with U+FFFD REPLACEMENT CHARACTER (written as raw UTF-8 bytes)
  $v = preg_replace('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $v);

  return $v;
}
?>
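
If you only want to know whether a string needs cleaning, the same whitelist can be used as a check instead of a replacement. This is just a rough sketch: the name is_valid_for_xml is made up for illustration and isn't part of the function above, and it assumes the input is meant to be UTF-8.

```php
<?php
// Hypothetical helper (not from the original function): reports whether a
// string can be embedded in XML 1.0 as-is, without modifying it.
function is_valid_for_xml($v) {
  // preg_match() returns 1 if a disallowed character is found and false if
  // the subject is not valid UTF-8 (because of /u); only 0 means "clean".
  return preg_match('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', $v) === 0;
}

var_dump(is_valid_for_xml("plain text\n"));  // bool(true)
var_dump(is_valid_for_xml("bell\x07"));      // bool(false): U+0007 is not allowed in XML 1.0
var_dump(is_valid_for_xml("bad \xC3\x28"));  // bool(false): malformed UTF-8 sequence
?>
```

Comparing against 0 with === matters here: a malformed string makes preg_match() return false, which a loose comparison would treat the same as "no match".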

Posted: 2013-08-30 14:37:40 UTC by Xiven | Cross-references (0) | Comments (2)

Comments

  • Xiven (Registered) (2013-08-30 14:40:51 UTC)

    Note: I'm replacing everything that's not valid with a U+FFFD REPLACEMENT CHARACTER.

  • gab (2017-01-20 08:32:32 UTC)

    Thanks! Finally something working ;)