xiven.com stating the blatantly obvious since 2002


PHP: Invalid UTF-8 characters in XML, revisited

Back in 2008, I wrote a blog post containing a PHP function to clean up UTF-8 characters that are not valid in XML. Some lines of that function no longer work in newer PHP versions because I used a blacklist rather than a whitelist, and PHP still doesn't seem to have a proper built-in function for this.

Various proposed solutions to this problem can be found on the net (1, 2, 3, 4), but none of those I found actually do it right (some are in fact quite badly wrong).

Here's one that I believe should handle all cases correctly, and it's a fair bit cleaner than my original one.


<?php
function sanitize_for_xml($v) {
  // Clean up invalid UTF-8 byte sequences (mbstring replaces them with its substitute character, '?' by default).
  // This part may not be strictly necessary and could be separated into another function.
  $v = mb_convert_encoding(mb_convert_encoding($v, 'UTF-16', 'UTF-8'), 'UTF-8', 'UTF-16');

  // Replace characters not allowed in XML 1.0 with U+FFFD (REPLACEMENT CHARACTER), written as raw UTF-8 bytes
  $v = preg_replace('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $v);

  return $v;
}
?>
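
If you only want to know whether a string needs cleaning at all, the same whitelist can be turned into a check. This is just a minimal sketch: `is_xml_safe()` is a hypothetical helper name of my own for illustration, and it assumes the mbstring extension is available, as the function above already does.

```php
<?php
// Hypothetical helper (an assumption for illustration): reports whether a
// string can be embedded in XML 1.0 as-is.
function is_xml_safe($v) {
  // A string that is not valid UTF-8 can never be XML-safe
  if (!mb_check_encoding($v, 'UTF-8')) {
    return false;
  }

  // Same whitelist as sanitize_for_xml(): any match means a disallowed character is present
  return preg_match('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u', $v) === 0;
}

var_dump(is_xml_safe("caf\xC3\xA9")); // bool(true): valid UTF-8, all characters allowed
var_dump(is_xml_safe("bell\x07"));    // bool(false): U+0007 is not allowed in XML 1.0
var_dump(is_xml_safe("bad\xFF"));     // bool(false): 0xFF is not valid UTF-8
```

The encoding check comes first because `preg_match()` with the `u` modifier fails on a subject that is not valid UTF-8, so the whitelist on its own cannot tell "no bad characters" apart from "could not be parsed".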

Posted: 2013-08-30 14:37:40 UTC by Xiven