xiven.com stating the blatantly obvious since 2002

Learning Curve

Some lessons have to be learned the hard way. Today's lesson was to treat Unicode input carefully.

Last night I received an automated e-mail from my web server's control panel warning me that memory usage had been at 80% for the past 30 minutes. That was somewhat worrying, so I looked at where the memory was going. The vast majority seemed to be going to Apache threads, and there were many more of them than usual. Not having time to look into it in depth, and knowing that the server had been running for a long time, I restarted Apache and left it until the next day.

The next day, things were still working fine, and I went in to work. Checking IRC when I got in, I found not one but two messages quoting the following line from #whatwg on Freenode:

[22:02] <Philip`> The error message on http://www.xiven.com/weblog/search/results?q=%00 seems quite peculiar

The error message in question was an out-of-memory error from PHP in my XML parsing script. The connection with the odd memory events of last night was made...

At first I thought it was just a problem with null characters (%00 is the URL-encoded form of the null byte, and nulls are often problematic). But %00 (U+0000) was not the only input that caused the crash: %ef%bf%bf (U+FFFF) and %ef%bf%be (U+FFFE) did too, as did invalid UTF-8 byte sequences. Additionally, ASCII control characters such as %01 (U+0001) to %08 (U+0008), although they did not cause the out-of-memory error, made the XML generated by the website not well-formed, producing the infamous "Yellow Screen of Death" in Gecko browsers.
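To make those categories concrete, here is a minimal sketch (not the site's actual code) that tests a decoded query value against the Char production from the XML 1.0 specification. The names XML_CHAR_PATTERN and isXmlSafe() are made up for illustration, and %ff and caf%c3%a9 are my own sample inputs standing in for "an invalid byte sequence" and "ordinary text".

<?php
// Hypothetical constant: the XML 1.0 "Char" production as a PCRE pattern.
// Only tab, LF, CR and the listed ranges may appear in an XML 1.0 document.
const XML_CHAR_PATTERN =
        '/^[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*$/u';

// Hypothetical helper: true if $s is valid UTF-8 made up solely of XML 1.0 characters.
function isXmlSafe(string $s): bool {
        // Check the encoding first, because preg_match() with /u fails on malformed UTF-8
        return mb_check_encoding($s, 'UTF-8') && preg_match(XML_CHAR_PATTERN, $s) === 1;
}

// Decode each sample query value with rawurldecode() and report the verdict
foreach (['%00', '%ef%bf%bf', '%ef%bf%be', '%01', '%ff', 'caf%c3%a9'] as $q) {
        printf("%-10s %s\n", $q, isXmlSafe(rawurldecode($q)) ? 'ok' : 'rejected');
}
?>

Only the last sample should come out as ok; everything else either is not valid UTF-8 or falls outside the ranges XML 1.0 allows.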

The following PHP code was somewhat hastily added to the main script for the website:


<?php
array_walk_recursive($_GET, 'stripInvalid');
array_walk_recursive($_POST, 'stripInvalid');

// Sanitize one input value in place; array_walk_recursive() passes each
// value by reference together with its key
function stripInvalid(&$v, $k) {
        // Round-trip through UTF-16 so that invalid UTF-8 byte sequences
        // are replaced by mbstring's substitute character
        $v = mb_convert_encoding(mb_convert_encoding($v, 'UTF-16', 'UTF-8'), 'UTF-8', 'UTF-16');

        // Remove control characters not allowed (or discouraged) in XML
        $v = preg_replace('/[\x{0000}-\x{0008}]/u', '', $v); // C0 control characters
        $v = preg_replace('/[\x{000B}\x{000C}]/u', '', $v); // Vertical tab & Form feed
        $v = preg_replace('/[\x{000E}-\x{001F}]/u', '', $v); // C0 control characters
        $v = preg_replace('/[\x{007F}-\x{009F}]/u', '', $v); // DEL and C1 control characters

        // Surrogates (U+D800-U+DFFF) need no pattern: they cannot occur in
        // valid UTF-8, and PCRE2 rejects them in /u patterns

        // Replace Unicode noncharacters
        $v = preg_replace('/[\x{FDD0}-\x{FDEF}]/u', '?', $v);
        $v = preg_replace('/[\x{FFFE}\x{FFFF}]/u', '?', $v);
        $v = preg_replace('/[\x{1FFFE}\x{1FFFF}\x{2FFFE}\x{2FFFF}\x{3FFFE}\x{3FFFF}]/u', '?', $v);
        $v = preg_replace('/[\x{4FFFE}\x{4FFFF}\x{5FFFE}\x{5FFFF}\x{6FFFE}\x{6FFFF}]/u', '?', $v);
        $v = preg_replace('/[\x{7FFFE}\x{7FFFF}\x{8FFFE}\x{8FFFF}\x{9FFFE}\x{9FFFF}]/u', '?', $v);
        $v = preg_replace('/[\x{AFFFE}\x{AFFFF}\x{BFFFE}\x{BFFFF}\x{CFFFE}\x{CFFFF}]/u', '?', $v);
        $v = preg_replace('/[\x{DFFFE}\x{DFFFF}\x{EFFFE}\x{EFFFF}\x{FFFFE}\x{FFFFF}]/u', '?', $v);
        $v = preg_replace('/[\x{10FFFE}\x{10FFFF}]/u', '?', $v);
}
?>

Hopefully that covers everything (I checked various docs online as to which characters should be excluded from XML). I haven't yet had the chance to look in more depth at why PHP was consuming all available memory and apparently crashing Apache threads on certain characters; whether the cause is in my code or in PHP's XML parser itself, I'm not sure yet. But for now at least, sanitizing all input variables should keep things relatively safe. I have little doubt though that there must be a better way…
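One candidate for that better way, offered as a sketch under assumptions rather than something that has run on this site: scrub the encoding once, then remove everything outside the XML 1.0 Char production with a single negated character class. The function name sanitizeForXml() is hypothetical, and mb_scrub() requires PHP 7.2 or later.

<?php
// Hypothetical single-pattern sanitizer; needs PHP 7.2+ for mb_scrub()
function sanitizeForXml(string $v): string {
        // Replace ill-formed byte sequences with mbstring's substitute character
        $v = mb_scrub($v, 'UTF-8');

        // Drop every code point outside the XML 1.0 "Char" production in one pass
        $clean = preg_replace(
                '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
                '',
                $v
        );

        // preg_replace() returns null on failure; fall back to an empty string
        return $clean ?? '';
}

// Apply it to every request value, nested arrays included
$walker = function (&$v, $k) { $v = sanitizeForXml($v); };
array_walk_recursive($_GET, $walker);
array_walk_recursive($_POST, $walker);
?>

This is deliberately narrower than the list-based version: it removes only what XML 1.0 forbids outright, so C1 control characters and the noncharacters above U+FFFF (which the specification merely discourages) pass through, and offending characters are dropped rather than replaced with a question mark. Widening the pattern would cover those if desired.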

Posted: 2008-03-15 02:20:24 UTC by Xiven | Cross-references (1) | Comments (3)

Comments