HTML/XSS escape on input vs output

In addition to what has been written already:

  • Precisely because you have a variety of output formats, you cannot guarantee that all of them will need HTML escaping. If you are serving data over a JSON API, you have no idea whether the client needs it for an HTML page or for text output (e.g. an email). Why should you force your client to unescape “Jack &amp; Jill” to get “Jack & Jill”?

  • You are corrupting your data by default.

    • When someone does a keyword search for ‘amp’, they get “Jack & Jill” as a result. Why? Because you’ve corrupted your data: what is actually stored is “Jack &amp; Jill”.

    • Suppose one of the inputs is a URL: http://example.com/?x=1&y=2. You want to parse this URL and extract the y parameter if it exists. This silently fails, because your URL has been corrupted into http://example.com/?x=1&amp;y=2.

  • It’s simply the wrong layer to do it – HTML-related concerns should not be mixed up with raw HTTP handling. The database shouldn’t be storing things that are tied to one possible output format.

  • XSS and SQL injection are not the only security problems. There are issues for every output you deal with, such as the filesystem (think extensions like ‘.php’ that cause web servers to execute code), SMTP (think newline characters), and any number of others. Thinking you can “deal with security on input and then forget about it” decreases security. Rather, you should be delegating escaping to specific backends that don’t trust their input data.

  • You shouldn’t be doing HTML escaping “all over the place”. You should be doing it exactly once for every output that needs it – just like with any escaping for any backend. For SQL, you should be doing SQL escaping once; the same goes for SMTP, etc. Usually, you won’t be doing any escaping yourself – you’ll be using a library that handles it for you.

    If you are using sensible frameworks/libraries, this is not hard. I never manually apply SQL/SMTP/HTML escaping in my web apps, and I never have XSS/SQL injection vulnerabilities. If your method of building web pages requires you to remember to apply escaping, or end up with a vulnerability, you are doing it wrong.

  • Doing escaping at the form/http input level doesn’t ensure safety, because nothing guarantees that data doesn’t get into your database or system from another route. You’ve got to manually ensure that all inputs to your system are applying HTML escaping.

    You may say that you don’t have other inputs, but what if your system grows? It’s often too late to go back and change your decision, because by this time you’ve got a ton of data, and you may have to stay compatible with external interfaces (e.g. public APIs) that all expect the data to be HTML escaped.

  • Even web inputs to the system are not safe, because often you have another layer of encoding applied, e.g. you might need base64-encoded input at some entry point. Your automatic HTML escaping will miss any HTML encoded within that data. So you will have to do HTML escaping again, remember to do it, and keep track of where you have done it.
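
To make the corruption points above concrete, here is a minimal Python sketch. The escape_on_input function is a hypothetical stand-in for whatever input filter a system might apply; it is not taken from any particular framework. The sketch assumes a recent Python 3, where parse_qs splits query strings on ‘&’ only.

```python
import base64
import html
from urllib.parse import parse_qs, urlsplit


def escape_on_input(value: str) -> str:
    """Hypothetical input filter: HTML-escape everything before storing it."""
    return html.escape(value)


def query_params(url: str) -> dict:
    """Parse the query string of a URL into a dict of lists."""
    return parse_qs(urlsplit(url).query)


raw_name = "Jack & Jill"
raw_url = "http://example.com/?x=1&y=2"

# What ends up in the database if escaping happens on input.
stored_name = escape_on_input(raw_name)
stored_url = escape_on_input(raw_url)

# A keyword search for 'amp' matches the stored value but not the real one.
search_hit_raw = "amp" in raw_name
search_hit_stored = "amp" in stored_name

# Looking up the 'y' parameter works on the raw URL only.
params_raw = query_params(raw_url)
params_stored = query_params(stored_url)

# Another encoding layer: base64 text contains no characters that
# html.escape touches, so the filter passes the payload through unchanged.
encoded = base64.b64encode(b"<script>alert(1)</script>").decode("ascii")
filtered = escape_on_input(encoded)
decoded_later = base64.b64decode(filtered).decode("ascii")

print(stored_name)
print(params_raw.get("y"), params_stored.get("y"))
print(decoded_later)
```

The stored name now contains ‘amp’, the ‘y’ lookup on the stored URL comes back empty, and the decoded base64 payload still contains the script tag even though it went through the filter.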

I’ve expanded on these here: http://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/
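
The alternative argued for above, storing raw data and letting each output handle its own escaping exactly once, might look roughly like the following Python sketch. The table and function names (comments, save_comment, render_html and so on) are made up for illustration, and the explicit html.escape calls stand in for what an auto-escaping template library would normally do for you.

```python
import html
import json
import sqlite3


def save_comment(conn: sqlite3.Connection, author: str, body: str) -> None:
    # SQL backend: parameter binding handles quoting; the data is stored raw.
    conn.execute(
        "INSERT INTO comments (author, body) VALUES (?, ?)", (author, body)
    )


def render_html(author: str, body: str) -> str:
    # HTML output: escape here, once, at the point of output.
    return f"<p><b>{html.escape(author)}</b>: {html.escape(body)}</p>"


def render_json(author: str, body: str) -> str:
    # JSON output: the JSON encoder applies JSON's own escaping, not HTML's.
    return json.dumps({"author": author, "body": body})


def render_text(author: str, body: str) -> str:
    # Plain-text output (e.g. an email body): no HTML escaping at all.
    return f"{author}: {body}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")
save_comment(conn, "Jack & Jill", "<script>alert('hi')</script>")

author, body = conn.execute("SELECT author, body FROM comments").fetchone()

html_out = render_html(author, body)
json_out = render_json(author, body)
text_out = render_text(author, body)

print(html_out)
print(json_out)
print(text_out)
```

The database holds “Jack & Jill” exactly as entered, so searching and parsing work on the real value, while the HTML output is the only place where the ampersand and the script tag are escaped.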
