Which are the HTML, and XML, special characters?

First, you’re comparing an HTML 4.01 specification with an HTML 5 one. HTML 5 ties in more closely with XML than HTML 4.01 ever did (that’s why we have XHTML), so this answer will stick to HTML 5 and XML.

Your quoted references are all consistent on the following points:

  • < should always be represented with &lt; when it is not part of markup (a tag, comment, or processing instruction)
  • > should always be represented with &gt; when it is not part of markup
  • & should always be represented with &amp; when it does not begin an entity or character reference
  • the exception to all three is text within <![CDATA[ ]]> (which only applies to XML)
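As a minimal sketch of those three rules, assuming Python and its standard library (the sample string is made up for illustration), both the XML and the HTML helpers produce the same escaped text:

```python
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape

# Hypothetical sample text containing all three special characters.
raw = "if (a < b && b > c) {}"

# xml.sax.saxutils.escape replaces &, < and > in element text.
xml_text = xml_escape(raw)

# html.escape does the same for HTML; quote=False leaves ' and " alone.
html_text = html_escape(raw, quote=False)

print(xml_text)   # if (a &lt; b &amp;&amp; b &gt; c) {}
print(html_text)  # if (a &lt; b &amp;&amp; b &gt; c) {}
```

Neither call touches spaces or quotes, which matches the points the references agree on.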

I agree 100% with this. You never want the parser to mistake literals for markup, so it’s a solid idea to always encode these characters (the space is a separate case; see below). A conforming XML parser treats everything within <![CDATA[ ]]> as character data rather than markup, so the encoding is not necessary there.
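To illustrate the CDATA exception, here is a rough sketch in Python. The helper name to_cdata is hypothetical, not part of any library, and the only sequence it has to guard against is "]]>", which would otherwise end the section early:

```python
import xml.etree.ElementTree as ET

def to_cdata(raw: str) -> str:
    """Wrap raw text in a CDATA section (hypothetical helper)."""
    # "]]>" cannot appear inside a CDATA section, so split it across two sections.
    return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>"

raw = "a < b && c ]]> d"
doc = "<r>" + to_cdata(raw) + "</r>"

# The parser hands back the original text; nothing was encoded with &lt; or &amp;.
parsed = ET.fromstring(doc).text
print(parsed == raw)  # True
```

This only applies to XML; an HTML parser does not treat <![CDATA[ ]]> this way in ordinary HTML content.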

In practice, I never encode ' or " unless

  • they appear within the value of an attribute (XML or HTML)
  • they appear within the text of XML elements. (<tag>&quot;Yoinks!&quot;, he said.</tag>)
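For the attribute case, a small sketch in Python, assuming the standard library's html.escape (the tag and attribute names are made up): with quote=True it also encodes both quote characters, so the value cannot terminate the attribute early.

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical attribute value containing both kinds of quote.
raw = '"Yoinks!", he said & it\'s <fine>'

# quote=True additionally turns " into &quot; and ' into &#x27;.
attr_value = escape(raw, quote=True)
element = f'<tag title="{attr_value}"/>'
print(element)

# Parsing the element gives back the original, unescaped value.
round_trip = ET.fromstring(element).get("title")
print(round_trip == raw)  # True
```

If you only ever delimit attributes with double quotes, encoding " alone would be enough; encoding both is simply the safer default.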

Both specifications allow this; neither requires ' or " to be encoded in text content.

So, the only point of contention is the (space). Either specification mentions it only in the context of serialization. Everywhere else, you should always use a literal (space). Unless you are writing your own parser, I don’t see the need to do any kind of serialization yourself, so this is beside the point.
