How do I escape quotes in HTML attribute values?

Question

Actually you may need one of these two functions (this depends on the context of use). These functions handle all kinds of string quotes, and also protect from the HTML/XML syntax.

The quoteattr() function for embedding text into HTML/XML:
====
The quoteattr() function is used in a context, where the result will not be evaluated by javascript but must be interpreted by an XML or HTML parser, and it must absolutely avoid breaking the syntax of an element attribute.

Newlines are natively preserved when generating the content of a text element. However, if you’re generating the value of an attribute, the assigned value will be normalized by the DOM as soon as it is set: all whitespace characters (SPACE, TAB, CR, LF) will be compressed, stripping leading and trailing whitespace and reducing every inner run of whitespace to a single SPACE.

But there’s an exception: the CR character will be preserved and not treated as whitespace, only if it is represented with a numeric character reference! The result will be valid for all element attributes, with the exception of attributes of type NMTOKEN or ID, or NMTOKENS: the presence of the referenced CR will make the assigned value invalid for those attributes (for example the id=”…” attribute of HTML elements): this value being invalid, will be ignored by the DOM. But in other attributes (of type CDATA), all CR characters represented by a numeric character reference will be preserved and not normalized. Note that this trick will not work to preserve other whitespaces (SPACE, TAB, LF), even if they are represented by NCR, because the normalization of all whitespaces (with the exception of the NCR to CR) is mandatory in all attributes.

Note that this function itself does not perform any HTML/XML normalization of whitespaces, so it remains safe when generating the content of a text element (don’t pass the second preserveCR parameter for such case).

So if you pass an optional second parameter (whose default will be treated as if it was false) and if that parameter evaluates as true, newlines will be preserved using this NCR, when you want to generate a literal attribute value, and this attribute is of type CDATA (for example a title=”…” attribute) and not of type ID, IDLIST, NMTOKEN or NMTOKENS (for example an id=”…” attribute).

function quoteattr(s, preserveCR) {
    preserveCR = preserveCR ? '&#13;' : '\n';
    return ('' + s) /* Forces the conversion to string. */
        .replace(/&/g, '&amp;') /* This MUST be the 1st replacement. */
        .replace(/'/g, '&apos;') /* The 4 other predefined entities, required. */
        .replace(/"/g, '&quot;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        /*
        You may add other replacements here for HTML only 
        (but it's not necessary).
        Or for XML, only if the named entities are defined in its DTD.
        */ 
        .replace(/\r\n/g, preserveCR) /* Must be before the next replacement. */
        .replace(/[\r\n]/g, preserveCR);
}

Warning! This function still does not check the source string (which is just, in Javascript, an unrestricted stream of 16-bit code units) for its validity in a file that must be a valid plain text source and also as valid source for an HTML/XML document.

It should be updated to detect and reject (by an exception):
- any code units representing code points assigned to non-characters (like \uFFFE and \uFFFF): this is a Unicode requirement only for valid plain-texts;
- any surrogate code units which are not correctly paired to form a valid pair for a UTF-16-encoded code point: this is a Unicode requirement for valid plain-texts;
- any valid pair of surrogate code units representing a valid Unicode code point in supplementary planes, but which is assigned to non-characters (like U+10FFFE or U+10FFFF): this is a Unicode requirement only for valid plain-texts;
- most C0 and C1 controls (in the ranges \u0000..\u001F and \u007F..\u009F with the exception of TAB and newline controls): this is not a Unicode requirement but an additional requirement for valid HTML/XML.

Despite this limitation, the code above is almost what you’ll want to do. Normally, a modern JavaScript engine should provide this function natively in the default system object, but in most cases it does not completely ensure strict plain-text validity, nor HTML/XML validity. The HTML/XML document object from which your JavaScript code is called should redefine this native function.
This limitation is usually not a problem, because the source strings are the result of computing from source strings coming from the HTML/XML DOM.
But this may fail if the JavaScript extracts substrings and breaks pairs of surrogates, or if it generates text from computed numeric sources (converting any 16-bit code value into a string containing that one code unit, and appending those short strings, or inserting them via replacement operations). If you try to insert the encoded string into an HTML/XML DOM text element, attribute value or element name, the DOM will itself reject this insertion and throw an exception; if your JavaScript inserts the resulting string in a local binary file or sends it via a binary network socket, no exception will be thrown for this emission. Such non-plain-text strings would also result from reading a binary file (such as a PNG, GIF or JPEG image file) or from reading a binary-safe network socket (one where the I/O stream passes 16-bit code units rather than just 8-bit units; most binary I/O streams are byte-based anyway, and text I/O streams require that you specify a charset to decode files into plain text, so that invalid encodings found in the text stream throw an I/O exception in your script).
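
The checks listed in this warning could be sketched as follows. This is a minimal sketch and not part of the functions above: the name assertPlainText is hypothetical, and treating NEL (\u0085) as an allowed newline control next to TAB, LF and CR is an assumption based on the "newline controls" exception mentioned in the list.

```javascript
/* Hypothetical helper: returns s as a string, or throws if it is not
   acceptable plain text for an HTML/XML document. */
function assertPlainText(s) {
    s = '' + s;
    for (var i = 0; i < s.length; i++) {
        var start = i;
        var cu = s.charCodeAt(i);
        var cp = cu;
        if (cu >= 0xD800 && cu <= 0xDBFF) {
            /* A high surrogate must be followed by a low surrogate. */
            var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
            if (next < 0xDC00 || next > 0xDFFF)
                throw new Error('unpaired high surrogate at index ' + start);
            cp = 0x10000 + ((cu - 0xD800) << 10) + (next - 0xDC00);
            i++; /* Skip the low surrogate that was just consumed. */
        } else if (cu >= 0xDC00 && cu <= 0xDFFF) {
            throw new Error('unpaired low surrogate at index ' + start);
        }
        /* Non-characters: U+FDD0..U+FDEF and the last two code points of every plane. */
        if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) === 0xFFFE)
            throw new Error('non-character at index ' + start);
        /* C0 and C1 controls, except TAB, LF, CR and (assumption) NEL. */
        var isC0 = cp <= 0x1F && cp !== 0x09 && cp !== 0x0A && cp !== 0x0D;
        var isC1 = cp >= 0x7F && cp <= 0x9F && cp !== 0x85;
        if (isC0 || isC1)
            throw new Error('control character at index ' + start);
    }
    return s;
}
```

Calling assertPlainText(s) before quoteattr(s) would reject the problematic inputs early instead of letting them reach the generated markup.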

Note that this function, the way it is implemented (if it is augmented to correct the limitations noted in the warning above), can also safely be used to quote the content of a literal text element in HTML/XML (to avoid leaving some interpretable HTML/XML elements from the source string value), not just the content of a literal attribute value! So it would be better named quoteml(); the name quoteattr() is kept only by tradition.

This is the case in your example:

data.value = "It's just a \"sample\" <test>.\n\tTry & see yourself!";
var row = '';
row += '<tr>';
row += '<td>Name</td>';
row += '<td><input value="' + quoteattr(data.value) + '" /></td>';
row += '</tr>';

Alternative to `quoteattr()`, using only the DOM API:

The alternative, if the HTML code you generate will be part of the current HTML document, is to create each HTML element individually, using the DOM methods of the document, so that you can set its attribute values directly through the DOM API, instead of inserting the full HTML content using the innerHTML property of a single element:

data.value = "It's just a \"sample\" <test>.\n\tTry & see yourself!";
var row = document.createElement('tr');
var cell = document.createElement('td');
cell.innerText="Name";
row.appendChild(cell);
cell = document.createElement('td');
var input = document.createElement('input');
input.setAttribute('value', data.value);
cell.appendChild(input);
row.appendChild(cell);
/*
The HTML code is generated automatically and is now accessible in the
row.innerHTML property, which you are not required to insert in the
current document.

But you can continue by appending row into a 'tbody' element object, and then
insert this into a new 'table' element object, which you can append or insert
as a child of a DOM object of your document.
*/

Note that this alternative does not attempt to preserve newlines present in data.value, because you’re generating the content of a text element, not an attribute value here. If you really want to generate an attribute value preserving newlines using &#13;, see the start of section 1, and the code within quoteattr() above.

The escape() function for embedding into a javascript/JSON literal string:
====
In other cases, you’ll use the escape() function below when the intent is to quote a string that will be part of a generated javascript code fragment, that you also want to be preserved (that may optionally also be first parsed by an HTML/XML parser in which a larger javascript code could be inserted):

function escape(s) {
    return ('' + s) /* Forces the conversion to string. */
        .replace(/\\/g, '\\\\') /* This MUST be the 1st replacement. */
        .replace(/\t/g, '\\t') /* These 2 replacements protect whitespaces. */
        .replace(/\n/g, '\\n')
        .replace(/\u00A0/g, '\\u00A0') /* Useful but not absolutely necessary. */
        .replace(/&/g, '\\x26') /* These 5 replacements protect from HTML/XML. */
        .replace(/'/g, '\\x27')
        .replace(/"/g, '\\x22')
        .replace(/</g, '\\x3C')
        .replace(/>/g, '\\x3E');
}

Warning! This source code does not check for the validity of the encoded document as a valid plain-text document. However it should never raise an exception (except for an out-of-memory condition): Javascript/JSON source strings are just unrestricted streams of 16-bit code units; they do not need to be valid plain-text and are not restricted by HTML/XML document syntax. This means that the code is incomplete, and should also replace:

- all other code units representing C0 and C1 controls (with the exception of TAB and LF, handled above, which may also be left intact without substituting them), using the \xNN notation;
- all code units that are assigned to non-characters in Unicode, which should be replaced using the \uNNNN notation (for example \uFFFE or \uFFFF);
- all code units usable as Unicode surrogates in the range \uD800..\uDFFF, like this:
  - if they are not correctly paired into a valid UTF-16 pair representing a valid Unicode code point in the full range U+0000..U+10FFFF, these surrogate code units should be individually replaced using the notation \uDNNN;
  - else, if the code point that the code unit pair represents is not valid in Unicode plain-text because it is assigned to a non-character, the two code units should be replaced using the notation \U00NNNNNN;
- finally, if the code point represented by the code unit (or by the pair of code units representing a code point in a supplementary plane), whether that code point is assigned or reserved/unassigned, is also invalid in HTML/XML source documents (see their specification), the code point should be replaced using the \uNNNN notation (if the code point is in the BMP) or the \U00NNNNNN notation (if the code point is in a supplementary plane).
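
The additional replacements in this list could be sketched in a single pass over the code units. This is a minimal sketch under assumptions: the names hex and escapeStrict are hypothetical, and because JavaScript string literals have no \U00NNNNNN escape, a supplementary-plane non-character is written here as two \uNNNN escapes (one per surrogate) instead.

```javascript
/* Hypothetical helper: uppercase hexadecimal, left-padded with zeros. */
function hex(n, width) {
    var h = n.toString(16).toUpperCase();
    while (h.length < width) h = '0' + h;
    return h;
}

/* Hypothetical stricter variant of escape(), working code unit by code unit. */
function escapeStrict(s) {
    s = '' + s;
    var out = '';
    for (var i = 0; i < s.length; i++) {
        var ch = s.charAt(i);
        var cu = s.charCodeAt(i);
        if (ch === '\\') out += '\\\\';
        else if (ch === '\t') out += '\\t';
        else if (ch === '\n') out += '\\n';
        else if (cu === 0xA0) out += '\\u00A0';
        else if ('&\'"<>'.indexOf(ch) >= 0) out += '\\x' + hex(cu, 2);
        else if (cu <= 0x1F || (cu >= 0x7F && cu <= 0x9F)) out += '\\x' + hex(cu, 2);
        else if (cu >= 0xD800 && cu <= 0xDFFF) {
            var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
            var paired = cu <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF;
            if (!paired) {
                out += '\\u' + hex(cu, 4); /* Lone surrogate. */
            } else {
                var cp = 0x10000 + ((cu - 0xD800) << 10) + (next - 0xDC00);
                if ((cp & 0xFFFE) === 0xFFFE) /* Non-character such as U+10FFFF. */
                    out += '\\u' + hex(cu, 4) + '\\u' + hex(next, 4);
                else
                    out += ch + s.charAt(i + 1); /* Valid pair, kept as is. */
                i++;
            }
        }
        else if ((cu >= 0xFDD0 && cu <= 0xFDEF) || cu >= 0xFFFE) out += '\\u' + hex(cu, 4);
        else out += ch;
    }
    return out;
}
```

The checks for code points that are invalid in HTML/XML beyond controls and non-characters are left out of this sketch.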

Note also that the last 5 replacements are not really necessary. But if you don’t include them, you’ll sometimes need to use the <![CDATA[ ... ]]> compatibility “hack”, for example when further including the generated javascript in HTML or XML (see the example below where this “hack” is used in a <script>...</script> HTML element).

The escape() function has the advantage that it does not insert any HTML/XML character reference: the result will be first interpreted by Javascript, and the string will keep its exact length when it is later evaluated at runtime by the javascript engine. It saves you from having to manage mixed contexts throughout your application code (see the final section about them and about the related security considerations). If you used quoteattr() in this context instead, the javascript evaluated and executed later would have to explicitly handle character references to re-decode them, which would not be appropriate. Usage cases include:

- when the replaced string will be inserted in a generated javascript event handler surrounded by some other HTML code, where the javascript fragment will contain attributes surrounded by literal quotes;
- when the replaced string will be part of a setTimeout() parameter which will be later eval()ed by the Javascript engine.

Example 1 (generating only JavaScript, no HTML content generated):

var title = "It's a \"title\"!";
var msg   = "Both strings contain \"quotes\" & 'apostrophes'...";
setTimeout(
    '__forceCloseDialog("myDialog", "' +
        escape(title) + '", "' +
        escape(msg) + '")',
    2000);

Example 2 (generating valid HTML):

var msg =
    "It's just a \"sample\" <test>.\n\tTry & see yourself!";
/* This is similar to the above, but this JavaScript code will be reinserted below: */ 
var scriptCode = 'alert("' +
    escape(msg) + /* important here!, because part of a JS string literal */
    '");';

/* First case (simple when inserting in a text element): */
document.write(
    '<script type="text/javascript">' +
    '\n//<![CDATA[\n' + /* (not really necessary but improves compatibility) */
    scriptCode +
    '\n//]]>\n' +       /* (not really necessary but improves compatibility) */
    '</script>');

/* Second case (more complex when inserting in an HTML attribute value): */
document.write(
    '<span onclick="' +
    quoteattr(scriptCode) + /* important here, because part of an HTML attribute */
    '">Click here !</span>');

In this second example, you see that both encoding functions are used together: the part of the generated text that is embedded in a JavaScript literal is encoded using escape(), and the generated JavaScript code (containing that string literal) is itself embedded and re-encoded using quoteattr(), because that JavaScript code is inserted in an HTML attribute (in the second case).
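
The two layers of encoding can be traced step by step on a short string. This is only an illustrative sketch: the two functions are repeated in trimmed form (without the newline and \u00A0 handling) so that the snippet runs on its own, and the variable names sample, jsLiteral, handler and attr are hypothetical.

```javascript
/* Trimmed copies of quoteattr() and escape(), for illustration only. */
function quoteattr(s) {
    return ('' + s)
        .replace(/&/g, '&amp;').replace(/'/g, '&apos;').replace(/"/g, '&quot;')
        .replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
function escape(s) {
    return ('' + s)
        .replace(/\\/g, '\\\\').replace(/\t/g, '\\t').replace(/\n/g, '\\n')
        .replace(/&/g, '\\x26').replace(/'/g, '\\x27').replace(/"/g, '\\x22')
        .replace(/</g, '\\x3C').replace(/>/g, '\\x3E');
}

var sample = 'Tom & "Jerry"';

/* Layer 1: content of a JavaScript string literal. */
var jsLiteral = escape(sample);
/* jsLiteral is now:  Tom \x26 \x22Jerry\x22 */

/* The quotes of the JavaScript literal are added by the surrounding code. */
var handler = 'alert("' + jsLiteral + '");';
/* handler is now:    alert("Tom \x26 \x22Jerry\x22"); */

/* Layer 2: the whole JavaScript fragment becomes an HTML attribute value. */
var attr = quoteattr(handler);
/* attr is now:       alert(&quot;Tom \x26 \x22Jerry\x22&quot;); */
```

Decoding happens in the reverse order: the HTML parser first turns &quot; back into the quotes of the JavaScript literal, and only then does the JavaScript engine interpret the \xNN escapes.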

General considerations for safely encoding texts to embed in syntactic contexts:
====
So in summary,

- the quoteattr() function must be used when generating the content of an HTML/XML attribute literal, where the surrounding quotes are added externally within a concatenation to produce complete HTML/XML code.
- the escape() function must be used when generating the content of a JavaScript string constant literal, where the surrounding quotes are added externally within a concatenation to produce complete JavaScript code.
If used carefully, everywhere you find variable contents to insert safely into another context, and only under these rules (with the functions implemented exactly as above, which take care of the “special characters” used in both contexts), you may mix both via multiple escaping: the transform will still be safe, and will not require additional code to decode the literals in the application using them. Do not use these functions in any other way.

Those functions are only safe in those strict contexts (i.e. only HTML/XML attribute values for quoteattr(), and only Javascript string literals for escape()).

There are other contexts using different quoting and escaping mechanisms (e.g. SQL string literals, or Visual Basic string literals, or regular expression literals, or text fields of CSV data files, or MIME header values), which will each require their own distinct escaping function used only in these contexts:

Never assume that quoteattr() or escape() will be safe or will not alter the semantic of the escaped string, before checking first, that the syntax of (respectively) HTML/XML attribute values or JavaScript string literals will be natively understood and supported in those contexts.
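For example, the syntax of Javascript string literals generated by escape() is close to what two other contexts accept, namely string literals in Java source code and text values in JSON data, but it is not identical: neither of them accepts the \xNN notation, so for those contexts the five HTML/XML-protecting replacements would have to emit \u0026, \u0027, \u0022, \u003C and \u003E instead.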

But the reverse is not always true. For example:

- Interpreting escaped literals that were initially generated for contexts other than Javascript string literals (for example string literals in PHP source code) is not always safe when they are used directly as Javascript literals, for instance through the javascript eval() system function, to decode string literals that were not escaped using escape(). Those other string literals may contain special characters generated specifically for their initial context, which Javascript will interpret incorrectly. This could include additional escapes such as “\Uxxxxxxxx”, “\e”, “${var}” and “$$”, additional concatenation operators such as ' + " which change the quoting style, or “transparent” delimiters such as “<!--” and “-->” or “<![CDATA[” and “]]>” (which may be found, and be safe, only within a different, more complex context supporting multiple escaping syntaxes: see the last paragraph of this section about mixed contexts).
- The same applies to the interpretation/decoding of escaped literals that were initially generated for contexts other than HTML/XML attribute values in documents created using their standard textual representation (for example, trying to interpret string literals that were generated for embedding in a non-standard binary representation of HTML/XML documents!).
- This also applies to the interpretation/decoding with the javascript function eval() of string literals that were only safely generated for inclusion in HTML/XML attribute literals using quoteattr(): this will not be safe, because the contexts have been incorrectly mixed.
- This also applies to the interpretation/decoding with an HTML/XML text document parser of attribute value literals that were only safely generated for inclusion in a Javascript string literal using escape(): this will not be safe, because the contexts have also been incorrectly mixed.

Safely decoding the value of embedded syntactic literals:
====
If you want to decode or interpret string literals in contexts where the decoded string values will be used interchangeably and indistinctly, without change, in another context (so-called mixed contexts: for example, naming some identifiers in HTML/XML with string literals initially safely encoded with quoteattr(), or naming some Javascript variables from strings initially safely encoded with escape(), and so on), you’ll need to prepare and use a new escaping function, which will also check the validity of the string value before encoding it (or reject it, or truncate/simplify/filter it). You’ll also need a new decoding function, which will carefully avoid interpreting valid but unsafe sequences that are only accepted internally and are not acceptable from unsafe external sources. This means that a decoding function such as eval() in javascript must be absolutely avoided for decoding JSON data sources; use a safer native JSON decoder instead. A native JSON decoder will not interpret valid Javascript sequences, such as quoting delimiters included in the literal expression, operators, or sequences like “{$var}”. All of this is needed to enforce the safety of such a mapping!
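
The difference between a native JSON decoder and eval() can be sketched with the standard JSON.parse() function. This is a minimal sketch: the helper name decodeJsonString and the variable names are hypothetical. It also shows that JSON.parse() rejects the \xNN notation, which is not a JSON escape.

```javascript
/* Hypothetical helper: decodes a single JSON string literal (quotes included). */
function decodeJsonString(literal) {
    var value = JSON.parse(literal); /* Throws SyntaxError unless the input is valid JSON. */
    if (typeof value !== 'string')
        throw new TypeError('expected a JSON string literal');
    return value;
}

/* A well-formed literal is decoded without executing anything. */
var decoded = decodeJsonString('"a \\"quoted\\" word"');

/* Inputs that eval() would accept or execute are refused here. */
var rejected = [];
['"a" + "b"', '"\\x27"', '42'].forEach(function (source) {
    try {
        decodeJsonString(source);
    } catch (e) {
        rejected.push(e.name);
    }
});
/* rejected now holds one error name per input: nothing was decoded. */
```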

These last considerations about the decoding of literals in mixed contexts, which were only safely encoded with a syntax that makes the transported data safe in a single, more restrictive context, are absolutely critical for the security of your application or web service. Never mix those contexts between the encoding place and the decoding place if those places do not belong to the same security realm (and even in that case, using mixed contexts is always very dangerous, because it is very difficult to track precisely in your code).

For this reason I recommend you never use or assume mixed contexts anywhere in your application: instead write a safe encoding and decoding function for a single precise context that has precise length and validity rules on the decoded string values, and precise length and validity rules on the encoded string literals. Ban those mixed contexts: for each change of context, use another matching pair of encoding/decoding functions (which function is used in this pair depends on which context is embedded in the other context; and the pair of matching functions is also specific to each pair of contexts).

This means that:

- To safely decode an HTML/XML attribute value literal that has been initially encoded with quoteattr(), you must not assume that it has been encoded using other named entities whose value will depend on a specific DTD defining it. You must instead initialize the HTML/XML parser to support only the few default named character entities generated by quoteattr() and optionally the numeric character entities (which are also safe in such a context: the quoteattr() function only generates a few of them and could generate more of these numeric character references, but it must not generate other named character entities which are not predefined in the default DTD). All other named entities must be rejected by your parser, as being invalid in the source string literal to decode. Alternatively you’ll get better performance by defining an unquoteattr() function (which will reject any presence of literal quotes within the source string, as well as unsupported named entities).
- To safely decode a Javascript string literal (or JSON string literal) that has been initially encoded with escape(), you must use the safe unescape() function defined below, but not the unsafe Javascript eval() function!

Examples for these two associated safe decoding functions follow.

The unquoteattr() function to parse text embedded in HTML/XML text elements or attribute values literals:
====

function unquoteattr(s) {
    /*
    Note: this can be implemented more efficiently by a loop searching for
    ampersands, from start to end of the source string, and parsing the
    character(s) found immediately after the ampersand.
    */
    s = ('' + s); /* Forces the conversion to string type. */
    /*
    You may optionally start by detecting CDATA sections (like
    <![CDATA[ … ]]>), whose contents must not be reparsed by the
    following replacements, but separated, filtered out of the CDATA
    delimiters, and then concatenated into an output buffer.
    The following replacements are only for sections of source text
    found outside such CDATA sections, that will be concatenated
    in the output buffer only after all the following replacements and
    security checks.

    This will require a loop starting here.

    The following code is only for the alternate sections that are
    not within the detected CDATA sections.
    */
    /* Decode by reversing the initial order of replacements. */
    s = s
        .replace(/\r\n/g, '\n') /* To do before the next replacement. */
        .replace(/[\r\n]/g, '\n')
        .replace(/&#13;&#10;/g, '\n') /* These 3 replacements keep whitespaces. */
        .replace(/&#1[03];/g, '\n')
        .replace(/&#9;/g, '\t')
        .replace(/&gt;/g, '>') /* The 4 other predefined entities required. */
        .replace(/&lt;/g, '<')
        .replace(/&quot;/g, '"')
        .replace(/&apos;/g, "'");
    /*
    You may add other replacements here for predefined HTML entities only
    (but it's not necessary). Or for XML, only if the named entities are
    defined in *your* assumed DTD.
    But you can add these replacements only if these entities will *not*
    be replaced by a string value containing *any* ampersand character.
    Do not decode the '&amp;' sequence here!

    If you choose to support more numeric character entities, their
    decoded numeric value *must* be assigned characters or unassigned
    Unicode code points, but *not* surrogates or assigned non-characters,
    and *not* most C0 and C1 controls (except a few ones that are valid
    in HTML/XML text elements and attribute values: TAB, LF, CR, and
    NL='\x85').

    If you find valid Unicode code points that are invalid characters
    for XML/HTML, this function *must* reject the source string as
    invalid and throw an exception.

    In addition, the four possible representations of newlines (CR, LF,
    CR+LF, or NL) *must* be decoded only as if they were '\n' (U+000A).

    See the XML/HTML reference specifications!
    */
    /* Required check for security! Any remaining ampersand must start '&amp;'. */
    if (/&(?!amp;)/.test(s))
        throw new Error('unsafe entity found in the attribute literal content');
    /* This MUST be the last replacement. */
    s = s.replace(/&amp;/g, '&');
    /*
    The loop needed to support CDATA sections will end here.
    This is where you'll concatenate the replaced sections (CDATA or
    not), if you have split the source string to detect and support
    these CDATA sections.

    Note that all backslashes found in CDATA sections do NOT have the
    semantic of escapes, and are *safe*.

    On the opposite, CDATA sections not properly terminated by a
    matching `]]>` section terminator are *unsafe*, and must be rejected
    before reaching this final point.
    */
    return s;
}

Note that this function does not parse the surrounding quote delimiters which are used to surround HTML attribute values. This function can in fact decode any HTML/XML text element content as well, possibly containing literal quotes, which are safe. It’s your responsibility to parse the HTML code to extract quoted strings used in HTML/XML attributes, and to strip those matching quote delimiters before calling the unquoteattr() function.
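
The loop over ampersands mentioned in the function's first comment could be sketched with a single replace() call and a callback, so that each character reference is decoded exactly once and the order of replacements no longer matters. This is a minimal sketch under assumptions: the names ENTITY_VALUES and unquoteattrOnce are hypothetical, and CDATA sections are not handled.

```javascript
/* Hypothetical table of the only character references accepted. */
var ENTITY_VALUES = {
    'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"', 'apos': "'",
    '#9': '\t', '#10': '\n', '#13': '\n'
};

/* Hypothetical single-pass variant of unquoteattr(). */
function unquoteattrOnce(s) {
    return ('' + s)
        .replace(/\r\n?/g, '\n')          /* Literal CR+LF or CR becomes LF. */
        .replace(/&#13;&#10;/g, '&#10;')  /* A referenced CR+LF counts as one newline. */
        .replace(/&([^;&]*);?/g, function (match, name) {
            var terminated = match.charAt(match.length - 1) === ';';
            if (!terminated || !Object.prototype.hasOwnProperty.call(ENTITY_VALUES, name))
                throw new Error('unsafe entity found in the attribute literal content');
            return ENTITY_VALUES[name];
        });
}
```

Because the scan resumes after each decoded reference, an input such as `&amp;lt;` yields the text `&lt;` and is never decoded a second time.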

The unescape() function to parse text contents embedded in Javascript/JSON literals:
====

function unescape(s) {
    /*
    Note: this can be implemented more efficiently by a loop searching for
    backslashes, from start to end of source string, and parsing and
    dispatching the character found immediately after the backslash, if it
    must be followed by additional characters such as an octal or
    hexadecimal 7-bit ASCII-only encoded character, or a hexadecimal Unicode
    encoded valid code point, or a valid pair of hexadecimal UTF-16-encoded
    code units representing a single Unicode code point.

    8-bit encoded code units for non-ASCII characters should not be used, but
    if they are, they should be decoded into 16-bit code units keeping their
    numeric value, i.e. like the numeric value of an equivalent Unicode
    code point (which means ISO 8859-1, not Windows 1252, including C1 controls).

    Note that Javascript or JSON does NOT require code units to be paired when
    they encode surrogates; and Javascript/JSON will also accept any Unicode
    code point in the valid range representable as UTF-16 pairs, including
    NULL, all controls, and code units assigned to non-characters.
    This means that all code points in \U00000000..\U0010FFFF are valid,
    as well as all 16-bit code units in \u0000..\uFFFF, in any order.
    It's up to your application to restrict these valid ranges if needed.
    */
    s = ('' + s) /* Forces the conversion to string. */
    /* Decode by reversing the initial order of replacements. */
        .replace(/\\x3E/g, '>')
        .replace(/\\x3C/g, '<')
        .replace(/\\x22/g, '"')
        .replace(/\\x27/g, "'")
        .replace(/\\x26/g, '&') /* These 5 replacements protect from HTML/XML. */
        .replace(/\\u00A0/g, '\u00A0') /* Useful but not absolutely necessary. */
        .replace(/\\n/g, '\n')
        .replace(/\\t/g, '\t'); /* These 2 replacements protect whitespaces. */
    /*
    You may optionally add here support for other numerical or symbolic
    character escapes.
    But you can add these replacements only if these entities will *not*
    be replaced by a string value containing *any* backslash character.
    Do not decode to any doubled backslashes here!
    */
    /* Required check for security! Once the doubled backslashes are set
       aside, no other backslash may remain. */
    if (/\\/.test(s.replace(/\\\\/g, '')))
        throw new Error('Unsafe or unsupported escape found in the literal string content');
    /* This MUST be the last replacement. */
    return s.replace(/\\\\/g, '\\');
}

Note that this function does not parse the surrounding quote delimiters which are used to surround Javascript or JSON string literals. It’s your responsibility to parse the Javascript or JSON source code to extract quoted string literals, and to strip those matching quote delimiters before calling the unescape() function.
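
The loop over backslashes mentioned in the function's first comment could be sketched with a single replace() call and a callback. Chained replacements cannot tell an escaped backslash followed by the letter n from an escaped newline; a single left-to-right scan can. This is a minimal sketch under assumptions: the names ESCAPE_VALUES and unescapeOnce are hypothetical, and only the escapes produced by escape() (with uppercase hexadecimal digits) are accepted.

```javascript
/* Hypothetical table of the only escape sequences accepted. */
var ESCAPE_VALUES = {
    '\\\\': '\\', '\\t': '\t', '\\n': '\n', '\\u00A0': '\u00A0',
    '\\x26': '&', '\\x27': "'", '\\x22': '"', '\\x3C': '<', '\\x3E': '>'
};

/* Hypothetical single-pass variant of unescape(). */
function unescapeOnce(s) {
    /* Each match is a backslash plus what follows it: xNN, uNNNN, or one character. */
    var pattern = /\\(?:x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|[\s\S])?/g;
    return ('' + s).replace(pattern, function (match) {
        if (!Object.prototype.hasOwnProperty.call(ESCAPE_VALUES, match))
            throw new Error('Unsafe or unsupported escape found in the literal string content');
        return ESCAPE_VALUES[match];
    });
}
```

A trailing lone backslash, an unknown escape such as `\q`, and a truncated `\x2` all end up in the callback with a sequence missing from the table, so they are rejected.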
