Dealing with malformed XML [duplicate]

I know this isn’t the answer you want – but the XML spec is quite clear and strict.

Malformed XML is fatal.

If it doesn’t work in a validator, then your code should not even attempt to “fix” it, any more than you’d try and automatically ‘fix’ some program code.

From The Annotated XML Specification:

fatal error
[Definition:] An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document’s logical structure to the application in the normal way).

And specifically the commentary on why: “Draconian” error-handling

We want XML to empower programmers to write code that can be transmitted across the Web and execute on a large number of desktops. However, if this code must include error-handling for all sorts of sloppy end-user practices, it will of necessity balloon in size to the point where it, like Netscape Navigator, or Microsoft Internet Explorer, is tens of megabytes in size, thus defeating the purpose.

If you’ve ever tried to put together a parser for HTML, you’ll realise why it needs to be this way – you end up writing SO MANY handlers for edge cases, bad tag nestings and implicit tag closure that your code is a mess right from the start.

And because it’s my favourite post on Stack Overflow – here is an example of why: RegEx match open tags except XHTML self-contained tags

Now I appreciate this isn’t always an option, and you probably wouldn’t come here if asking your upstream to ‘fix your XML’ was the path of least resistance. However, I would still urge you to report it as a defect in the application that generates the XML, and as much as possible resist pressure to ‘fix’ it programmatically – because as you’ve rightly figured out, you’re building yourself a world of pain when the right answer is ‘fix the problem at source’.

If you are really stuck on this road then, as Sinan Ünür points out, your only option is to trap where your parser failed, and then inspect and try to repair as you go. But you won’t find an XML parser that’ll do it for you, because the ones that do are by definition broken.
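To make "trap where the parser failed" a bit more concrete, here is a minimal sketch in Python (not the language of the original question, just a convenient one for illustration). It assumes the standard library's xml.etree.ElementTree, whose ParseError carries a position attribute; the helper name locate_parse_failure and the sample document are made up for this example. It only tells you where the parser gave up. What you do about it is still a manual decision.

```python
import xml.etree.ElementTree as ET


def locate_parse_failure(xml_text):
    """Return None if xml_text is well-formed, else (line, column, offending_line).

    Hypothetical helper: it does not repair anything, it only reports
    where the parser stopped so a human (or a targeted fix) can look there.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based
        lines = xml_text.splitlines()
        offending = lines[line - 1] if 0 < line <= len(lines) else ""
        return line, column, offending
    return None


# Made-up sample: the attribute value on line 2 is not quoted, which is fatal.
sample = "<orders>\n  <order id=1>widget</order>\n</orders>"

failure = locate_parse_failure(sample)
if failure is not None:
    line, column, offending = failure
    print(f"parser stopped at line {line}, column {column}: {offending!r}")
```

Note that the parser stops at the first fatal error, so a document with several problems needs several rounds of trap, inspect and repair.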

I would suggest that first you:

Dig out a copy of the spec, to show to whoever’s asked you to do this.
Point out to them that the whole reason we have standards is to promote interoperability.
By doing something that deliberately violates the standard, you are therefore taking a business risk – you are creating code that may one day mysteriously break, because using things like regular expressions or automatic fixing builds in a set of assumptions that may not hold true.
A useful concept here is technical debt – explain you’re incurring technical debt by automatic fixing, for something that’s really not your problem.
Then ask them if they wish to accept that risk.
If they do think that’s an acceptable risk, then just get on with it – you may find it worth, effectively, ignoring the fact that your source data looks like XML and treating it as if it were plain text: use regular expressions to extract the pertinent data lines, etc.
Stick an apology in the comments to your future maintenance programmer, explaining who made the decision and why.
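If the risk does get accepted, the "treat it as plain text" route might look something like the sketch below. Everything in it is an assumption for illustration: the feed contents, the price element name, and the helper extract_prices are invented, and the pattern only works for as long as the upstream output keeps looking like this sample. That is exactly the technical debt described above.

```python
import re

# Made-up feed: it has an unescaped "&" and a missing </item>, so no
# conforming XML parser will accept it.
raw = """<items>
  <item><name>Fish & Chips</name><price>7.50</price></item>
  <item><name>Tea</name><price>1.20</price>
</items>"""

# APOLOGY TO FUTURE MAINTAINERS: this is deliberately NOT an XML parse.
# The upstream feed is malformed and the decision was made to accept the
# risk of scraping it as text. Record here who decided and why.
# Assumption baked in: each price is a plain decimal inside <price>...</price>
# with no attributes, comments or CDATA.
PRICE = re.compile(r"<price>\s*([0-9]+(?:\.[0-9]+)?)\s*</price>")


def extract_prices(text):
    """Pull price values out of the raw text, ignoring document structure."""
    return [float(value) for value in PRICE.findall(text)]


prices = extract_prices(raw)
print(prices)
```

If upstream ever adds an attribute to the price tag or changes the number format, this silently returns fewer values rather than failing loudly, so it is worth checking the count of matches against what you expect.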

Also might be useful as a reference point: Which character should not be set as values in XML file
