Why does this XPath fail using lxml in Python?

1. Browsers frequently change the HTML

Browsers frequently alter the HTML served to them to make it “valid”. For example, suppose you serve a browser this invalid HTML:

<table>
  <p>bad paragraph</p>
  <tr><td>Note that cells and rows can be unclosed (and valid) in HTML
</table>

To render it, the browser tries to be helpful by repairing the markup, and may convert it to:

<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>

The markup changes because <p> elements cannot sit directly inside a <table>, and browsers insert a <tbody> around rows that lack one. Which changes are applied to the source can vary by browser: some put invalid elements before the table, some after, some inside cells, and so on.
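To see the difference from the parser's side, here is a minimal sketch that feeds the same invalid markup to lxml. It assumes lxml is installed; the variable names (RAW, via_tbody, via_td) are only illustrative. lxml's HTML parser does not add the <tbody> that a browser would, so a path that goes through <tbody> comes back empty.

```python
from lxml import html  # third-party package: pip install lxml

# The invalid markup from above, as a server might send it.
RAW = """<table>
  <p>bad paragraph</p>
  <tr><td>Note that cells and rows can be unclosed (and valid) in HTML
</table>"""

doc = html.fromstring(RAW)

# A path copied from the browser's repaired DOM relies on <tbody>,
# which is not present in what lxml parsed.
via_tbody = doc.xpath("//table/tbody/tr/td")

# A path that does not depend on the inserted element still matches.
via_td = doc.xpath("//td")

print(len(via_tbody), len(via_td))
```

Printing the parsed tree with html.tostring(doc) is a quick way to check what lxml actually built before writing any XPath against it.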

2. XPaths aren’t fixed; they are flexible in pointing to elements.

Using this ‘fixed’ HTML:

<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>

If we try to target the text of the <td> cell, all of the following will give approximately the right information:

//td
//tr/td
//tbody/tr/td
/table/tbody/tr/td
/table//*/text()

And the list goes on…
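As a rough check of that claim, the sketch below runs several of these expressions against the ‘fixed’ HTML with lxml and compares what they select. It uses relative forms (starting with //) because lxml.html wraps a fragment in its own <html> and <body> elements, so a path rooted at /table would not match there. The names FIXED, paths and texts are illustrative.

```python
from lxml import html  # third-party package: pip install lxml

FIXED = """<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>"""

doc = html.fromstring(FIXED)

# Progressively more specific paths that all point at the same cell.
paths = ["//td", "//tr/td", "//tbody/tr/td", "//table/tbody/tr/td"]

# Each entry is the list of elements selected by one path.
matches = [doc.xpath(p) for p in paths]

# Collect the distinct cell texts found across all paths.
texts = {m[0].text for m in matches}

for path, found in zip(paths, matches):
    print(path, "->", len(found), "match(es)")
```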

However, a browser will generally give you the most precise (and least flexible) XPath, one that lists every element from the DOM. In this case:

/table[1]/tbody[1]/tr[1]/td[1]/text()
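One detail worth keeping in mind with paths like this: XPath positional predicates count from 1, not 0. A small sketch with lxml, using a made-up two-cell table purely for illustration:

```python
from lxml import html  # third-party package: pip install lxml

# Hypothetical two-cell row, only to show how positions are counted.
doc = html.fromstring("<table><tr><td>first</td><td>second</td></tr></table>")

first = doc.xpath("//td[1]/text()")  # position 1 is the first cell
zero = doc.xpath("//td[0]/text()")   # no node ever has position 0

print(first)  # ['first']
print(zero)   # []
```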

3. Conclusion: Browser-given XPaths are usually unhelpful

This is why XPaths produced by developer tools frequently fail against the raw HTML: they describe the browser’s repaired DOM, not the markup the server actually sent.

The solution: always refer to the raw HTML and use a flexible but precise XPath.

Examine the actual HTML that holds the price:

<table border="0" cellspacing="0" cellpadding="0">
    <tr>
        <td>
            <font class="pricecolor colors_productprice">
                <div class="product_productprice">
                    <b>
                        <font class="text colors_text">Price:</font>
                        <span itemprop="price">$149.95</span>
                    </b>
                </div>
            </font>
            <br/>
            <input type="image" src="https://stackoverflow.com/v/vspfiles/templates/MAKO/images/buttons/btn_updateprice.gif" name="btnupdateprice" alt="Update Price" border="0"/>
        </td>
    </tr>
</table>

If you want the price, there is actually only one place to look!

//span[@itemprop="price"]/text()

And this will return:

$149.95
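Put together with lxml, this might look like the following sketch. It assumes lxml is installed and that the markup is already in a string; here the snippet above is pasted in as PRICE_HTML (an illustrative name) rather than fetched from the site.

```python
from lxml import html  # third-party package: pip install lxml

# The raw markup shown above, pasted in as a string for the example.
PRICE_HTML = """<table border="0" cellspacing="0" cellpadding="0">
    <tr>
        <td>
            <font class="pricecolor colors_productprice">
                <div class="product_productprice">
                    <b>
                        <font class="text colors_text">Price:</font>
                        <span itemprop="price">$149.95</span>
                    </b>
                </div>
            </font>
            <br/>
        </td>
    </tr>
</table>"""

doc = html.fromstring(PRICE_HTML)

# Target the one attribute that identifies the price, not the layout around it.
prices = doc.xpath('//span[@itemprop="price"]/text()')

print(prices)  # ['$149.95']
```

Because the expression keys on the itemprop attribute, it keeps working whether or not a <tbody> is present and however the surrounding table is nested.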
