Using the XML package parse the input with xmlTreeParse
and then use xpathSApply
to interate over the product_price
class div
nodes. For each such node the anonyous function gets the value of the div
and p
subnodes. The resulting character matrix m
is reworked into a data frame DF
and the columns are cleaned removing any character that is not a dot or digit and also removing any dot followed by a non-digit. Copnvert result to numeric. Note that no special processing for the missing p
case is needed.
# input
Lines <- '<html>
<head></head>
<body>
<div class="product_price" id="product_price_186251">
<p class="normal_encontrado">
S/. 2,799.00
</p>
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 2,299.00
</div>
</div>
<div class="product_price" id="product_price_232046">
<div id="WC_CatalogEntryDBThumbnailDisplayJSPF_10461_div_10" class="price">
S/. 4,999.00
</div>
</div>
</body>
</html>'
# code to read input and produce a data.frame
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
m <- xpathSApply(doc, "//div[@class="product_price"]", function(node) {
list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]])) })
DF <- as.data.frame(t(m), stringsAsFactors = FALSE) # rework into data frame
DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x))) # clean
The result is:
> DF
p div
1 2799 2299
2 NA 4999