As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an Abstract Syntax Tree and lets you collect all the assignments into a dictionary:
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
<html>
<head>
<title>My Sample Page</title>
<script>
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: '[email protected]',
phone: '9999999999',
name: 'XYZ'
}
});
</script>
</head>
<body>
<h1>What a wonderful world</h1>
</body>
</html>
"""
# get the script tag contents from the html
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid the bs4 warning
script = soup.find('script')
# parse js
parser = Parser()
tree = parser.parse(script.text)
# map each assignment's left-hand name to its right-hand literal ('' when there is none)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}
print(fields)
Prints:
{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'[email protected]'"}
Among other fields, there are email, name and phone, which are the ones you are interested in.
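Note that the values in the printed dictionary still carry their JavaScript quote characters (for example, u"'XYZ'"). A minimal sketch of one way to clean them up and keep only the wanted keys follows. The helper name strip_js_quotes and the wanted tuple are hypothetical, and the sample dictionary is hand-written to mirror the shape of the output above, so it runs without slimit:

```python
def strip_js_quotes(value):
    """Remove one pair of matching quotes around a JavaScript string literal."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


# Hand-written sample mirroring the shape of the dictionary printed above
# (an assumption for illustration; in practice use the `fields` dict itself).
fields = {
    "name": "'XYZ'",
    "url": "'http://www.example.com'",
    "type": '"POST"',
    "phone": "'9999999999'",
    "data": "",
}

# Keep only the keys of interest and strip the surrounding quotes.
wanted = ("name", "phone")
cleaned = {key: strip_js_quotes(fields[key]) for key in wanted if key in fields}
print(cleaned)
```

This only strips the outer quotes; it does not decode escape sequences inside the JavaScript string, which may matter for values that contain escaped quotes or backslashes.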
Hope that helps.