How can I grab CData out of BeautifulSoup?

When grabbing CData with BeautifulSoup, be careful not to use the lxml parser.

By default, the lxml parser strips CDATA sections from the tree and replaces them with their plain text content, so there are no CData nodes left to find. The lxml documentation describes this behavior.

# Trying it with html.parser

>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s = '''<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        aaaaaaaaaaaaa
    ]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(string=lambda node: isinstance(node, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>> 
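
If the document contains more than one CDATA section, the same check can be passed to find_all. The following is only a minimal sketch: the helper name extract_cdata and the sample document doc are made up for illustration, and it assumes the html.parser builder keeps CDATA sections as CData nodes, as in the session above.

```python
from bs4 import BeautifulSoup, CData


def extract_cdata(markup, parser="html.parser"):
    """Return the stripped text of every CDATA section found in markup.

    extract_cdata is a hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all(string=...) calls the function on each text node;
    # keep only the nodes that the parser preserved as CData objects.
    sections = soup.find_all(string=lambda node: isinstance(node, CData))
    return [section.strip() for section in sections]


# Hypothetical sample document with two CDATA sections.
doc = """<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[ first ]]></bar>
    <bar><![CDATA[ second ]]></bar>
</foo>"""

print(extract_cdata(doc))  # ['first', 'second']
```

Passing "lxml" as the parser argument would be expected to return an empty list, since the CDATA content then arrives as ordinary text rather than as CData nodes.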
