Tag content not returned in BeautifulSoup

Time: 2017-06-07 23:36:59

Tags: python beautifulsoup lxml

I have the following string that I am trying to extract from:

<item>
<dc:creator><![CDATA[Chris M]]></dc:creator>
<pubDate>Tue, 06 Jun 2017 07:38:23 +0000</pubDate>
</item>

I am trying to get Chris M and the other authors' names with:

from bs4 import BeautifulSoup

# `response` holds the XML feed text fetched earlier
soup = BeautifulSoup(response, "lxml")
items = soup.findAll("item")
for i in items:
    author = i.find('dc:creator')
    print(author)

Output:

<dc:creator></dc:creator>

How do I get the name from inside the tag?

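2 answers:

Answer 0: (score: 0)

This worked for me using Python 3 (https://repl.it/languages/python3).

Specify the parser as xml. The "lxml" argument used in the question selects lxml's HTML parser, which discards the CDATA section, while "xml" selects lxml's XML parser, which keeps its content as text: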

import bs4 as bs
content="""
<collection>
    <item><dc:creator><![CDATA[Chris M]]></dc:creator></item>
    <item><dc:creator><![CDATA[Harris A]]></dc:creator></item>
</collection>
"""

soup = bs.BeautifulSoup(content, 'xml')

items = soup.findAll("item")
for i in items:
   author = i.find('creator')
   print(author.string)

Output:

Chris M
Harris A
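
The approach in this answer can be wrapped in a small helper. What follows is only a minimal sketch, assuming lxml is installed (the "xml" parser depends on it). The name extract_authors and the third sample item without an author are hypothetical additions for illustration, not part of BeautifulSoup or of the original feed. The lookup compares only the part of the tag name after any prefix, so it should not matter whether the parser keeps or drops the dc: prefix.

```python
from bs4 import BeautifulSoup


def is_creator(tag):
    """Match <creator> or <dc:creator>, whichever name the parser reports."""
    return tag.name.split(":")[-1] == "creator"


def extract_authors(xml_text):
    """Return the author names found inside each <item>.

    Hypothetical helper; assumes a feed shaped like the sample in this answer.
    """
    # "xml" selects lxml's XML parser, which keeps CDATA content as text.
    soup = BeautifulSoup(xml_text, "xml")
    authors = []
    for item in soup.find_all("item"):
        creator = item.find(is_creator)
        # Skip items without an author instead of raising AttributeError.
        if creator is not None:
            authors.append(creator.get_text(strip=True))
    return authors


sample = """
<collection>
    <item><dc:creator><![CDATA[Chris M]]></dc:creator></item>
    <item><dc:creator><![CDATA[Harris A]]></dc:creator></item>
    <item><pubDate>Tue, 06 Jun 2017 07:38:23 +0000</pubDate></item>
</collection>
"""

print(extract_authors(sample))
```

Returning a list rather than printing inside the loop makes it easier to reuse the names later, and the None check keeps one malformed item from stopping the whole run.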

Answer 1: (score: 0)

BeautifulSoup represents a CDATA section as a CData object, a subclass of NavigableString, so you can check for instances of it.

>>> from bs4 import BeautifulSoup, CData
>>> text = """<item>
... <dc:creator><![CDATA[Chris M]]></dc:creator>
... <pubDate>Tue, 06 Jun 2017 07:38:23 +0000</pubDate>
... </item>"""
>>> # Name html.parser explicitly: the lxml HTML parser drops the CDATA section
>>> soup = BeautifulSoup(text, "html.parser")
>>> for item in soup.findAll(text=True):
...     if isinstance(item, CData):
...         print(item)
...
Chris M
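
The same isinstance check can be collected into a reusable function. This is a minimal sketch, assuming the parser is named explicitly as html.parser (bundled with Python). When no parser is given, BeautifulSoup picks lxml if it is installed, and as the question shows, lxml's HTML parser drops the CDATA content. The name cdata_strings is hypothetical, not a BeautifulSoup API.

```python
from bs4 import BeautifulSoup, CData


def cdata_strings(markup):
    """Return the contents of every CDATA section in `markup`.

    Hypothetical helper; relies on html.parser preserving CDATA as CData nodes.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # find_all(string=True) yields every text node; keep only the CData ones.
    return [str(node) for node in soup.find_all(string=True)
            if isinstance(node, CData)]


text = """<item>
<dc:creator><![CDATA[Chris M]]></dc:creator>
<pubDate>Tue, 06 Jun 2017 07:38:23 +0000</pubDate>
</item>"""

print(cdata_strings(text))
```

Note that this picks up every CDATA section in the document, not only the ones inside dc:creator, so it fits best when author names are the only CDATA content in the feed.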