Question

I have an HTML document in the following format.

<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

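I want to extract the content of the paragraph tag, including the content of the italic and bold tags, but not the content of the anchor tag. Also, the number at the beginning should possibly be ignored.

Expected output: Content of the paragraph in italic but not strong.

What is the best way to do this?

Also, the following code snippet raises TypeError: argument of type 'NoneType' is not iterable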

soup = BSoup(page)
for p in soup.findAll('p'):
    if '&nbsp;&nbsp;&nbsp;' in p.string:
        print p

Thanks for your suggestions.

Answer 1

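Your code fails because tag.string is only set if the tag has a single child and that child is a NavigableString. Your <p> has several children, so p.string is None.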

You can achieve what you want by extracting the a tags:

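# BeautifulSoup 3 / Python 2 syntax, as in the question.
from BeautifulSoup import BeautifulSoup

s = """<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
# Convert entities such as &nbsp; to their Unicode characters while parsing.
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

for p in soup.findAll('p'):
    # Remove every <a> tag (and its text) from the paragraph.
    for a in p.findAll('a'):
        a.extract()
    # Join the remaining text nodes, including those inside <i> and <b>.
    print ''.join(p.findAll(text=True))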

Answer 2

I think you just need to walk the tags inside p and collect the strings you want.

With lxml, you can use XPath:

import lxml.html as LH
import re

content = '''\
<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>'''

doc = LH.fromstring(content)
ptext = ''.join(doc.xpath('//p/descendant-or-self::*[not(self::a)]/text()'))
# Strip everything up to and including the leading "1." (lazy match, literal dot).
pat = r'^.*?\d+\.\s*'
print(re.sub(pat,'',ptext))

which yields:

Content of the paragraph  in italic  but not  strong  .
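
The "walk the tags and collect strings" idea can also be sketched with only the Python 3 standard library. This is a minimal illustration, not part of the original answer: the class name ParagraphText and the helper paragraph_text are made up for this example, and it assumes <p> and <a> tags are properly closed.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect text found inside <p>, skipping anything inside <a>.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; to U+00A0.
        super().__init__(convert_charrefs=True)
        self.in_p = 0
        self.in_a = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p += 1
        elif tag == "a":
            self.in_a += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p -= 1
        elif tag == "a" and self.in_a:
            self.in_a -= 1

    def handle_data(self, data):
        # Keep text only when we are inside a <p> but not inside an <a>.
        if self.in_p and not self.in_a:
            self.parts.append(data)


def paragraph_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')
text = paragraph_text(html)
# str.split() with no arguments also splits on non-breaking spaces.
print(" ".join(text.split()))
# 1. Content of the paragraph in italic but not strong .
```

The text inside <i> and <b> is kept because only <a> toggles the skip counter; the leading number can then be removed with a regular expression as in the lxml version.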

Answer 3

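The problem you're having with string is that string, as described in the documentation, is only available:

if a tag has only one child node, and that child node is a string

So in your case p.string is None, and you can't iterate over it. To access the tag's content you have to use p.contents (a list that still includes the tags) or p.text (a string with all tags removed).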
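
A short sketch of this behaviour, assuming the newer bs4 package and Python 3 rather than the BeautifulSoup 3 import used elsewhere in this thread (the variable names are only illustrative):

```python
from bs4 import BeautifulSoup

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

print(p.string)             # None: <p> has several children
print(repr(soup.i.string))  # ' in italic ': <i> has exactly one string child

# Checking for None first avoids the TypeError from the question.
has_prefix = p.string is not None and '\xa0' in p.string

# get_text() always returns a string, so it is safe to test directly.
# The parser has already decoded &nbsp; to U+00A0.
starts_with_nbsp = p.get_text().startswith('\xa0\xa0\xa0')
print(has_prefix, starts_with_nbsp)
```

Note that the test in the question looks for the literal text '&nbsp;', while a parsed document usually holds the decoded character U+00A0 instead.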

In your case, you're probably looking for something like this:

>>> ''.join([str(e) for e in soup.p.contents
...                     if not isinstance(e, BeautifulSoup.Tag)
...                        or e.name != 'a'])
'&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> .'

If you also need to remove the '&nbsp;&nbsp;&nbsp;' prefix, I'd use a regular expression to strip that part from the final string.
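
One way such a regular expression might look, as a sketch in Python 3. The names PREFIX and strip_prefix are made up for this example, and the pattern assumes the prefix is optional whitespace or &nbsp; entities followed by a number and a dot:

```python
import re

# Leading whitespace or literal &nbsp; entities, then digits, a dot,
# and any whitespace that follows. For str patterns, \s also matches
# the decoded non-breaking space U+00A0.
PREFIX = re.compile(r'^(?:\s|&nbsp;)*\d+\.\s*')


def strip_prefix(text):
    """Remove a leading '   1. ' style prefix, if present."""
    return PREFIX.sub('', text, count=1)


raw = ('&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
       'but not <b> strong </b> .')
cleaned = strip_prefix(raw)
print(cleaned)
# Content of the paragraph <i> in italic </i> but not <b> strong </b> .
```

Strings without a numeric prefix are returned unchanged, so the helper can be applied to every paragraph.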

Answer 4

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

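If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string. (Quoted from the documentation linked above.)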
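
Applied to the question, get_text() could be combined with removing the anchors first. This is a sketch assuming bs4 and Python 3; the list name texts is only illustrative:

```python
from bs4 import BeautifulSoup

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

texts = []
for p in soup.find_all('p'):
    # Drop every <a> tag together with its text before reading the paragraph.
    for a in p.find_all('a'):
        a.decompose()
    # get_text() concatenates the remaining strings; split/join collapses
    # runs of whitespace, including the leading non-breaking spaces.
    texts.append(' '.join(p.get_text().split()))

print(texts[0])
# 1. Content of the paragraph in italic but not strong .
```

Note that decompose() modifies the parsed tree, so parse the document again if the anchors are needed later.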

Extracting tag content based on content value using BeautifulSoup
