Question

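I'm trying to scrape a forum, but I can't handle the comments properly, because users add emoticons and bold text, quote previous messages, and so on.

For example, here is one of the comments I'm having trouble with: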

<div class="content">
    <blockquote>
        <div>
            <cite>User write:</cite>
               I DO NOT WANT THIS  <img class="smilies" alt=":116:" title="116">
        </div>
    </blockquote>
    <br/>
    THIS IS THE COMMENT THAT I NEED!
</div>

I have been looking for help for the last 4 days and couldn't find anything, so I decided to ask here.

This is the code I am using:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

from bs4 import BeautifulSoup

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)

    msg = soup.find("div", {"class" : "content"})

    # msg holds the whole message, exactly as shown above
    print(msg)

    # Here I get:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print(item)

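I'm looking for a general way to process the messages, whatever they look like. Sometimes users put emoticons in the middle of the text, and I need to remove them and get the whole message (in that case, BeautifulSoup puts each part of the message (first part, emoticon, second part) in a different item).

Thanks in advance!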

Answer 1

Use decompose: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose

decompose() removes the tags you don't want from the tree. In your case:

soup.blockquote.decompose()

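Or, for all unwanted tags:

# '...' stands for any further tag names you want to drop
for name in ['blockquote', 'img', ... ]:
    # find_all covers every occurrence and is a no-op when nothing matches
    for tag in soup.find_all(name):
        tag.decompose()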

Your example:

>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
...     <blockquote>
...         <div>
...             <cite>User write:</cite>
...                I DO NOT WANT THIS  <img class="smilies" alt=":116:"    title="116">
...         </div>
...     </blockquote>
...     <br/>
...     THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
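To make that a bit more general, here is a minimal sketch that strips every unwanted tag from the first `div.content` and returns the remaining text. The helper name `clean_message`, its `unwanted` parameter, and the extra smiley I put in the middle of the comment are my own additions for illustration, not something from the question. It searches again after each `decompose()` so that tags nested inside an already destroyed tag are never touched.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, plus a smiley inside the comment (assumption)
HTML = """<div class="content">
    <blockquote>
        <div>
            <cite>User write:</cite>
            I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
        </div>
    </blockquote>
    <br/>
    THIS IS <img class="smilies" alt=":116:" title="116"> THE COMMENT THAT I NEED!
</div>"""


def clean_message(html, unwanted=("blockquote", "img")):
    """Return the text of the first div.content with unwanted tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    if content is None:
        return ""
    # Look up one unwanted tag at a time; a fresh search after every removal
    # avoids calling decompose() on a tag that was destroyed with its parent.
    while True:
        tag = content.find(list(unwanted))
        if tag is None:
            break
        tag.decompose()
    # Join the remaining text fragments and collapse runs of whitespace
    return " ".join(content.get_text(" ").split())


print(clean_message(HTML))
```

If you would rather keep the removed tags around for later use, `extract()` can be swapped in for `decompose()`.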

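Update

Sometimes all you have is a tag as a starting point, but what you are actually interested in is the content before or after that starting point. You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree: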

>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...     txt += elm.string
...     elm = elm.next_sibling
...
>>> print(txt)
Yes.Yes!Yes?
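One caveat with the loop above: `.string` is `None` for a tag that has more than one child, so `txt += elm.string` would fail on something like `<em>Yes<b>!</b></em>`. A slightly more forgiving sketch, assuming you want the text of everything that follows the quote, iterates over `.next_siblings` and calls `get_text()` on tags. The helper name `text_after` and the nested `<b>` in the sample markup are hypothetical, just for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_after(start):
    """Concatenate the text of every sibling that follows `start`."""
    parts = []
    for sibling in start.next_siblings:
        if isinstance(sibling, Tag):
            # get_text() copes with tags that contain several children
            parts.append(sibling.get_text())
        else:
            # plain text node between tags
            parts.append(str(sibling))
    return "".join(parts)


html = "<div>No<blockquote>No</blockquote>Yes.<em>Yes<b>!</b></em>Yes?</div>No!"
soup = BeautifulSoup(html, "html.parser")
print(text_after(soup.blockquote))
```

Only siblings inside the same parent are visited, so the trailing "No!" outside the div is left out.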

Answer 2

BeautifulSoup has a get_text method. Maybe this is what you want.

From their documentation:

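from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
# Naming the parser explicitly avoids the "no parser specified" warning
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'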
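`get_text()` also takes a separator and a `strip` flag, which may help when the message is split into several pieces by inline tags. A small sketch on the same markup (the variable names `plain` and `joined` are mine):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Default: text fragments are concatenated as they appear, whitespace included
plain = soup.get_text()

# With a separator and strip=True each fragment is stripped,
# empty fragments are dropped, and the rest is joined by the separator
joined = soup.get_text(" ", strip=True)

print(repr(plain))
print(repr(joined))
```

Keep in mind that on the markup from the question `get_text()` alone would still include the quoted text, so the blockquote has to be removed first.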

Answer 3

If the text you want is never inside any other tag, as in your example, you can use extract() to remove all the tags along with their contents:

html = '<div class="content">\
    <blockquote>\
        <div>\
            <cite>User write:</cite>\
               I DO NOT WANT THIS  <img class="smilies" alt=":116:" title="116">\
        </div>\
    </blockquote>\
    <br/>\
    THIS IS THE COMMENT THAT I NEED!\
</div>'

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
# recursive=False returns only the direct child tags of the div
tags = div.find_all(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)

This gives:

THIS IS THE COMMENT THAT I NEED!

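To handle emoticons, you will have to do something more complicated. You may need to define your own list of emoticons to recognize, and then parse the text to find them.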
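As a rough sketch of that idea, assuming the emoticons show up in the extracted text either as common ASCII smileys or as numeric codes like the `:116:` in the question's `alt` attribute: the `EMOTICONS` list, the `:digits:` pattern, and the `strip_emoticons` helper are all made up for illustration, so adjust them to what the forum actually uses.

```python
import re

# Hypothetical list of textual emoticons; extend it for the forum at hand
EMOTICONS = [":)", ":(", ":D", ";)", ":P"]

# Longest entries first so longer emoticons win over shorter prefixes
_emoticon_re = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)
# Numeric codes such as ":116:" (assumed format, taken from the alt text)
_code_re = re.compile(r":\d+:")


def strip_emoticons(text):
    """Remove known emoticons and numeric smiley codes, then tidy whitespace."""
    text = _emoticon_re.sub(" ", text)
    text = _code_re.sub(" ", text)
    return " ".join(text.split())


print(strip_emoticons("Great :) post :116: thanks ;)"))
```

A plain substring match like this can also hit ordinary text that happens to contain one of the listed sequences, so it is worth checking the result on real messages.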

Extracting content with BeautifulSoup and Python
