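Extracting content with BeautifulSoup and Python

Asked: 2015-11-04 21:48:51

Tags: python web-scraping beautifulsoup

I'm trying to scrape a forum, but I can't handle the comments, because users add emoticons and bold text, quote previous messages, and so on.

For example, this is one of the comments I'm having trouble with: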

<div class="content">
    <blockquote>
        <div>
            <cite>User write:</cite>
               I DO NOT WANT THIS  <img class="smilies" alt=":116:" title="116">
        </div>
    </blockquote>
    <br/>
    THIS IS THE COMMENT THAT I NEED!
</div>

I've been looking for help for the last 4 days without finding anything, so I decided to ask here.

This is the code I'm using:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

from bs4 import BeautifulSoup

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)

    msg = soup.find("div", {"class" : "content"})

    # msg holds the whole message, exactly as shown above
    print(msg)

    # Iterating over the children yields:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print(item)

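I'm looking for a general way to handle the messages, whatever they look like. Sometimes users put emoticons in the middle of the text, and I need to remove them and get the whole message (in that case BeautifulSoup puts each part of the message, i.e. the first part, the emoticon and the second part, in a separate item).

Thanks in advance!

3 answers:

Answer 0: (score: 1)

Use decompose: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose

It removes a tag you don't want, together with its contents, from the tree. In your case: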

soup.blockquote.decompose()

Or, for all unwanted tags:

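for name in ['blockquote', 'img', ... ]:
    # find_all() covers every occurrence and simply yields nothing when a tag is absent
    for tag in soup.find_all(name):
        tag.decompose()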

With your example:

>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
...     <blockquote>
...         <div>
...             <cite>User write:</cite>
...                I DO NOT WANT THIS  <img class="smilies" alt=":116:"    title="116">
...         </div>
...     </blockquote>
...     <br/>
...     THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
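
To make the same idea reusable across comments, the removal step can be wrapped in a small helper. This is only a sketch: the function name `clean_comment`, the `UNWANTED` list and the sample markup with an emoticon in the middle of the text are assumptions for illustration, not part of the original answer. It uses `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Assumed list of tags that count as noise; adjust it for the forum being scraped.
UNWANTED = ["blockquote", "img"]

def clean_comment(html, unwanted=UNWANTED):
    """Return the text of div.content with the unwanted tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    if content is None:
        return ""
    # Re-run find() after each removal so a tag nested inside an
    # already removed tag is never touched a second time.
    tag = content.find(unwanted)
    while tag is not None:
        tag.decompose()
        tag = content.find(unwanted)
    # Collapse the leftover whitespace into single spaces.
    return " ".join(content.get_text().split())

sample = """<div class="content">
    <blockquote><div><cite>User write:</cite> I DO NOT WANT THIS</div></blockquote>
    <br/>
    FIRST PART <img class="smilies" alt=":116:" title="116"> SECOND PART
</div>"""

print(clean_comment(sample))
```

Because the emoticon image is removed before the text is read, the two text fragments around it end up in one string instead of separate items.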

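Update

Sometimes all you have is a tag as a starting point, but what you are actually interested in is the content before or after it. You can use .next_sibling and .previous_sibling to navigate between page elements on the same level of the parse tree: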

>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...    txt += elm.string
...    elm = elm.next_sibling
... 
>>> print(txt)
Yes.Yes!Yes?
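
One caveat with the loop above: `.string` is None for a tag that has more than one child, so concatenating it would fail on richer markup. A hedged variant, assuming Python 3 and using the hypothetical helper name `text_after`, walks `.next_siblings` and calls `get_text()` on tags instead:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def text_after(start):
    """Concatenate the text of every sibling that follows `start`."""
    parts = []
    for sib in start.next_siblings:
        if isinstance(sib, Tag):
            # get_text() also handles tags with nested children
            parts.append(sib.get_text())
        elif isinstance(sib, NavigableString):
            parts.append(str(sib))
    return "".join(parts)

html = "<div>No<blockquote>No</blockquote>Yes.<em>Yes <b>really</b>!</em>Yes?</div>No!"
soup = BeautifulSoup(html, "html.parser")
print(text_after(soup.blockquote))
```

Note that HTML comments are also NavigableString objects, so this sketch would include their text; filter them out if the forum markup contains any.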

Answer 1: (score: 0)

BeautifulSoup has a get_text method. Maybe this is what you want.

From their documentation:

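from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'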
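
Applied to the markup from the question, get_text() on its own returns the quoted text as well, so it usually has to be combined with removing the quote first. The sketch below only illustrates that point with the documented `separator` and `strip` arguments; the shortened sample markup and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

html = """<div class="content">
    <blockquote><div><cite>User write:</cite> I DO NOT WANT THIS</div></blockquote>
    <br/>
    THIS IS THE COMMENT THAT I NEED!
</div>"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="content")

# strip=True trims each text fragment and drops the empty ones;
# the first argument is the separator placed between fragments.
all_text = content.get_text(" | ", strip=True)
print(all_text)

# Removing the quote first leaves only the comment itself.
content.blockquote.decompose()
own_text = content.get_text(strip=True)
print(own_text)
```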

Answer 2: (score: 0)

If the text you want is never inside any other tag, as in your example, you can use extract() to remove all the tags along with their contents:

html = '<div class="content">\
    <blockquote>\
        <div>\
            <cite>User write:</cite>\
               I DO NOT WANT THIS  <img class="smilies" alt=":116:" title="116">\
        </div>\
    </blockquote>\
    <br/>\
    THIS IS THE COMMENT THAT I NEED!\
</div>'

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
# Direct child tags only; find_all is the current name for findAll
tags = div.find_all(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)

This gives:

THIS IS THE COMMENT THAT I NEED!

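To handle emoticons, you will have to do something more complicated. You will probably need to define your own list of emoticons to recognize, and then parse the text to find them.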
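
A rough sketch of that idea follows. The `EMOTICONS` list, the helper name `strip_emoticons` and the sample string are made up for illustration; the only thing taken from the question is that image smilies carry class="smilies". Image smilies are removed from the tree, and textual emoticons are removed with a regular expression built from the list.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list of textual emoticons; extend it for the forum at hand.
EMOTICONS = [":)", ":(", ":D", ";)"]

# Longest first, so a longer emoticon is not cut short by a shorter one.
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

def strip_emoticons(html):
    """Return the text of `html` without image smilies or listed emoticons."""
    soup = BeautifulSoup(html, "html.parser")
    # Image smilies: assumes they are <img class="smilies">, as in the question.
    for img in soup.find_all("img", class_="smilies"):
        img.decompose()
    text = EMOTICON_RE.sub(" ", soup.get_text())
    # Collapse the gaps left behind by the removed emoticons.
    return " ".join(text.split())

sample = 'Hello :) <img class="smilies" alt=":116:" title="116"> world ;) !'
print(strip_emoticons(sample))
```

A plain list match like this can also hit character sequences that merely look like emoticons, for example inside URLs or code, so the list and the pattern would need tuning against real forum data.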