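I'm trying to scrape a forum, but I can't handle the comments because users include emoticons, bold text, quotes of previous messages, and so on.
For example, this is one of the comments I'm having trouble with: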
<div class="content">
<blockquote>
<div>
<cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div>
</blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>
I've been looking for help for the past 4 days without finding anything, so I decided to ask here.
This is the code I'm using:
# Python 3 import; on Python 2 use: from urllib2 import urlopen
from urllib.request import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_messages(url):
    soup = make_soup(url)
    msg = soup.find("div", {"class": "content"})
    # msg holds the whole message, exactly as shown above
    print(msg)
    # Iterating over the children yields:
    # 1. <blockquote> ... </blockquote>
    # 2. <br/>
    # 3. THIS IS THE COMMENT THAT I NEED!
    for item in msg.children:
        print(item)
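I'm looking for a general way to process messages, whatever they contain. Sometimes users put emoticons in the middle of the text, and I need to remove them and get the whole message (in that case BeautifulSoup puts each part of the message (first part, emoticon, second part) in a separate item).
Thanks in advance!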
Answer 0 (score: 1)
Use decompose
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose
decompose() removes a tag you don't want from the tree and destroys it along with its contents. In your case:
soup.blockquote.decompose()
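Or, for all unwanted tags (find_all removes every occurrence and does not fail when a tag is absent):
for name in ['blockquote', 'img', ... ]:
    for tag in soup.find_all(name):
        tag.decompose()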
With your example:
>>> from bs4 import BeautifulSoup
>>> html = """<div class="content">
... <blockquote>
... <div>
... <cite>User write:</cite>
... I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
... </div>
... </blockquote>
... <br/>
... THIS IS THE COMMENT THAT I NEED!
... </div>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('blockquote').decompose()
>>> soup.find("div", {"class" : "content"}).text.strip()
u'THIS IS THE COMMENT THAT I NEED!'
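To cover the case from the question where an emoticon sits in the middle of the wanted text, the same idea can be wrapped in a small helper. This is only a sketch: clean_message, UNWANTED_TAGS and the sample markup are hypothetical names and data made up for illustration, and it assumes the message lives in the first div with class "content".

```python
from bs4 import BeautifulSoup

# Made-up sample: a quote to discard plus an emoticon inside the wanted text.
SAMPLE = """<div class="content">
<blockquote><div><cite>User write:</cite>
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">
</div></blockquote>
<br/>
First part <img class="smilies" alt=":116:" title="116"> second part
</div>"""

# Hypothetical default; extend it with whatever tags the forum uses.
UNWANTED_TAGS = ("blockquote", "img")


def clean_message(html, unwanted=UNWANTED_TAGS):
    """Return the text of the first div.content with unwanted tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    if content is None:
        return ""
    # find() accepts a list of tag names and returns the first match (or None).
    # Searching again after each removal never touches an already-removed tag.
    tag = content.find(list(unwanted))
    while tag is not None:
        tag.decompose()
        tag = content.find(list(unwanted))
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(content.get_text().split())


message = clean_message(SAMPLE)
print(message)
```

Because the img tags are removed before get_text() is called, the two halves of the sentence end up in one string instead of separate items.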
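Sometimes all you have is a starting tag, but what you actually want is whatever comes before or after it. You can use .next_sibling and .previous_sibling to navigate between page elements on the same level of the parse tree: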
>>> html = """<div>No<blockquote>No</blockquote>Yes.<em>Yes!</em>Yes?</div>No!"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> elm = soup.blockquote.next_sibling
>>> txt = ""
>>> while elm:
...     txt += elm.string
...     elm = elm.next_sibling
...
>>> print(txt)
Yes.Yes!Yes?
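One caveat with the loop above: .string is None for a tag that has more than one child (for example bold text inside an em), so txt += elm.string would raise a TypeError on such markup. A possible variant, sketched here with a slightly modified sample string that is not from the original post, iterates over .next_siblings and calls get_text() on tags:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Modified sample: the <em> now has two children, so em.string would be None.
html = "<div>No<blockquote>No</blockquote>Yes.<em>Yes! <b>Really</b></em>Yes?</div>No!"
soup = BeautifulSoup(html, "html.parser")

parts = []
for elm in soup.blockquote.next_siblings:
    if isinstance(elm, Tag):
        # get_text() concatenates all text nested inside the tag.
        parts.append(elm.get_text())
    elif isinstance(elm, NavigableString):
        # Plain text node sitting directly between tags.
        parts.append(str(elm))

text_after_quote = "".join(parts)
print(text_after_quote)
```

The trailing "No!" is not collected because it is a sibling of the div, not of the blockquote.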
Answer 1 (score: 0)
BeautifulSoup has a get_text method. Maybe this is what you want.
From their documentation:
>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> soup.get_text()
u'\nI linked to example.com\n'
>>> soup.i.get_text()
u'example.com'
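Note that on the markup from the question, get_text() on its own also returns the quoted text, so it probably needs to be combined with removing the blockquote first. A short sketch, using a shortened copy of the question's sample and the separator and strip arguments of get_text():

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample from the question.
html = """<div class="content">
<blockquote><div><cite>User write:</cite> I DO NOT WANT THIS</div></blockquote>
<br/>
THIS IS THE COMMENT THAT I NEED!
</div>"""
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="content")

# separator joins the text fragments; strip=True trims each fragment
# and drops the ones that are empty after trimming.
everything = content.get_text(separator=" ", strip=True)
print(everything)  # still contains the quoted text

# Remove the quote, then extract the text again.
content.blockquote.decompose()
only_comment = content.get_text(separator=" ", strip=True)
print(only_comment)
```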
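Answer 2 (score: 0)
If the text you want is never inside any other tag, as in your example, you can use extract() to remove all the child tags together with their contents: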
html = '<div class="content">\
<blockquote>\
<div>\
<cite>User write:</cite>\
I DO NOT WANT THIS <img class="smilies" alt=":116:" title="116">\
</div>\
</blockquote>\
<br/>\
THIS IS THE COMMENT THAT I NEED!\
</div>'
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='content')
tags = div.find_all(recursive=False)
for tag in tags:
    tag.extract()
text = div.get_text(strip=True)
print(text)
This gives:
THIS IS THE COMMENT THAT I NEED!
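To handle emoticons, you would have to do something more complicated. You would probably need to define your own list of emoticons to recognize, then parse the text to find them.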
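As a rough illustration of that idea, the sketch below strips emoticons from already-extracted text with a regular expression. Everything in it is an assumption: TEXT_EMOTICONS is a made-up list, the numeric pattern is guessed from the alt=":116:" attribute in the sample, and strip_emoticons is a hypothetical helper name.

```python
import re

# Hypothetical emoticon list; replace it with the ones the forum actually uses.
TEXT_EMOTICONS = [":)", ":(", ";)", ":D"]
# Assumed pattern for numeric codes such as ":116:" (guessed from the sample's alt attribute).
CODE_PATTERN = r":\d+:"

# One alternation: the numeric-code pattern plus each literal emoticon, escaped.
EMOTICON_RE = re.compile(
    "|".join([CODE_PATTERN] + [re.escape(e) for e in TEXT_EMOTICONS])
)


def strip_emoticons(text):
    """Remove known emoticons and collapse the whitespace they leave behind."""
    without = EMOTICON_RE.sub(" ", text)
    return " ".join(without.split())


cleaned = strip_emoticons("First part :116: second part :) done")
print(cleaned)
```

A plain pattern like this can also hit ordinary punctuation (a colon followed by a closing parenthesis, for instance), so the list is best kept as narrow as the forum allows.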