I am trying to scrape reviews from beeradvocate using BeautifulSoup. The markup for a single review looks like this:
[<span class="BAscore_norm">4.49</span>,
<span class="rAvg_norm">/5</span>,
u'\xa0\xa0rDev ',
<span style="color:#006600;">+2%</span>,
<br/>,
<span class="muted">look: 4.25 | smell: 4.5 | taste: 4.5 | feel: 4.5 |
overall: 4.5</span>,
<br/>,
<br/>,
u'Pours a slightly hazy golden orange with two fingers white head. ',
<br/>,
u'\nSmells of citrus, orange, pineapple, sweet malty presence.',
<br/>,
u'\nTastes starts with the juicy orange, pineapple. Finishes with a somewhat sweet caramel toffee like malt presence.',
<br/>,
u'\nVery smooth medium body. Alchohol was very well hidden until it started to warm a bit.',
<br/>,
u'\nOverall a really tasty brew!',
<br/>,
<br/>,
<i aria-hidden="true" class="fa fa-file-text-o"></i>,
u'\xa0',
<span class="muted">354 characters</span>,
<br/>,
<br/>,
<div><span class="muted"><a class="username"
href="/community/members/jbowengeorgia.1171914/">JBowenGeorgia</a>, <a
href="/beer/profile/26/1558/?ba=JBowenGeorgia#review">Oct 03, 2017</a>
</span></div>]
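I am stuck on how to extract the review text. There is a similar question, "Python BeautifulSoup extract text between element", but most of its answers rely on .contents and a fixed position in that list, which does not work here because of the line breaks between the paragraphs of the review.
Answer 0 (score: 0)
Try this one-liner: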
text = ''.join(x for x in soup if type(x) == bs4.NavigableString and not x.startswith(u'\xa0'))
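Here soup corresponds to the tag <div id="rating_fullview_content_2">. I don't know whether you already have that in a variable, but soup.contents matches the code block you gave in your original question.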
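For anyone who wants to try the idea in isolation, here is a minimal sketch of the same filter. It assumes Python 3 and the built-in html.parser backend; SAMPLE is a shortened, made-up stand-in for the review markup, and direct_text is a hypothetical helper name, not part of bs4.

```python
import bs4

# Shortened, hypothetical stand-in for the review markup in the question.
SAMPLE = (
    '<div id="rating_fullview_content_2">'
    '<span class="BAscore_norm">4.49</span>'
    '<span class="rAvg_norm">/5</span>'
    '\xa0\xa0rDev <span style="color:#006600;">+2%</span><br/>'
    '<span class="muted">look: 4.25 | smell: 4.5</span><br/><br/>'
    'Pours a slightly hazy golden orange with two fingers white head. <br/>'
    '\nOverall a really tasty brew!<br/><br/>'
    '<i aria-hidden="true" class="fa fa-file-text-o"></i>\xa0'
    '<span class="muted">354 characters</span>'
    '</div>'
)


def direct_text(tag):
    """Join the tag's direct text children, skipping any that start with a
    non-breaking space (the 'rDev' label and the spacer before the counter)."""
    return ''.join(
        child for child in tag.children
        if isinstance(child, bs4.NavigableString)
        and not child.startswith('\xa0')
    )


soup = bs4.BeautifulSoup(SAMPLE, 'html.parser')
div = soup.find('div', id='rating_fullview_content_2')
text = direct_text(div)
print(text)
```

Text inside the nested span tags (the score, the "look/smell" line, the character count) is left out because only the div's direct children are inspected.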
Answer 1 (score: 0)
Assuming you have the page's initial HTML in the html variable:
# -*- coding: utf-8 -*-
import bs4

if __name__ == "__main__":
    with open('page.html') as page:
        html = page.read()
    soup = bs4.BeautifulSoup(html, 'lxml')
    # text nodes that are siblings of, and come after, the first <br/>
    reviews = soup.br.find_next_siblings(text=True)
    reviews = map(lambda x: x.strip(), reviews)  # remove whitespace
    reviews = filter(lambda x: bool(x), reviews)  # remove empty strings
    for review in reviews:
        print("REVIEW:", review)
This will give you something like:
REVIEW: Pours a slightly hazy golden orange with two fingers white head.
REVIEW: Smells of citrus, orange, pineapple, sweet malty presence.
REVIEW: Tastes starts with the juicy orange, pineapple. Finishes with a
somewhat sweet caramel toffee like malt presence.
REVIEW: Very smooth medium body. Alchohol was very well hidden until it started
to warm a bit.
REVIEW: Overall a really tasty brew!
REVIEW: \xa0
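If the trailing non-breaking-space entry is unwanted, one possible variant is sketched below. It assumes Python 3 (where str.strip() also removes '\xa0') and a bs4 version that accepts the string= argument; SAMPLE is again a shortened, made-up stand-in for the real page, and review_paragraphs is a hypothetical helper name.

```python
import bs4

# Shortened, hypothetical stand-in for one review container.
SAMPLE = (
    '<div id="rating_fullview_content_2">'
    '<span class="BAscore_norm">4.49</span><br/>'
    '<span class="muted">look: 4.25 | smell: 4.5</span><br/><br/>'
    'Pours a slightly hazy golden orange with two fingers white head. <br/>'
    '\nSmells of citrus, orange, pineapple, sweet malty presence.<br/>'
    '\nOverall a really tasty brew!<br/><br/>'
    '<i aria-hidden="true" class="fa fa-file-text-o"></i>\xa0'
    '<span class="muted">354 characters</span>'
    '</div>'
)


def review_paragraphs(container):
    """Return the stripped, non-empty text nodes that follow the first <br/>
    as siblings inside the container."""
    first_break = container.find('br')
    if first_break is None:
        return []
    strings = first_break.find_next_siblings(string=True)
    # strip() removes surrounding whitespace, including non-breaking spaces,
    # so whitespace-only nodes become empty and are dropped here.
    return [s.strip() for s in strings if s.strip()]


soup = bs4.BeautifulSoup(SAMPLE, 'html.parser')
div = soup.find('div', id='rating_fullview_content_2')
paragraphs = review_paragraphs(div)
for paragraph in paragraphs:
    print("REVIEW:", paragraph)
```

Returning a list instead of chaining map and filter keeps the result reusable, for example to join the paragraphs back into a single review with '\n'.join(paragraphs).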