How do I extract this text using BeautifulSoup?

Time: 2017-11-03 09:17:01

Tags: python beautifulsoup

I am trying to scrape reviews from beeradvocate using BeautifulSoup. The markup for a review looks like this:

[<span class="BAscore_norm">4.49</span>,
 <span class="rAvg_norm">/5</span>,
 u'\xa0\xa0rDev ',
 <span style="color:#006600;">+2%</span>,
 <br/>,
 <span class="muted">look: 4.25 | smell: 4.5 | taste: 4.5 | feel: 4.5 |  
 overall: 4.5</span>,
 <br/>,
 <br/>,
 u'Pours a slightly hazy golden orange with two fingers white head. ',
 <br/>,
 u'\nSmells of citrus, orange, pineapple, sweet malty presence.',
 <br/>,
 u'\nTastes starts with the juicy orange, pineapple. Finishes with a 
 somewhat sweet caramel toffee like malt presence.',
 <br/>,
 u'\nVery smooth medium body. Alchohol was very well hidden until it started 
 to warm a bit.',
 <br/>,
 u'\nOverall a really tasty brew!',
 <br/>,
 <br/>,
 <i aria-hidden="true" class="fa fa-file-text-o"></i>,
 u'\xa0',
 <span class="muted">354 characters</span>,
 <br/>,
 <br/>,
 <div><span class="muted"><a class="username" 
href="/community/members/jbowengeorgia.1171914/">JBowenGeorgia</a>, <a 
href="/beer/profile/26/1558/?ba=JBowenGeorgia#review">Oct 03, 2017</a>
</span></div>]

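I am pretty lost on how to extract the review text. There is a similar question, "Python BeautifulSoup extract text between element", but most of its answers rely on .contents and a fixed position, which does not work here because of the line breaks between the paragraphs of the review.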

2 answers:

Answer 0: (score: 0)

Try this one-liner:

text = ''.join(x for x in soup if type(x) == bs4.NavigableString and not x.startswith(u'\xa0'))

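Here soup corresponds to the tag <div id="rating_fullview_content_2">. I don't know whether you already have this variable, but soup.contents matches the code block you gave in the original question.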
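To make the one-liner easier to try, here is a minimal sketch that runs it on an abbreviated copy of the fragment from the question. The shortened HTML string and the choice of the built-in 'html.parser' (used only to avoid needing lxml) are assumptions for illustration; on the real page you would parse the downloaded HTML instead.

```python
import bs4

# Abbreviated, hand-written copy of the review fragment (an assumption for
# illustration; the real page contains more paragraphs and a footer <div>).
html = (
    '<div id="rating_fullview_content_2">'
    '<span class="BAscore_norm">4.49</span>'
    '<span class="rAvg_norm">/5</span>'
    '\xa0\xa0rDev <span style="color:#006600;">+2%</span><br/>'
    '<span class="muted">look: 4.25 | smell: 4.5</span><br/><br/>'
    'Pours a slightly hazy golden orange with two fingers white head. <br/>'
    '\nSmells of citrus, orange, pineapple, sweet malty presence.<br/>'
    '\nOverall a really tasty brew!<br/><br/>'
    '<i aria-hidden="true" class="fa fa-file-text-o"></i>\xa0'
    '<span class="muted">354 characters</span>'
    '</div>'
)

# Select the container tag, so that iterating over it yields its direct children.
soup = bs4.BeautifulSoup(html, 'html.parser').find(id='rating_fullview_content_2')

# Keep only bare text nodes, skipping those that start with a non-breaking
# space (the "rDev" label and the spacer before the character count).
text = ''.join(
    x for x in soup
    if type(x) == bs4.NavigableString and not x.startswith('\xa0')
)
print(text)
```

Text inside the nested <span> tags (the score, the look/smell line, the character count) is skipped because those children are Tag objects rather than NavigableString objects.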

Answer 1: (score: 0)

Assuming you put the page's original HTML into the html variable:

# -*- coding: utf-8 -*-

import bs4

if __name__ == "__main__":
    # Read the saved page from disk.
    with open('page.html') as page:
        html = page.read()

    soup = bs4.BeautifulSoup(html, 'lxml')

    # Collect every text node that follows the first <br/> as a sibling.
    reviews = soup.br.find_next_siblings(text=True)
    reviews = [x.strip() for x in reviews]  # remove surrounding whitespace
    reviews = [x for x in reviews if x]  # remove empty strings

    for review in reviews:
        print("REVIEW: " + review)

This will give you something like:

REVIEW: Pours a slightly hazy golden orange with two fingers white head.
REVIEW: Smells of citrus, orange, pineapple, sweet malty presence.
REVIEW: Tastes starts with the juicy orange, pineapple. Finishes with a
 somewhat sweet caramel toffee like malt presence.
REVIEW: Very smooth medium body. Alchohol was very well hidden until it started
to warm a bit.
REVIEW: Overall a really tasty brew!
REVIEW: \xa0
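
The sample output still contains a stray non-breaking space and paragraphs that wrap across lines. One possible way to tidy that up is sketched below, assuming Python 3, where str.split() treats \xa0 as whitespace. The helper name review_paragraphs, the abbreviated HTML, and the use of 'html.parser' are assumptions for illustration, not part of the answer above; string=True is the newer spelling of text=True in bs4.

```python
import bs4


def review_paragraphs(container):
    """Return cleaned review paragraphs from the direct children of container.

    Hypothetical helper for illustration; it is not part of bs4.
    """
    first_br = container.find('br')
    if first_br is None:
        return []
    paragraphs = []
    # Only text nodes that are siblings after the first <br/> are considered,
    # so text nested inside <span> or <div> children is ignored.
    for node in first_br.find_next_siblings(string=True):
        # Collapse all whitespace runs (newlines, \xa0) into single spaces.
        cleaned = ' '.join(node.split())
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


# Abbreviated copy of the fragment from the question (an assumption).
html = (
    '<div>'
    '<span class="BAscore_norm">4.49</span>\xa0\xa0rDev '
    '<span style="color:#006600;">+2%</span><br/>'
    '<span class="muted">look: 4.25 | smell: 4.5</span><br/><br/>'
    'Pours a slightly hazy golden orange with two fingers white head. <br/>'
    '\nTastes starts with the juicy orange, pineapple. Finishes with a \n'
    ' somewhat sweet caramel toffee like malt presence.<br/>'
    '\nOverall a really tasty brew!<br/><br/>'
    '<i aria-hidden="true" class="fa fa-file-text-o"></i>\xa0'
    '<span class="muted">354 characters</span>'
    '</div>'
)

container = bs4.BeautifulSoup(html, 'html.parser').div
paragraphs = review_paragraphs(container)
for paragraph in paragraphs:
    print('REVIEW: ' + paragraph)
```

Joining the result with '\n'.join(paragraphs) would give the review as a single string, one paragraph per line.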