Text is not scraped correctly - Python

Time: 2016-12-29 15:20:03

Tags: python html beautifulsoup html-parsing

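I need to scrape the text from the HTML below. My code does not work correctly when the tag and class names are the same. Here I need the text as a single list element, not as two separate list elements. The code I have written handles the case where the text is not split across blocks as shown below. In my case, I need to scrape both pieces of text and combine them into a single list element.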

Sample HTML (yields one list element) - works correctly:

<DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">The board of Hillshire Brands has withdrawn its recommendation to acquire frozen foods maker Pinnacle Foods, clearing the way for Tyson Foods' $8.55bn takeover bid.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Last Monday Tyson won the bidding war for Hillshire, maker of Ball Park hot dogs, with a $63-a-share offer, topping rival poultry processor Pilgrim's Pride's $7.7bn bid.</SPAN></P>

Sample HTML (yields two list elements):

<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>

Python code:

from itertools import groupby
from bs4 import BeautifulSoup
from lxml import html

soup = BeautifulSoup(response, 'html.parser')
tree = html.fromstring(response)
# One joined string per non-empty <div class="c5">, each wrapped in its own list
values = [[''.join(text for text in div.xpath('.//p[@class="c9"]//span[@class="c2"]//text()'))] for div in tree.xpath('//div[@class="c5"]') if div.getchildren()]
split_at = ','
textvalues = [list(g) for k, g in groupby(values, lambda x: x != split_at) if k]
list2 = [x for x in textvalues[0] if x]

def purify(list2):
    # Recursively drop empty lists and empty strings
    for (i, sl) in enumerate(list2):
        if type(sl) == list:
            list2[i] = purify(sl)
    return [i for i in list2 if i != [] and i != '']

list3 = purify(list2)
flattened = [val for sublist in list3 for val in sublist]

Current output:

["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi","--Remaining text--"]

Expected sample output:

["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi --Remaining text--"]

Please help me solve this problem.
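
The gap between the current and expected output comes down to one joining step: the code above builds one string per `<DIV CLASS="c5">`, so two divs produce two list elements. A minimal sketch of collapsing them into one element, assuming the per-div strings have already been extracted (`per_div` and `merge_parts` are hypothetical names, not part of the original code):

```python
def merge_parts(per_div, sep=' '):
    """Collapse per-div strings into a one-element list, skipping empty parts."""
    # strip() also removes non-breaking spaces at the edges of each part
    parts = [s.strip() for s in per_div if s and s.strip()]
    return [sep.join(parts)] if parts else []


per_div = [
    "M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi",
    "--Remaining text--",
]
merged = merge_parts(per_div)
print(merged)
```

Joining with a single space matches the expected sample output; a different separator can be passed through `sep` if needed.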

2 answers:

Answer 0: (score: 3)

Something like this?

from bs4 import BeautifulSoup
a="""
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
"""
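# Pass the parser explicitly to avoid bs4's "no parser specified" warning
l = BeautifulSoup(a, 'html.parser').text.split('\n')
# l[0] is the empty string before the first newline, so skip it
b = [' '.join(l[1:])]
print(b)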
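
Splitting the whole document text on newlines also picks up text outside the `c9` paragraphs. As a possible variation, here is a sketch that selects only the `c2` spans inside `c9` paragraphs with a CSS selector and joins them with spaces. The function name `body_text` and the shortened sample sentences are illustrative assumptions, not part of the answer:

```python
from bs4 import BeautifulSoup


def body_text(markup):
    """Return a one-element list: text of all span.c2 inside p.c9, space-joined."""
    soup = BeautifulSoup(markup, 'html.parser')
    parts = []
    for span in soup.select('p.c9 span.c2'):
        # Replace non-breaking spaces with plain spaces, then trim
        text = span.get_text().replace('\xa0', ' ').strip()
        if text:
            parts.append(text)
    return [' '.join(parts)]


# Same structure as the question's HTML, with shortened placeholder sentences
sample = '''<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">First paragraph.</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Second paragraph.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>'''
result = body_text(sample)
print(result)
```

Because the selector requires a `p.c9` ancestor, the `HIGHLIGHT:` line in the `c6` paragraph is left out, and the span that contains only `&nbsp;` is dropped after trimming.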

Answer 1: (score: 0)

text = '''<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&amp;A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>'''

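from lxml import etree

# Parse the fragment; lxml's HTML parser lowercases tag and attribute names
html = etree.HTML(text)

# Direct text of every <span class="c2"> whose parent has class "c9"
res = html.xpath('//span[@class="c2" and ../@class="c9"]/text()')

# Join all fragments into a single-element list
print([''.join(res)])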

Output:

 ["M&A simmers as producers swallow up brands to win shelf space, writes Neil MunhsiPickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0"]
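
In the output above, the sentences run together ("MunhsiPickles") and a trailing `\xa0` remains, because the fragments are joined with an empty string. A hedged sketch of one way to address both, assuming a single space is the desired separator; `joined_text` and the shortened sample sentences are hypothetical, not from the answer:

```python
from lxml import etree


def joined_text(markup, sep=' '):
    """Join the text of span.c2 nodes under p.c9 into a one-element list."""
    root = etree.HTML(markup)
    fragments = root.xpath('//p[@class="c9"]/span[@class="c2"]/text()')
    # Replace non-breaking spaces, trim, and drop fragments that end up empty
    cleaned = [f.replace('\xa0', ' ').strip() for f in fragments]
    return [sep.join(f for f in cleaned if f)]


# Same structure as the question's HTML, with shortened placeholder sentences
sample = '''<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2">&nbsp;News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">First paragraph.</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Second paragraph.</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>'''
joined = joined_text(sample)
print(joined)
```

The XPath here is equivalent in intent to the one in the answer: it keeps only `c2` spans that are direct children of a `c9` paragraph.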