Iterating over a list containing text to remove certain values

Asked: 2017-11-11 02:46:53

Tags: python pandas numpy beautifulsoup

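I'm having trouble cleaning up some data I obtained by scraping Google News headlines.

I'd like to build a clean DataFrame from a list I scraped from Google News headlines using the BeautifulSoup library.

My list looks like this, and I call it "dates":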

[<div class="slp"><span class="f">ESPN - 13 hours ago</span></div>, <div class="slp"><span class="f">ABS-CBN News - 13 hours ago</span></div>, <div class="slp"><span class="f">New York Times - 14 hours ago</span></div>, <div class="slp"><span class="f">MinnPost - 1 day ago</span></div>, <div class="slp"><span class="f">New York Times - 2 days ago</span></div>, <div class="slp"><span class="f">NME.com - 1 day ago</span></div>, <div class="slp"><span class="f">Wichita Eagle - 1 day ago</span></div>, <div class="slp"><span class="f">Jalopnik - 1 day ago</span></div>]

Is there a way to iterate over this list and remove the div tags? I want to keep only the "newspaper - date" text for each value in the list.

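I tried using BeautifulSoup's functions to do this, without much success. I also tried turning my list into a pandas DataFrame and using something like      df = df.replace('', "")

and writing loops and so on, but none of it worked.

Thanks for reading.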

2 answers:

Answer 0: (score: 1)

Try using BeautifulSoup to find the span element and then get its text, like this:

import bs4
date_lst = ["""<div class="slp"><span class="f">ESPN - 13 hours ago</span></div>""", 
            """<div class="slp"><span class="f">ABS-CBN News - 13 hours ago</span></div>""",
            """<div class="slp"><span class="f">New York Times - 14 hours ago</span></div>""", 
            """<div class="slp"><span class="f">MinnPost - 1 day ago</span></div>""", 
            """<div class="slp"><span class="f">New York Times - 2 days ago</span></div>""",
            """<div class="slp"><span class="f">NME.com - 1 day ago</span></div>""",
            """<div class="slp"><span class="f">Wichita Eagle - 1 day ago</span></div>""", 
            """<div class="slp"><span class="f">Jalopnik - 1 day ago</span></div>"""]
date_result  = []
for d in date_lst:
    soup = bs4.BeautifulSoup(d, "html.parser")
    date_result.append(soup.find('span').text)
print(date_result)

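Update: per your update, dates already contains the <div class="slp"> elements, so you can loop over it directly, find the span in each element, and get its text: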

date_result  = []
for d in dates:
    date_result.append(d.find('span').text)

date_result will be:

[u'ESPN - 13 hours ago',
 u'ABS-CBN News - 13 hours ago',
 u'New York Times - 14 hours ago',
 u'MinnPost - 1 day ago',
 u'New York Times - 2 days ago',
 u'NME.com - 1 day ago',
 u'Wichita Eagle - 1 day ago',
 u'Jalopnik - 1 day ago']
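
As a hedged variation on the loop above, the sketch below skips any div that has no span instead of raising an AttributeError. The helper name span_texts and the empty third div are hypothetical, added only for illustration; the dates list is rebuilt here from a short HTML string so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Small stand-in for the scraped page; the last div is a hypothetical
# entry without a span, included to show the guard below.
html = (
    '<div class="slp"><span class="f">ESPN - 13 hours ago</span></div>'
    '<div class="slp"><span class="f">MinnPost - 1 day ago</span></div>'
    '<div class="slp"></div>'
)
soup = BeautifulSoup(html, "html.parser")
dates = soup.find_all("div", class_="slp")


def span_texts(tags):
    """Return the stripped text of the first span in each tag, skipping tags without one."""
    texts = []
    for tag in tags:
        span = tag.find("span")
        if span is not None:  # find() returns None when nothing matches
            texts.append(span.get_text(strip=True))
    return texts


date_result = span_texts(dates)
print(date_result)
```

On Python 3 the strings print without the u prefix shown in the output above.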

Answer 1: (score: 1)

Alternatively, you can do it like this:

from bs4 import BeautifulSoup

html_content="""
<div class="slp"><span class="f">ESPN - 13 hours ago</span></div> 
<div class="slp"><span class="f">ABS-CBN News - 13 hours ago</span></div>
<div class="slp"><span class="f">New York Times - 14 hours ago</span></div>
<div class="slp"><span class="f">MinnPost - 1 day ago</span></div>
<div class="slp"><span class="f">New York Times - 2 days ago</span></div>
<div class="slp"><span class="f">NME.com - 1 day ago</span></div>
<div class="slp"><span class="f">Wichita Eagle - 1 day ago</span></div>
<div class="slp"><span class="f">Jalopnik - 1 day ago</span></div>
"""
soup = BeautifulSoup(html_content, "lxml")
for item in soup.select(".slp .f"):
    print(item.text)

Result:

ESPN - 13 hours ago
ABS-CBN News - 13 hours ago
New York Times - 14 hours ago
MinnPost - 1 day ago
New York Times - 2 days ago
NME.com - 1 day ago
Wichita Eagle - 1 day ago
Jalopnik - 1 day ago
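
Since the question asks for a clean DataFrame, one possible next step is to split each extracted string into its source and its age. The sketch below is only an illustration: it assumes every entry uses " - " as the separator between the two parts, and the function name split_source_and_age and the column names source and age are hypothetical.

```python
def split_source_and_age(text):
    """Split 'Source - N hours ago' on the last ' - ' into a dict with two fields."""
    source, sep, age = text.rpartition(" - ")
    if not sep:
        # No separator found: keep the whole string as the source.
        return {"source": text.strip(), "age": None}
    return {"source": source.strip(), "age": age.strip()}


texts = [
    "ESPN - 13 hours ago",
    "ABS-CBN News - 13 hours ago",
    "New York Times - 2 days ago",
]
rows = [split_source_and_age(t) for t in texts]
for row in rows:
    print(row["source"], "|", row["age"])
```

Splitting on the last " - " (with surrounding spaces) keeps a hyphenated name such as ABS-CBN News intact. The resulting list of dicts can then be passed to pandas.DataFrame(rows) to get one column per field.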