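I ran into a problem cleaning up some data that I got by scraping Google News headlines.
I want to build a clean data frame from a list of Google News headlines that I scraped with the beautifulsoup library.
My list looks like this; I call it `dates`: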
[<div class="slp"><span class="f">ESPN - 13 hours ago</span></div>, <div class="slp"><span class="f">ABS-CBN News - 13 hours ago</span></div>, <div class="slp"><span class="f">New York Times - 14 hours ago</span></div>, <div class="slp"><span class="f">MinnPost - 1 day ago</span></div>, <div class="slp"><span class="f">New York Times - 2 days ago</span></div>, <div class="slp"><span class="f">NME.com - 1 day ago</span></div>, <div class="slp"><span class="f">Wichita Eagle - 1 day ago</span></div>, <div class="slp"><span class="f">Jalopnik - 1 day ago</span></div>]
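Is there a way to iterate over this list and strip the div tags? I only want to keep the "newspaper - date" text of each value in the list.
I tried to do this with beautifulsoup's functions without much success. I also tried turning my list into a pandas data frame and using something like df = df.replace('', "")
and writing loops and so on, but none of that worked.
Thanks for reading.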
Answer 0 (score: 1)
Try using BeautifulSoup to find the `span` element and then read its `text`, like this:
import bs4
date_lst = ["""<div class="slp"><span class="f">ESPN - 13 hours ago</span></div>""",
"""<div class="slp"><span class="f">ABS-CBN News - 13 hours ago</span></div>""",
"""<div class="slp"><span class="f">New York Times - 14 hours ago</span></div>""",
"""<div class="slp"><span class="f">MinnPost - 1 day ago</span></div>""",
"""<div class="slp"><span class="f">New York Times - 2 days ago</span></div>""",
"""<div class="slp"><span class="f">NME.com - 1 day ago</span></div>""",
"""<div class="slp"><span class="f">Wichita Eagle - 1 day ago</span></div>""",
"""<div class="slp"><span class="f">Jalopnik - 1 day ago</span></div>"""]
date_result = []
for d in date_lst:
    # Parse each HTML string and keep only the text inside its <span>
    soup = bs4.BeautifulSoup(d, "html.parser")
    date_result.append(soup.find('span').text)
print(date_result)
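Update
Based on your update, `dates` already contains the parsed `<div class="slp">` elements, so you can loop over it directly, find the `span` in each element, and read its `text`.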
date_result = []
for d in dates:
    # Each d is already a parsed tag, so no re-parsing is needed
    date_result.append(d.find('span').text)
date_result will be:
[u'ESPN - 13 hours ago',
u'ABS-CBN News - 13 hours ago',
u'New York Times - 14 hours ago',
u'MinnPost - 1 day ago',
u'New York Times - 2 days ago',
u'NME.com - 1 day ago',
u'Wichita Eagle - 1 day ago',
u'Jalopnik - 1 day ago']
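Since the goal was a data frame with the newspaper and the date, the strings above can be split into two fields. This is only a minimal sketch using the standard library: it assumes every entry has the form "source - age" with " - " as the separator, and `split_entry` is a hypothetical helper name, not something BeautifulSoup provides. An entry without " - " would raise a ValueError here.

```python
# Sketch: split "source - age" strings into two fields (standard library only).
# Assumption: each entry contains " - " between the source and the relative time.

date_result = [
    "ESPN - 13 hours ago",
    "ABS-CBN News - 13 hours ago",
    "New York Times - 14 hours ago",
    "MinnPost - 1 day ago",
]


def split_entry(entry):
    """Hypothetical helper: return (source, age) for one scraped string."""
    # Split on " - " (with spaces) so the hyphen inside "ABS-CBN" is left alone.
    source, age = entry.rsplit(" - ", 1)
    return source.strip(), age.strip()


# One dict per entry, with the two fields as keys.
rows = [dict(zip(("source", "age"), split_entry(e))) for e in date_result]

for row in rows:
    print(row["source"], "|", row["age"])
```

`rows` is a list of dicts, so if pandas is installed it can be passed directly to `pandas.DataFrame(rows)` to get one column per field.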
Answer 1 (score: 1)
Alternatively, you could do it like this:
from bs4 import BeautifulSoup
html_content="""
<div class="slp"><span class="f">ESPN - 13 hours ago</span></div>
<div class="slp"><span class="f">ABS-CBN News - 13 hours ago</span></div>
<div class="slp"><span class="f">New York Times - 14 hours ago</span></div>
<div class="slp"><span class="f">MinnPost - 1 day ago</span></div>
<div class="slp"><span class="f">New York Times - 2 days ago</span></div>
<div class="slp"><span class="f">NME.com - 1 day ago</span></div>
<div class="slp"><span class="f">Wichita Eagle - 1 day ago</span></div>
<div class="slp"><span class="f">Jalopnik - 1 day ago</span></div>
"""
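# "lxml" needs the lxml package installed; "html.parser" also works here
soup = BeautifulSoup(html_content, "lxml")
# Select every element with class "f" nested inside an element with class "slp"
for item in soup.select(".slp .f"):
    print(item.text)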
Result:
ESPN - 13 hours ago
ABS-CBN News - 13 hours ago
New York Times - 14 hours ago
MinnPost - 1 day ago
New York Times - 2 days ago
NME.com - 1 day ago
Wichita Eagle - 1 day ago
Jalopnik - 1 day ago