Using Python and BeautifulSoup, I am trying to extract the date of each newspaper article from a Google search results page: https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1
Here is my code:
from bs4 import BeautifulSoup
import requests
article_link = "https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1"
page = requests.get(article_link)
soup = BeautifulSoup(page.content, 'html.parser')
for links in soup.find_all('div', {'class':'slp'}):
    date = links.get_text()
    print(date)
The page source is as follows:
The output is " PE Hub (blog) - 1 day ago".
Can I extract just the date part (2018. 5. 11)?
Answer 0 (score: 0)
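I'm not sure why BeautifulSoup pulls the text in this form, but you can clean up what you scraped with a regular expression and datetime. If the text is a relative date such as "1 day ago", strip it down to the number and subtract a timedelta from today. Otherwise, provided the date is formatted correctly, convert it with strptime.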
from bs4 import BeautifulSoup
import requests
hold = []
article_link = "https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1"
page = requests.get(article_link)
soup = BeautifulSoup(page.content, 'html.parser')
for links in soup.find_all('div', {'class':'slp'}):
    date = links.get_text()
    hold.append(date)  # collect each "source - date" string in a list
---------
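# Convert the scraped strings to datetime values
import re
from datetime import datetime as dt, timedelta
hold2 = []
for item in hold:
    # Drop the leading "source - " part, keeping only the date text
    item = re.sub('^.+ - ', '', item)
    if 'ago' in item:
        # Relative date such as "3 days ago": subtract that many days from today
        item = re.sub(' days? ago$', '', item)
        hold2.append(dt.today() - timedelta(int(item)))
    else:
        # Absolute date such as "Apr 24, 2018"
        item = dt.strptime(item, '%b %d, %Y')
        hold2.append(item)
hold2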
[datetime.datetime(2018, 5, 12, 14, 37, 39, 653618),
datetime.datetime(2018, 5, 8, 14, 37, 39, 653636),
datetime.datetime(2018, 5, 11, 14, 37, 39, 653643),
datetime.datetime(2018, 5, 12, 14, 37, 39, 653649),
datetime.datetime(2018, 5, 8, 14, 37, 39, 653655),
datetime.datetime(2018, 5, 12, 14, 37, 39, 653661),
datetime.datetime(2018, 5, 12, 14, 37, 39, 653667),
datetime.datetime(2018, 4, 24, 0, 0),
datetime.datetime(2018, 5, 8, 14, 37, 39, 653716),
datetime.datetime(2018, 4, 25, 0, 0)]
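
Note that the loop above only strips "day(s) ago", so a string such as "3 hours ago" would fail at int(). As a rough sketch of the same regex-plus-timedelta idea, the helper below also accepts minutes, hours and weeks, and prints the result in the "2018. 5. 11" style the question asks for. The function names parse_news_date and format_date are hypothetical, the list of units is an assumption about what Google may return, and the '%b %d, %Y' pattern assumes English month abbreviations in the scraped text and an English locale.

```python
import re
from datetime import datetime, timedelta

# Maps the unit word in a relative timestamp to a timedelta keyword.
# Assumption: these are the only units that show up in the scraped text.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}


def parse_news_date(text, now=None):
    """Turn 'Source - 1 day ago' or 'Source - Apr 24, 2018' into a date."""
    now = now or datetime.now()
    # Keep only what follows the last ' - ' separator.
    stamp = text.rsplit(" - ", 1)[-1].strip()
    match = re.fullmatch(r"(\d+) (minute|hour|day|week)s? ago", stamp)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return (now - timedelta(**{_UNITS[unit]: amount})).date()
    # Absolute date; raises ValueError if the format is different.
    return datetime.strptime(stamp, "%b %d, %Y").date()


def format_date(d):
    """Format a date as 'YYYY. M. D' without zero padding."""
    return f"{d.year}. {d.month}. {d.day}"


# Fixed reference time so the example is reproducible.
reference = datetime(2018, 5, 12, 14, 37)
print(format_date(parse_news_date(" PE Hub (blog) - 1 day ago", reference)))
# prints: 2018. 5. 11
```

Passing now explicitly keeps the result stable when testing. In real use, leave it out so the current time is used.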