Web scraping and extracting dates

Time: 2018-05-13 18:19:27

Tags: python

Using Python and BeautifulSoup, I am trying to extract the date of each newspaper article from a Google search results page: https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1

Here is my code:

from bs4 import BeautifulSoup
import requests

article_link = "https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1"

page = requests.get(article_link)    
soup = BeautifulSoup(page.content, 'html.parser')

for links in soup.find_all('div', {'class':'slp'}):
    date = links.get_text()
    print(date)

The page source looks like this:

[image: screenshot of the page source]

The output is "PE Hub (blog) - 1 day ago".

Can I extract just the date part (2018. 5. 11)?

1 answer:

Answer 0 (score: 0)

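I'm not sure why BeautifulSoup pulls the text in this form, but you can clean up what you pulled with a regular expression and datetime. First strip the leading "source - " prefix. If what remains is a relative label such as "1 day ago", subtract that many days from today with timedelta; otherwise it is already an absolute date in the right format, so convert it with strptime.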

from bs4 import BeautifulSoup
import requests

hold = []
article_link = "https://www.google.com/search?q=citi+group&tbm=nws&ei=u9_1WsetC67l5gKRt7qYBA&start=0&sa=N&biw=1600&bih=794&dpr=1"

page = requests.get(article_link)
soup = BeautifulSoup(page.content, 'html.parser')

for links in soup.find_all('div', {'class':'slp'}):
    date = links.get_text()
    hold.append(date)  # collect each raw label in a list

---------

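# converting to datetime values
import re
from datetime import datetime as dt, timedelta  # timedelta is needed for relative dates
hold2 = []
for item in hold:
    item = re.sub('^.+ - ', '', item)  # drop the "source - " prefix
    if 'ago' in item:
        # relative label such as "1 day ago": subtract that many days from today
        item = re.sub(' days? ago$', '', item)
        hold2.append(dt.today() - timedelta(int(item)))
    else:
        # absolute label such as "Apr 24, 2018"
        item = dt.strptime(item, '%b %d, %Y')
        hold2.append(item)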

hold2
[datetime.datetime(2018, 5, 12, 14, 37, 39, 653618),
 datetime.datetime(2018, 5, 8, 14, 37, 39, 653636),
 datetime.datetime(2018, 5, 11, 14, 37, 39, 653643),
 datetime.datetime(2018, 5, 12, 14, 37, 39, 653649),
 datetime.datetime(2018, 5, 8, 14, 37, 39, 653655),
 datetime.datetime(2018, 5, 12, 14, 37, 39, 653661),
 datetime.datetime(2018, 5, 12, 14, 37, 39, 653667),
 datetime.datetime(2018, 4, 24, 0, 0),
 datetime.datetime(2018, 5, 8, 14, 37, 39, 653716),
 datetime.datetime(2018, 4, 25, 0, 0)]
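
The values above still carry a time of day, and the loop only understands "N days ago". As a rough sketch of how the same idea could be packaged to return just the date part in the "2018. 5. 11" style the question asks for, something like the following might work. The names parse_slp_date and format_date, the `now` parameter, and the set of supported units (minutes, hours, days, weeks) are assumptions made for illustration, not something Google documents, and the `%b` month parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Hypothetical mapping from the unit word in a label to a timedelta keyword.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}


def parse_slp_date(text, now=None):
    """Return a date for labels like 'PE Hub (blog) - 1 day ago' or 'Apr 24, 2018'."""
    now = now or datetime.now()
    # Keep only what follows the last " - " separator (the source name comes first).
    label = text.rsplit(" - ", 1)[-1].strip()
    match = re.fullmatch(r"(\d+) (minute|hour|day|week)s? ago", label)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return (now - timedelta(**{_UNITS[unit]: amount})).date()
    # Otherwise assume an absolute date in the English "Apr 24, 2018" form.
    return datetime.strptime(label, "%b %d, %Y").date()


def format_date(d):
    """Format a date as 'YYYY. M. D' without zero padding."""
    return f"{d.year}. {d.month}. {d.day}"


# A fixed reference time keeps the example reproducible.
reference = datetime(2018, 5, 12, 14, 37)
print(format_date(parse_slp_date("PE Hub (blog) - 1 day ago", now=reference)))
# prints: 2018. 5. 11
```

Passing `now` explicitly is only there so the result does not depend on when the script runs; in the scraping loop you would call parse_slp_date(links.get_text()) and let it default to the current time.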