Question

I am trying to scrape the headlines from here, for each of the past ten years.

years is a list containing:

/resources/archive/us/2007.html
/resources/archive/us/2008.html
/resources/archive/us/2009.html
/resources/archive/us/2010.html
/resources/archive/us/2011.html
/resources/archive/us/2012.html
/resources/archive/us/2013.html
/resources/archive/us/2014.html
/resources/archive/us/2015.html
/resources/archive/us/2016.html

What my code does is open each year page, collect all the date links, then open each link individually, take all the .text, and add each headline with its corresponding date as a row to the dataframe headlines:

    import re

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    }

    headlines = pd.DataFrame(columns=["date", "headline"])

    for y in years:
        # Open the year page
        yurl = "http://www.reuters.com" + str(y)
        response = requests.get(yurl, headers=HEADERS)
        bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')

        # Collect the day links listed under each month heading
        days = []
        links = bs.findAll('h5')
        for mon in links:
            for day in mon.next_sibling.next_sibling:
                days.append(day)

        days = [e for e in days if str(e) not in ('\n')]
        for ind in days:
            hlday = ind['href']
            # Turn the YYYYMMDD part of the link into MM-DD-YYYY
            date = re.findall(r'(?!\/)[0-9].+(?=\.)', hlday)[0]
            date = date[4:6] + '-' + date[6:] + '-' + date[:4]
            print(date.split('-')[2])
            yurl = "http://www.reuters.com" + str(hlday)
            response = requests.get(yurl, headers=HEADERS)
            if response.status_code == 404 or response.content == b'':
                print('')
            else:
                bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
                lines = bs.findAll('div', {'class': 'headlineMed'})
                for h in lines:
                    headlines = headlines.append([{"date": date, "headline": h.text}], ignore_index=True)

It takes forever to run, so instead of running the full for loop I ran it for just one year, /resources/archive/us/2008.html.

It has been 3 hours and it is still running.

As I am new to Python, I don't understand what I am doing wrong or how I could do this better.

Is it headlines.append that takes forever, because it has to read and write a bigger dataframe every time it runs?

Answer 1

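You are using this anti-pattern: growing the dataframe by reassigning headlines = headlines.append(...) on every iteration of the loop, which copies the whole dataframe each time.

Instead, collect the rows in a plain Python list inside the loop and build the dataframe once, after the loop has finished.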
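A minimal sketch of the list-based approach, assuming each scraped row is a (date, headline) pair. The function name build_headlines and the sample data are made up for illustration and are not taken from the question.

```python
import pandas as pd


def build_headlines(pairs):
    """Collect rows in a plain list, then build the DataFrame once."""
    rows = []
    for date, headline in pairs:
        # list.append does not copy the existing rows, unlike growing
        # a DataFrame with headlines = headlines.append(...) in the loop
        rows.append({"date": date, "headline": headline})
    # Single DataFrame construction after the loop
    return pd.DataFrame(rows, columns=["date", "headline"])


# Hypothetical sample data standing in for the scraped headlines
sample = [
    ("01-01-2008", "Example headline A"),
    ("01-02-2008", "Example headline B"),
]
headlines = build_headlines(sample)
print(headlines)
```

In the question's code, the innermost loop would then call rows.append({"date": date, "headline": h.text}) and the DataFrame would be created after the outer loop ends. Note also that DataFrame.append was removed in pandas 2.0, so the list-based form is likely the safer choice on current versions.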

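A second potential problem is that you are making about 3,650 web requests. If I were running a site like this, I would build in throttling to slow down scrapers like yours. You may find it better to collect the raw data once, store it on disk, and then process it in a second pass. That way you don't pay the cost of 3,650 web requests every time you need to debug your program.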
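One way the two-pass idea could look, as a rough sketch only. The helper names (fetch_url, download_once, iter_cached), the cache file naming, and the one-second delay are all assumptions made for illustration. The default fetcher uses the standard library, but requests.get could be passed in instead.

```python
import time
import urllib.request
from pathlib import Path


def fetch_url(url):
    """Default fetcher (standard library); any callable url -> bytes works."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def download_once(paths, cache_dir, base="http://www.reuters.com",
                  fetch=fetch_url, delay=1.0):
    """Pass 1: save each page to disk, skipping pages already cached."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for path in paths:
        # e.g. /resources/archive/us/2008.html -> resources_archive_us_2008.html
        target = cache_dir / path.strip("/").replace("/", "_")
        if target.exists():
            continue  # downloaded on an earlier run, no request needed
        target.write_bytes(fetch(base + path))
        time.sleep(delay)  # pause between requests; the value is arbitrary


def iter_cached(cache_dir):
    """Pass 2: yield (file name, raw bytes) from disk, with no network access."""
    for f in sorted(Path(cache_dir).glob("*.html")):
        yield f.name, f.read_bytes()
```

With something like this, download_once(years, "cache") would be run once, and the BeautifulSoup parsing could then loop over iter_cached("cache") as many times as debugging requires without touching the network.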

Python - Improving code speed: Pandas.append
