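The code below is a Google Trends crawler that uses the unofficial API from https://github.com/GeneralMills/pytrends. It works, but nobody knows the exact request limit for Google Trends. If I run the crawler with a "DNA" list of 2,000 or more keywords, I get an error saying I have exceeded the request limit. When that happens, everything crawled before the limit is lost, because I only write the CSV at the end of the code. Is there a way to write the data to the CSV on every loop iteration, so that even if I hit the limit I at least keep the data collected up to that point? Thanks.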
from pytrends.request import TrendReq
from datetime import datetime
import pandas as pd
import time
import xlsxwriter

# hl and tz are separate arguments
pytrends = TrendReq(hl='en-US', tz=360)
Data = pd.DataFrame()

# Path of the progress log, used to check how far the loop got
path = "C:/Users/aijhshin/Workk/GoogleTrendCounter.txt"
#file = open(path,"a")

# Set the Date column using the 'apple' keyword
kw_list = ['apple']
pytrends.build_payload(kw_list, cat=0, timeframe='today 5-y', geo='', gprop='')
Googledate = pd.DataFrame(pytrends.interest_over_time())
Data['Date'] = Googledate.index

# Google Trend Crawler limit = 1600 requests per day
# DNA is the list of keywords, defined elsewhere (not shown)
for i in range(len(DNA)):
    kw_list = [DNA[i]]
    pytrends.build_payload(kw_list, cat=0, timeframe='today 5-y', geo='', gprop='')

    # Results
    df = pd.DataFrame(pytrends.interest_over_time())
    if df.empty:
        Data[DNA[i]] = ""
    else:
        df.index.name = 'Date'
        df.reset_index(inplace=True)
        Data[DNA[i]] = df.loc[:, DNA[i]]

    # Log loop progress
    with open(path, "a") as file:
        file.write(str(i) + " " + str(datetime.now()) + " ")
        file.write(DNA[i] + '\n')

    # Run once every nine seconds (optional)
    #time.sleep(9)

# Write the CSV file (overwrites any existing file)
Data.to_csv('Google Trend.csv')
print("Crawling Done")
Answer 0 (score: 2)
Move Data.to_csv('Google Trend.csv') inside the loop, right after time.sleep(9), and change it to mode a:
    time.sleep(9)
    Data.to_csv('Google Trend.csv', mode='a')
Mode a appends to the end of the CSV file instead of overwriting it.
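One caveat with appending inside the loop: Data holds every column collected so far, so each Data.to_csv(..., mode='a') call writes the whole frame (header included) to the file again. The following is a minimal sketch of one way around that, not the answer author's method. It writes one row per keyword and date as soon as the rows arrive, and stops cleanly on an error. The names crawl_to_csv and fetch are hypothetical; fetch stands in for the build_payload / interest_over_time calls and is assumed to return a list of (date, value) pairs.

```python
import csv
import os


def crawl_to_csv(keywords, fetch, csv_path):
    """Append rows for each keyword to csv_path as soon as they are fetched.

    fetch is a hypothetical callable: keyword -> list of (date, value) pairs.
    Returns the number of keywords written before stopping.
    """
    # Write the header only when the file is new or empty.
    write_header = (not os.path.exists(csv_path)
                    or os.path.getsize(csv_path) == 0)
    done = 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["keyword", "date", "value"])
        for kw in keywords:
            try:
                rows = fetch(kw)
            except Exception as exc:  # e.g. a request-limit error
                print(f"Stopped at {kw!r}: {exc}")
                break
            for date, value in rows:
                writer.writerow([kw, date, value])
            # Push this keyword's rows to disk before the next request.
            f.flush()
            done += 1
    return done
```

Because the file is in long format (keyword, date, value), rows from earlier keywords are never rewritten, and a later run can append to the same file. If you prefer to stay in pandas, passing header=False to to_csv on every write after the first avoids repeating the header line.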