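I'm looking for advice on how to improve the performance of my page scraper. At the moment, scraping the pages takes about 24 minutes, and running the SQL insert query takes about 56 minutes.
I'm hoping someone can suggest ways to speed this up. Any advice would be very helpful.
I have considered:
- Multithreading (but handling the SQL cursor gets a bit tricky, and the data gets a bit mixed up)
- Splitting the page scraping and the inserting into separate processes (I'm not sure exactly how to do this, and I'm also worried it would still mix up the data)
- Improving the SQL query somehow? I'm not sure how to go about it.
- Hosting the SQL server locally; maybe that would speed up the inserts?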
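To make the multithreading option concrete, here is a minimal sketch of one way it could look, assuming only the HTTP requests are run in threads while all database work stays in the main thread. The names `fetch_all_pages` and `fake_fetch` are hypothetical; `fake_fetch` merely stands in for a real per-page request so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_pages(fetch_page, page_numbers, max_workers=8):
    """Call fetch_page for every page concurrently and return the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as page_numbers,
        # regardless of which request finishes first.
        return list(pool.map(fetch_page, page_numbers))


def fake_fetch(page):
    # Hypothetical stand-in for a real request; returns two fake product rows.
    return [(f"code-{page}-{i}", f"title-{page}-{i}") for i in range(2)]


pages = fetch_all_pages(fake_fetch, range(1, 4))
# Flatten the per-page lists into one list of rows for a single bulk insert.
rows = [row for page_rows in pages for row in page_rows]
```

Because the threads only return data and never touch the cursor, the rows stay grouped by page and the cursor is not shared between threads. Whether this actually helps depends on how many concurrent requests the site tolerates.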
import datetime
import math

import requests


def processData(response, i):
    # Parse the JSON body once per call instead of once per field.
    item = response.json()['Value']['ProductList']['ProductListItems'][i]
    contentList = [item['Code'],
                   item['Title'],
                   item['ImageUrl'],
                   item['Rrp'],
                   item['Sp'],
                   item['Url'],
                   item['Category'],
                   datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")]
    return contentList

def search(page):
    params = (
        ('Sort', 'Default'),
        ('page', page),
        ('category', '1073741882'),
    )
    response = requests.get('https://www.somewebsite.com/search/results', headers=headers, params=params)
    print('Now searching Page: ', page)
    return response

headers = {
    'Referer': 'https://www.somewebsite.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
}
params = (
    ('Sort', 'Default'),
    ('page', '1'),
    ('category', '1073741882'),
)
response = requests.get('https://www.somewebsite.com/search/results', headers=headers, params=params)
#how many times we gotta send requests...
results = response.json()['Value']['ProductList']['ProductListItemCount']
#now calculate how many pages we need to load, with 24 results per page
numberOfPages = math.ceil(results/24)
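val = []
k = 1
# <= so that the last page is fetched as well
while k <= numberOfPages:
    response = search(k)
    # up to 24 results per page; the last page may hold fewer
    itemCount = len(response.json()['Value']['ProductList']['ProductListItems'])
    i = 0
    while i < itemCount:
        val.append(processData(response, i))
        i += 1
    k += 1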
sql = "INSERT IGNORE INTO products (stockNumber, title, imageURL, priceWas, priceNow, pageURL, category, lastUpdated) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
cursor.executemany(sql, val)
connection.commit()
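
One thing that might be worth trying on the insert side is sending the rows in fixed-size chunks and committing once at the end. The sketch below is only an illustration under stated assumptions: it uses an in-memory `sqlite3` database as a stand-in for the real server so that it runs anywhere, which is why the placeholders are `?` and the statement is `INSERT OR IGNORE` rather than `INSERT IGNORE` with `%s`. The helper `insert_in_chunks`, the chunk size, and the two-column table are hypothetical.

```python
import sqlite3


def insert_in_chunks(connection, sql, rows, chunk_size=1000):
    """Run executemany over fixed-size chunks of rows and commit once at the end."""
    cursor = connection.cursor()
    for start in range(0, len(rows), chunk_size):
        cursor.executemany(sql, rows[start:start + chunk_size])
    connection.commit()


# sqlite3 is only a stand-in for the real database server in this sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE products (stockNumber TEXT PRIMARY KEY, title TEXT)")
sql = "INSERT OR IGNORE INTO products (stockNumber, title) VALUES (?, ?)"

rows = [(f"code-{n}", f"title-{n}") for n in range(2500)]
rows.append(("code-0", "duplicate"))  # skipped because the primary key already exists
insert_in_chunks(connection, sql, rows, chunk_size=1000)

count = connection.execute("SELECT COUNT(*) FROM products").fetchone()[0]
first_title = connection.execute(
    "SELECT title FROM products WHERE stockNumber = ?", ("code-0",)
).fetchone()[0]
```

Whether chunking changes the timing at all depends on the driver and on the network latency to the server, so it would need to be measured against the single `executemany` call above.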