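I am trying to scrape the e-commerce websites in my database in order to segment users based on how often certain keywords appear.
I am using Google Colab and the pandas library together with the requests library.
However, it is too slow: scraping 100 websites takes 293 seconds.
Is there a way to improve this?
Here is my code: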
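import re
import timeit

import pandas as pd
import requests

# Account, df, shopping_stores_df and the regex patterns key1..key4 are defined earlier (not shown).
start = timeit.default_timer()
for url in Account["url"][:100]:
    try:
        url = "https://" + url
        page = requests.get(url)
        contents = page.content  # raw bytes of the response body
        # Keep only sites where key4 never appears and key3 appears at least once.
        if len(re.findall(key4, contents)) < 1 and len(re.findall(key3, contents)) > 0:
            if len(re.findall(key1, contents)) > 50 or len(re.findall(key2, contents)) > 50:
                products_found = len(re.findall(key1, contents))
                collection_found = len(re.findall(key2, contents))
                # DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate.
                row = pd.DataFrame([{'url': url, 'products': products_found, 'collections': collection_found}])
                shopping_stores_df = pd.concat([shopping_stores_df, row], ignore_index=True)
                # Copy the account details for this site; url[8:] strips the "https://" prefix.
                shopping_stores_df.loc[shopping_stores_df['url'] == url, ['ranking', 'people', 'emails', 'tel']] = df.loc[df['Location on Site'] == url[8:], ['Alexa', 'People', 'Emails', 'Telephones']].values
    except Exception:
        pass  # skip sites that fail to download or process
stop = timeit.default_timer()
print('Execution time:', stop - start)  # elapsed seconds (the original start-stop was negative)
shopping_stores_df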
Thanks!
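
For what it's worth, 293 seconds for 100 sites is about 3 seconds per site, and most of that is probably spent waiting on the network rather than in the regex calls. One direction that might help is downloading the pages in parallel threads and only then counting keywords. Below is a minimal sketch of that idea, not a tested solution for this dataset. The names `fetch_all` and `count_matches` and the default of 20 workers are hypothetical choices for illustration, and `fetch` stands for whatever download function is used.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=20):
    """Call fetch(url) for every URL in parallel threads.

    Returns a dict mapping each URL to its result, in input order.
    A URL whose fetch raises an exception maps to None.
    fetch_all and max_workers=20 are illustrative, not tuned values.
    """
    urls = list(urls)  # allow a pandas Series or any other iterable

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # treat a failed download as "no page"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return dict(zip(urls, pool.map(safe_fetch, urls)))


def count_matches(patterns, text):
    """Count how often each named regex pattern occurs in text."""
    return {name: len(re.findall(pattern, text)) for name, pattern in patterns.items()}
```

With requests, this could be called as `pages = fetch_all("https://" + Account["url"][:100], lambda u: requests.get(u, timeout=10).content)`, where the `timeout` keeps a single unresponsive site from stalling a worker. The keyword checks would then run over `pages.items()`, skipping entries that are `None`. Collecting the matching rows in a list and building the DataFrame once at the end would also avoid growing it row by row.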