How can I speed up my web scraping?

Time: 2020-07-16 12:16:42

Tags: python web-scraping threadpool concurrent.futures

I am scraping company data from the web. My code is too slow, so I am trying concurrent.futures, but I don't know what I am missing. Here is the input text file, 3_1.txt:

AQ VENTURA PVT. LTD.
AQLU LEARNING PVT LTD
Aqquarate Solutions
Aqua Centric Pvt Ltd
AQUA EASY INFO TECH
Aqua Filmtec Pvt ltd
aqua sms
Aqua Soft Water Systems
AQUA SPA

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import concurrent.futures

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0'}

# one company name per line; skip blank lines
with open('3_1.txt', 'r') as f_in:
    companies = [line.strip() for line in f_in if line.strip()]

all_data = []  # rows appended by the worker threads
threads = 2    # number of worker threads in the pool

# sanity check: print the company names that were read in
for company in companies:
    print(company)


def data(company):
    # fetch the Google search results page for one company and pull the
    # address and phone number out of the knowledge panel, if present
    soup = BeautifulSoup(
        requests.get('https://google.com/search', params={'q': company, 'hl': 'en'}, headers=headers).content,
        'html.parser')
    address = soup.select_one('.LrzXr')
    if address:
        address = address.text
    else:
        address = 'Not Found'
    phone = soup.select_one('.LrzXr.zdqRlf.kno-fv')
    if phone:
        phone = phone.text
    else:
        phone = 'Not Found'

    all_data.append({"Company": company, "Address": address, "Phone": phone})


with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
    executor.map(data, companies)  # submit one task per company name

df = pd.DataFrame(all_data)
df.to_csv('Companydata.csv')

I do get output, but it is not what I expect. Please help me. Thanks.
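Below is a minimal sketch of one way to restructure this for speed, assuming the same Google selectors (.LrzXr and .LrzXr.zdqRlf.kno-fv) still match: each worker returns its row instead of appending to a shared list, and each thread reuses its own requests.Session so repeated requests can keep TCP connections alive. The fetch and get_session helpers and the max_workers=8 value are illustrative choices, not part of the original code.

import concurrent.futures
import threading

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0'}

# requests.Session reuses TCP connections between requests, but it is not
# documented as thread-safe, so give each worker thread its own session
thread_local = threading.local()


def get_session():
    # hypothetical helper: lazily create one Session per thread
    if not hasattr(thread_local, 'session'):
        session = requests.Session()
        session.headers.update(headers)
        thread_local.session = session
    return thread_local.session


def fetch(company):
    # hypothetical helper: scrape one company and return the row,
    # so the workers share no mutable state
    soup = BeautifulSoup(
        get_session().get('https://google.com/search',
                          params={'q': company, 'hl': 'en'}).content,
        'html.parser')
    address = soup.select_one('.LrzXr')
    phone = soup.select_one('.LrzXr.zdqRlf.kno-fv')
    return {'Company': company,
            'Address': address.text if address else 'Not Found',
            'Phone': phone.text if phone else 'Not Found'}


with open('3_1.txt') as f_in:
    companies = [line.strip() for line in f_in if line.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map preserves input order and yields each worker's return value
    all_data = list(executor.map(fetch, companies))

pd.DataFrame(all_data).to_csv('Companydata.csv', index=False)

Note that more threads is not automatically faster here: Google rate-limits rapid automated queries, so raising max_workers much higher tends to produce CAPTCHAs rather than extra throughput.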
