Adding a multiprocessing/multithreading approach to a Python web scraper

Time: 2019-03-12 05:21:02

Tags: python web-scraping python-multiprocessing python-multithreading

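I am scraping job results from a website and storing them in a JSON file. Now I want to speed up the retrieval, and after searching online I concluded that multiprocessing or multithreading can do this. I am not familiar with either of them. Here is my code; I would like to add one of these approaches so that fetching the results is simpler and faster. I don't know which one to use. Can someone help with this?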

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

results = []
url = 'https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start={}'

# Reuse one session so the underlying connection is kept alive between requests.
with requests.Session() as s:
    for page in range(5):
        res = s.get(url.format(page))
        soup = bs(res.content, 'lxml')
        # Collect job titles and company names, then pair them up row by row.
        titles = [item.text.strip() for item in soup.select('[data-tn-element=jobTitle]')]
        companies = [item.text.strip() for item in soup.select('.company')]
        results.extend(zip(titles, companies))

# Write all (title, company) rows to a JSON file.
df = pd.DataFrame(results)
df.to_json(r'data3.json')
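
For an I/O-bound loop like the one above, where most of the time is spent waiting on HTTP responses, a thread pool is usually the simpler of the two options. The following is only a minimal sketch, not a tested solution for this site: the helper names `fetch_page` and `scrape_pages` are hypothetical, each thread makes its own `requests.get` call instead of sharing one `Session` (sharing a session across threads is not guaranteed to be safe), and the 10-second timeout is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

URL = 'https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start={}'


def fetch_page(page):
    """Download one results page and return a list of (title, company) tuples."""
    # Imported here so the thread-pool helper below can be tried without these packages.
    import requests
    from bs4 import BeautifulSoup as bs

    res = requests.get(URL.format(page), timeout=10)  # timeout value is an assumption
    soup = bs(res.content, 'lxml')
    titles = [item.text.strip() for item in soup.select('[data-tn-element=jobTitle]')]
    companies = [item.text.strip() for item in soup.select('.company')]
    return list(zip(titles, companies))


def scrape_pages(pages, fetch=fetch_page, max_workers=5):
    """Call fetch(page) for every page in a thread pool and flatten the rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `pages`.
        per_page = list(executor.map(fetch, pages))
    return [row for rows in per_page for row in rows]
```

With this in place, `rows = scrape_pages(range(5))` followed by `pd.DataFrame(rows).to_json('data3.json')` would stand in for the original loop. Multiprocessing would mainly help if parsing the HTML, rather than waiting on the network, turned out to be the bottleneck.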

0 answers:

No answers