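I'm using BeautifulSoup to scrape job postings from the websites of several companies (I have permission to do so). Their HTML structures differ slightly, so I've written a separate scraper for each site. Every scraper produces the same kind of output: the URLs of the job postings.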
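Question
I have the scrapers, and each one works fine on its own, but for efficiency I'd like to run them all together instead of launching each one separately. What is the simplest way to do that?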
Scraper 1
import requests
from bs4 import BeautifulSoup

base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"

req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
links = soup.select("a")

for link in links:
    # Anchors without an href return None, so fall back to an empty string
    href = link.get("href") or ""
    if "career" in href and 'COPENHAGEN' in link.text:
        # Fetch and parse the individual job posting
        res = requests.get(base + href).text
        soup = BeautifulSoup(res, 'html.parser')
        # Guard against a missing title element
        title_tag = soup.select_one("h1.page-intro__title")
        title = title_tag.get_text() if title_tag else ""
        overview = soup.select_one("p.page-intro__longDescription").get_text()
        details = soup.select_one("div.rte").get_text()
        print(title, link, details)
Scraper 2
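import requests
from bs4 import BeautifulSoup

url = "http://deloittedk.easycruit.com/_sp=136ecff9b65625bf.1504382903200&icid=top_"
r = requests.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, 'html.parser')
links = soup.find_all("a")

for link in links:
    # Print each anchor as an HTML link: href plus visible text
    print("<a href='%s'>%s</a>" % (link.get("href"), link.text))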
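To make the goal concrete, here is a rough sketch of what I imagine "running them together" could look like. It assumes each scraper is first wrapped in a function that returns a list of URLs. The names `scrape_site_one`, `scrape_site_two` and `run_all`, and the example.com URLs, are hypothetical placeholders, not my real scrapers. It uses the standard library's `concurrent.futures.ThreadPoolExecutor`, which may or may not be the simplest option.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_site_one():
    # Hypothetical stand-in for scraper 1; the real function would hold
    # the requests/BeautifulSoup code and return the posting URLs.
    return ["http://example.com/jobs/1"]


def scrape_site_two():
    # Hypothetical stand-in for scraper 2.
    return ["http://example.com/jobs/2", "http://example.com/jobs/3"]


def run_all(scrapers, max_workers=4):
    """Run each scraper function in its own thread and merge the URL lists."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `scrapers`
        for urls in pool.map(lambda scrape: scrape(), scrapers):
            results.extend(urls)
    return results


if __name__ == "__main__":
    for url in run_all([scrape_site_one, scrape_site_two]):
        print(url)
```

Threads seem like a reasonable fit here because the scrapers spend most of their time waiting on network responses, and adding another site would only mean adding one more function to the list passed to `run_all`.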