我是Python的初学者。我刚刚编写了一个非常简单的Web爬虫,当我运行爬虫时会导致高内存使用率。我不确定代码中有什么问题,我花了很长时间但无法解决它。
抓取工具提取每个作业的链接,并从链接生成每个作业的ID。然后它通过xpath从链接读取作业标题,最后打印出所有信息。即使链接号仅为50,但每次打印出所有信息之前,它都会导致我的计算机几乎没有响应。以下是我的代码。 我刚刚添加了标题,这需要解析每个作业的链接。我的环境是Ubuntu16.04,Python3.5,Pycharm。
import requests
from lxml import etree
import re
headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Host": "jobs.51job.com",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
def generate_info(url):
html = requests.get(url, headers=headers)
html.encoding = 'GBK'
select = etree.HTML(html.text.encode('utf-8'))
job_id = re.sub('[^0-9]', '', url)
job_title=select.xpath('/html/body//h1/text()')
print(job_id,job_title)
sum_page='http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99°reefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9'
sum_html=requests.get(sum_page)
sum_select=etree.HTML(sum_html.text.encode('utf-8'))
urls= sum_select.xpath('//*[@id="resultList"]/div/p/span/a/@href')
for url in urls:
generate_info(url)