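I'm trying to write a script that fetches job details from a particular website. When I compare the page source in Google Chrome (Command-Option-U) with what the developer tools show (Command-Option-I), the HTML appears to be different. The developer tools show the actual details, which I could parse out of the HTML.
An example of what I'm after can be found in the first job posted on the site: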
Canada-Alberta-Fort McMurray,Canada-Alberta-Edmonton
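I know I need to submit the form with POST, but beyond that I can't work out how to get the HTML that appears in the developer tools but is missing from the response to my request.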
import requests
url='https://caterpillar.taleo.net/careersection/cat+external+cs/jobsearch.ftl?lang=en&portal=4140124208&src=CWS-10005'
r = requests.post(url, data={'dropListSize': 100})
print(r.status_code, r.reason)
html=r.text
I also tried a similar strategy using mechanize:
import mechanize
br = mechanize.Browser()
br.open(url)
# List the forms on the page to find the one to submit
for f in br.forms():
    print(f)
br.select_form('ftlform')
br.form["dropListSize"] = ["100"]
br.submit()
html=br.response().read()
A related question is how to get to the next page, but I think I may be able to figure that out.
Answer 0 (score: 2)
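The response to the XHR POST request sent to the https://caterpillar.taleo.net/careersection/cat+external+cs/jobsearch.ajax endpoint contains all of the search results. You could try to simulate that request (judging by the number of parameters and the response format, I suspect it would not be much fun), or you could load the page in a real browser via selenium and let the browser render it, without worrying about how the search results are delivered.
Working example using selenium + the PhantomJS headless browser: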
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://caterpillar.taleo.net/careersection/cat+external+cs/jobsearch.ftl?lang=en&portal=4140124208&src=CWS-10005'
driver = webdriver.PhantomJS()
driver.get(url)
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.contentlist")))
# Each result row is a tr.ftlrow; the job title is the link text inside .titlelink
for row in table.find_elements(By.CSS_SELECTOR, "tr.ftlrow"):
    title = row.find_element(By.CSS_SELECTOR, ".titlelink a").text
    print(title)
driver.close()
Prints:
Sales accountant
Manufacturing Project Engineer
Staff Accountant - Accountable
Hydraulic Cylinder Design Engineer
Engineering Supervisor(Hydraulic Cylinder)
Design Engineer
Senior Design Engineer
Senior Engineer
Senior Design Engineer
Dealer Solution Network (DSN) Analyst
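If you would rather parse the rendered HTML yourself than query elements through the driver, the same row and link selection can be sketched with the standard library alone. This is only a rough approximation of the `tr.ftlrow` and `.titlelink a` selectors used above: `JobTitleParser`, `extract_titles` and the `SAMPLE` markup are hypothetical names and data made up for illustration, and the real page's markup may well differ.

```python
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collect the text of every <a> that has both a tr.ftlrow ancestor
    and an ancestor carrying the class "titlelink"."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._stack = []     # currently open tags as (tag, set_of_classes)
        self._buffer = None  # text chunks while inside a matching <a>

    def _inside(self, tag, cls):
        # True if some open ancestor has class `cls` (and tag `tag`, if given)
        return any((tag is None or t == tag) and cls in c
                   for t, c in self._stack)

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if (tag == "a" and self._inside("tr", "ftlrow")
                and self._inside(None, "titlelink")):
            self._buffer = []
        self._stack.append((tag, classes))

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None
        # Pop back to the matching open tag; this also discards any
        # unclosed tags (such as <br>) opened after it.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_titles(html):
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical markup that only mimics the class names used in the answer
SAMPLE = """
<table class="contentlist">
  <tr class="ftlrow"><td><span class="titlelink"><a href="#">Sales accountant</a></span></td></tr>
  <tr class="ftlrow"><td><span class="titlelink"><a href="#">Design Engineer</a></span></td></tr>
  <tr class="header"><td><a href="#">Not a job</a></td></tr>
</table>
"""

print(extract_titles(SAMPLE))
```

With selenium you would pass `driver.page_source` to `extract_titles` once the wait has succeeded. Passing `r.text` from the plain `requests` call would presumably return an empty list, since the rows only arrive through the XHR request.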