Python HTML scrape produces no results

Time: 2017-03-19 17:24:57

Tags: python python-3.x xpath web-scraping lxml

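(New to Python, and this is my first post)

Please see the code below. The problem is this: I am trying to scrape all the job titles from the page in the code, but when I print the list I get no values. I have tried different XPaths to see whether I could get anything to print, but the list is empty every time.

Does anyone know whether there is a problem with my code, or whether there is something about the site's structure that I have not accounted for?

Thanks in advance!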

from lxml import html
import requests

page = requests.get("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
tree = html.fromstring(page.content)

Job_Title = tree.xpath('//*[@id="widget-jobsearch-results-list"]/div/div/div/div[@class="jobTitle"]/a/text()')

print (Job_Title)

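3 answers:

Answer 0: (score: 1)

The information you are looking for is generated dynamically by JavaScript, whereas requests only retrieves the initial HTML page source.

You may need to use selenium (+ chromedriver) to get the required data: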

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
# The job links are injected by JavaScript, so wait (up to 10 s) until one is clickable
xpath = "//a[starts-with(@id, 'job-results')]"
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium 4 releases removed
jobs = [job.text for job in driver.find_elements(By.XPATH, xpath)]
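
Once the page has been rendered, its markup can also be handed back to lxml, as in the original question. The following is only a minimal sketch: the helper name `extract_link_texts` and the sample markup are hypothetical, invented for illustration, and the real rendered structure of the site may differ.

```python
from lxml import html


def extract_link_texts(rendered_source, xpath="//a[starts-with(@id, 'job-results')]"):
    """Return the stripped text of every element matched by `xpath`."""
    tree = html.fromstring(rendered_source)
    return [element.text_content().strip() for element in tree.xpath(xpath)]


# Hypothetical rendered markup (NOT copied from the real site).
SAMPLE_RENDERED = """
<div id="widget-jobsearch-results-list">
    <a id="job-results-1" href="#">Example Analyst</a>
    <a id="job-results-2" href="#"> Example Engineer </a>
</div>
"""

print(extract_link_texts(SAMPLE_RENDERED))
```

With Selenium, `driver.page_source` could be passed as `rendered_source` after the wait has succeeded, which keeps the browser-driving code separate from the parsing code.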

Answer 1: (score: 1)

Try a library that can execute JS (dryscrape is a lightweight alternative).

Here is a code example:

from lxml import html
import dryscrape

session = dryscrape.Session()
session.visit("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
# session.body() returns the rendered document as a string, so pass it to lxml directly
page = session.body()
tree = html.fromstring(page)

Job_Title = tree.xpath('//*[@id="widget-jobsearch-results-list"]/div/div/div/div[@class="jobTitle"]/a/text()')

print(Job_Title)

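Answer 2: (score: 0)

The page uses JS to build the HTML (the results table). In other words, the target block does not exist as HTML in the page source. Open the source and check it: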

<div class="entry-content-wrapper clearfix">
    <div id="widget-jobsearch-results-list"></div> <!-- Target block is empty! -->
    <div id="widget-jobsearch-results-pages"></div>
</div>
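
The same check can be scripted rather than done by eye. This is a rough sketch that assumes the static markup looks like the snippet above; `container_is_empty` is a hypothetical helper name, not part of lxml.

```python
from lxml import html

# Static markup as quoted from the page source (before any JavaScript runs).
STATIC_SOURCE = """
<div class="entry-content-wrapper clearfix">
    <div id="widget-jobsearch-results-list"></div>
    <div id="widget-jobsearch-results-pages"></div>
</div>
"""


def container_is_empty(source, element_id):
    """Return True if the element with this id has no children and no text."""
    tree = html.fromstring(source)
    matches = tree.xpath("//*[@id=$eid]", eid=element_id)
    if not matches:
        raise LookupError("no element with id %r" % element_id)
    node = matches[0]
    return len(node) == 0 and not (node.text or "").strip()


print(container_is_empty(STATIC_SOURCE, "widget-jobsearch-results-list"))
```

If this reports an empty container for the HTML returned by requests, no XPath below that container can match, which would explain the empty list in the question.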