Scraping a react.js web page with dryscrape

Date: 2017-03-10 09:54:45

Tags: python web-scraping dryscrape

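I cannot scrape the home page http://www.jobs.ch, which is built with react.js. I want to put the term Business into the search box and run the search. Dryscrape works for another example page that is not a react.js page.

How can I enter the term Business into this search field?

The error message when I run my script: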

ubuntu@ubuntu:~/scripts$ python jobs.py
Traceback (most recent call last):
  File "jobs.py", line 30, in <module>
    name.set("Business")
AttributeError: 'NoneType' object has no attribute 'set'

Here is my script:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# We will write a Python script to visit a webpage, fill in the form and submit the form.

import dryscrape

# make sure you have xvfb installed
dryscrape.start_xvfb()

root_url = 'http://www.jobs.ch/en/vacancies/'

if __name__ == '__main__':
    # set up a web scraping session
    session = dryscrape.Session(base_url = root_url)

    # we don't need images
    session.set_attribute('auto_load_images', False)

    session.set_header('User-agent', 'Google Chrome')

    # visit exact webpage which is the form in this example
    session.visit('http://www.jobs.ch/en/vacancies/')

    # fill in the form by taking ID of field from webdev tool
    #name = session.at_xpath('//*[@data-reactid="107"]')
    name = session.at_xpath('//*[@data-reactid="107"]//*[@class="search-input col-sm-4 col-md-5"]')

    name.set("Business")

    # submit form
    name.form().submit()

    # save a screenshot of the web page
    session.render("jobs.png")
    print("Session rendered successfully!")

1 answer:

Answer 0: (score: 1)

I think there is a problem with your xpath, but apart from that, your session itself is not configured correctly.

This line

session = dryscrape.Session(base_url = root_url)

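sets the session's base URL to root_url, so when you call session.visit('http://www.jobs.ch/en/vacancies/'), you are actually visiting the concatenation of root_url and the URL passed to session.visit.

If you print session.url(), you will see that the URL actually visited is http://www.jobs.ch/en/vacancies/http://www.jobs.ch/en/vacancies/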
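To make the doubling concrete, here is a minimal sketch that uses only the standard library. The helper naive_join is hypothetical and is not part of dryscrape; it simply assumes, as described above, that the base URL is prepended to whatever is passed to visit. The standard urljoin is shown only for contrast.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.jobs.ch/en/vacancies/'


def naive_join(base_url, url):
    # Assumption: stands in for a session that simply prepends
    # base_url to the URL given to visit(). Hypothetical helper.
    return base_url + url


# Passing the full URL again produces the doubled address.
doubled = naive_join(BASE_URL, 'http://www.jobs.ch/en/vacancies/')

# Passing an empty relative path leaves just the base URL.
relative = naive_join(BASE_URL, '')

# For contrast, urljoin returns an absolute second argument unchanged.
joined = urljoin(BASE_URL, 'http://www.jobs.ch/en/vacancies/')

print(doubled)
print(relative)
print(joined)
```

Under that assumption, either dropping base_url from the Session constructor or passing only a relative path to visit should avoid the doubled address.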

The xpath I got for the page from Chrome (Inspect -> right-click -> Copy XPath) is //*[@id="react-root"]/div/div[1]/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/div/div[1]/div/input

Please confirm that you are using the correct xpath.
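One way to check the xpath from the script itself is to test the result of at_xpath before calling set on it, since a lookup that matches nothing returns None and leads to the AttributeError in the question. The sketch below is an illustration only: find_or_fail and FakeSession are hypothetical names, and the polling loop assumes the input may appear a little after the page loads, which may or may not be the case for this site. The only session method it relies on is at_xpath, which the question's script already uses.

```python
import time


def find_or_fail(session, xpath, timeout=10.0, interval=0.5):
    """Poll session.at_xpath(xpath) until it returns a node or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        node = session.at_xpath(xpath)
        if node is not None:
            return node
        if time.monotonic() >= deadline:
            # Fail with a message that names the xpath instead of a
            # later AttributeError on None.
            raise LookupError('no element matched xpath: %s' % xpath)
        time.sleep(interval)


class FakeSession(object):
    """Stand-in for a real session so the sketch runs without dryscrape."""

    def __init__(self, misses):
        self.misses = misses  # number of lookups that return None first
        self.calls = 0

    def at_xpath(self, xpath):
        self.calls += 1
        return None if self.calls <= self.misses else 'input-node'


if __name__ == '__main__':
    fake = FakeSession(misses=2)
    print(find_or_fail(fake, '//input', timeout=1.0, interval=0.01))
```

With a real session, the returned node would take the place of name in the question's script, so that name.set("Business") is only reached when the xpath actually matched something.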