如何使用硒和scrapy自动化过程?

时间:2015-05-07 13:06:33

标签: python ajax selenium scrapy

我开始知道你需要使用像硒这样的webtoolkits来自动化抓取。

我将如何点击Google Play商店中的下一个按钮,以便仅为我的大学目的抓取评论!!

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from selenium import webdriver
import time


class Product(scrapy.Item):
    title = scrapy.Field()


class FooSpider(CrawlSpider):
    name = 'foo'

    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Chrome(executable_path="C:\chrm\chromedriver.exe")
        self.browser.implicitly_wait(60) # 

    def parse(self,response):
        self.browser.get(response.url)
        sites = response.xpath('//div[@class="single-review"]/div[@class="review-header"]')
        items = []
        for i in range(0,200):
            time.sleep(20)
            button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
            button.click()
            self.browser.implicitly_wait(30)    
            for site in sites:
                item = Product()

                item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
                yield item

我已经更新了我的代码,它只是一次又一次地给我重复40项。我的for循环错了吗?

似乎正在更新的源代码未传递给xpath,这就是为什么它返回相同的40个项目

1 个答案:

答案 0 :(得分:3)

我做了类似的事情:

from scrapy import CrawlSpider
from selenium import webdriver
import time

class FooSpider(CrawlSpider):
    name = 'foo'
    allow_domains = 'foo.com'
    start_urls = ['foo.com']

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Firefox()
        self.browser.implicitly_wait(60)

    def parse_foo(self.response):
        self.browser.get(response.url)  # load response to the browser
        button = self.browser.find_element_by_xpath("path") # find 
        # the element to click to
        button.click() # click
        time.sleep(1) # wait until the page is fully loaded
        source = self.browser.page_source # get source of the loaded page
        sel = Selector(text=source) # create a Selector object
        data = sel.xpath('path/to/the/data') # select data
        ...
但是,最好不要等待一段固定的时间。因此,您可以使用此处描述的方法之一http://www.obeythetestinggoat.com/how-to-get-selenium-to-wait-for-page-load-after-a-click.html而不是time.sleep(1)