Python Scrapy working (only half the time)

Asked: 2017-06-02 17:11:12

Tags: python-2.7 scrapy screen-scraping

I created a Python Scrapy project to extract the prices of some Google Flights.

I configured the middleware to use PhantomJS instead of a normal browser.

import time

from scrapy.http import HtmlResponse
from selenium import webdriver


class JSMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS()
        try:
            driver.get(request.url)
            time.sleep(1.5)
        except Exception as e:
            raise ValueError("request url failed -\n url: {},\n error: {}"
                             .format(request.url, e))
        body = driver.page_source
        # encoding='utf-8' - add to the HTML response if necessary
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8',
                            request=request)
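A fixed `time.sleep(1.5)` is a common cause of exactly this kind of intermittent failure: the JavaScript-rendered prices sometimes take longer than 1.5 s to appear, so half the time `page_source` is grabbed too early. A more robust approach is to poll until the content actually shows up (selenium also ships `WebDriverWait` for this purpose). A minimal sketch of the polling idea, independent of selenium:

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns the truthy value, or None if the timeout was reached.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    return None
```

Inside `process_request` this could wrap a check such as `lambda: 'OMOBOQD-d-Ab' in driver.page_source` (the class name taken from the spider below) before reading `driver.page_source`.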

In settings.py I added:

DOWNLOADER_MIDDLEWARES = {
    # path to the middleware class: middleware order value
    'scraper_module.middlewares.middleware.JSMiddleware': 543,
    # disable the built-in middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

I also created the following spider class:

import scrapy
from scrapy import Selector

class Gspider(scrapy.Spider):
    name = "google_spider"

    def __init__(self):
        self.start_urls = ["https://www.google.pt/flights/#search;f=LIS;t=POR;d=2017-06-18;r=2017-06-22"]
        self.prices = []
        self.links = []

    def clean_price(self, part):
        #part received as a list
        #the encoding is utf-8
        part = part[0]
        part = part.encode('utf-8')
        part = filter(str.isdigit, part)
        return part
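Note that `filter(str.isdigit, part)` only returns a string on Python 2 byte strings. A version-independent equivalent of the digit stripping (a sketch, not part of the original code) uses a regex instead:

```python
import re


def clean_price(text):
    """Keep only the digit characters of a price string, e.g. '1.234 EUR' -> '1234'."""
    return re.sub(r"\D", "", text)
```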

    def clean_link(self, part):
        part = part[0]
        part = part.encode('utf-8')
        return part

    def get_part(self, var_holder, response, marker, inner_marker, amount=1):
        selector = Selector(response)
        divs = selector.css(marker)
        for n, div in enumerate(divs):
            if n < amount:
                part = div.css(inner_marker).extract()
                if inner_marker == '::text':
                    part = self.clean_price(part)
                else:
                    part = self.clean_link(part)
                var_holder.append(part)
            else:
                break
        return var_holder

    def parse(self, response):
        prices, links = [], []
        prices = self.get_part(prices, response, 'div.OMOBOQD-d-Ab', '::text')
        print prices
        links = self.get_part(links, response, 'a.OMOBOQD-d-X', 'a::attr(href)')
        print links
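Printing works for debugging, but Scrapy spiders normally `yield` items from `parse` so the feed exporters can pick them up. A hedged sketch, with hypothetical field names `price` and `link`:

```python
def build_items(prices, links):
    # Pair each extracted price with its link; Scrapy items can be plain dicts.
    return [{"price": p, "link": l} for p, l in zip(prices, links)]
```

In `parse`, `for item in build_items(prices, links): yield item` would replace the `print` statements.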

The problem is that when I run the code in the shell, about half of the time I successfully get the requested prices and links, but the other half of the time the final vectors that should contain the extracted data are empty.

I get no errors during execution.

Does anyone know why this happens? Here are the logs from the command line: successful extraction

unsuccessful extraction

1 answer:

Answer 0 (score: 0)

Google has a very strict policy when it comes to scraping. (Which is quite hypocritical, given that they scrape the whole web without asking anyone...)

You should find an API, as mentioned earlier in the comments, or perhaps use proxies. An easy way is to use Crawlera. It manages thousands of proxies so you don't have to bother. I personally use it to scrape Google and it works perfectly. The downside is that it is not free.
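For reference, enabling Crawlera in a Scrapy project is a settings change. A sketch assuming the `scrapy-crawlera` plugin is installed; the key names come from that plugin, and the API key is a placeholder:

```python
# settings.py -- sketch, assumes `pip install scrapy-crawlera`
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'
```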