Scrapy spider shuts down at around 500-600 items

Time: 2017-07-19 20:24:24

Tags: python python-2.7 selenium web-scraping scrapy

I hired a freelancer to build this spider for eBay. I'm just learning Python and needed it done fast, so this was our solution. I've asked him to figure out what's causing the problem and he has tried, but he doesn't know what it is. He says it runs fine on his computer, but it crashes on every computer I've tried. I've used Ubuntu and Windows, on different machines and in different locations with internet access.

It's built around Python 2.7, and I've made sure I have the latest versions of Scrapy, Selenium, and bs4.

I'm running the Scrapy spider against eBay. When I run it, it gets through about 500-600 items and then the spider closes.

The program parses the pages, extracts the data, and saves it to a CSV file without any problems. It also moves from one page to the next just fine.

Sometimes it closes like this:

2017-07-19 15:27:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET ***link***> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

2017-07-19 15:27:21 [scrapy.core.scraper] ERROR: Error downloading <GET ***link***>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
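For reference, the retry behavior in the log above ("failed 3 times" is the initial attempt plus Scrapy's default of 2 retries) is governed by standard Scrapy settings. A sketch of loosening them; the keys are standard Scrapy settings, but the values are illustrative guesses, not the project's actual configuration:

```python
# Sketch: Scrapy settings that make the downloader more tolerant of
# connections that drop mid-crawl. All keys are standard Scrapy
# settings; the values are illustrative, not from the original project.
RETRY_SETTINGS = {
    "RETRY_ENABLED": True,        # retry middleware is on by default
    "RETRY_TIMES": 10,            # default is 2, matching "failed 3 times" above
    "DOWNLOAD_TIMEOUT": 60,       # seconds before giving up on a response
    "DOWNLOAD_DELAY": 1.0,        # slow down to reduce dropped connections
    "AUTOTHROTTLE_ENABLED": True, # back off automatically under load
}
```

These would go in the project's `settings.py`, or in the spider's `custom_settings` dict.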

Other times it closes like this, for no apparent reason:

2017-07-19 13:20:01 [scrapy.core.engine] INFO: Closing spider (finished)

The start link I give it has 50,000 items, so having it stop at around 500 cuts it way short.

Here is part of the code:

import scrapy, time
from urlparse import urljoin
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support.ui import Select
from ebay.items import EbayItem

url = raw_input("ENTER THE URL TO SCRAPE : ")
co = 1
ch = 1

class EbayspiderSpider(scrapy.Spider):
    name = "ebayspider"
    #start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']
    start_urls = [str(url)]

    def __init__(self):
        self.driver = webdriver.Chrome()

    def __del__(self):
        self.driver.quit()

    def parse(self, response):
        global ch, co

        try:
            # restart the Chrome driver every 5 list pages
            if ch > 5:
                self.driver.quit()
                self.driver = webdriver.Chrome()
                ch = 1
                co = 1

            for attr in response.xpath('//*[@id="ListViewInner"]/li'):
                item = EbayItem()
                link = attr.css('a.vip ::attr(href)').extract_first()
                yield scrapy.Request(urljoin(response.url, link), callback=self.parse_link, meta={'item': item})

            next_page = response.css('.gspr.next ::attr(href)').extract_first()
            if next_page:
                ch += 1
                yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)

        except:
            # fallback selectors for the alternate eBay list layout
            if ch > 5:
                self.driver.quit()
                self.driver = webdriver.Chrome()
                ch = 1
                co = 1

            SET_SELECTOR = '.li.nol'
            for attr in response.css(SET_SELECTOR):
                item = EbayItem()
                linkse = '.v4lnk ::attr(href)'
                link = attr.css(linkse).extract_first()
                yield scrapy.Request(urljoin(response.url, link), callback=self.parse_link, meta={'item': item})

            next_page = response.css('td.next a ::attr(href)').extract_first()
            if next_page:
                ch += 1
                yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)

Any ideas about this problem are welcome; I'll try anything.

When the spider closes, is there a way to reopen it from where it stopped?

I'm not the best at Python, but I've learned a lot wanting to solve this, and I'll try anything.

Thanks for your help!
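On the resuming question: Scrapy has a built-in feature for this, the `JOBDIR` setting, which serializes the pending request queue and the seen-request filter to disk so an interrupted crawl can pick up where it left off. A minimal sketch; the directory name is arbitrary:

```python
# Sketch: resuming a stopped crawl with Scrapy's JOBDIR feature.
# The same command both starts and resumes the crawl: if the job
# directory already exists, Scrapy reloads the saved request queue
# instead of starting over.
def crawl_command(jobdir="crawls/ebayspider-run1"):
    # jobdir is an arbitrary path; use one directory per distinct crawl
    return "scrapy crawl ebayspider -s JOBDIR=%s" % jobdir

print(crawl_command())
```

A fresh run creates the directory; rerunning the same command after the spider stops (or after Ctrl-C) resumes from the saved state.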

1 Answer:

Answer 0 (score: 0)

I think you can get that information from the eBay API.
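To sketch what that could look like: eBay's Finding API exposes listing searches over plain HTTP, which avoids scraping the HTML at all. The parameter names below follow the public Finding API documentation, but the app ID is a placeholder you would obtain from the eBay developer program:

```python
# Sketch: building a findItemsByKeywords request against eBay's
# Finding API, as the answer suggests. "MY-APP-ID" is a placeholder
# credential; register with the eBay developer program for a real one.
try:
    from urllib import urlencode           # Python 2, matching the question
except ImportError:
    from urllib.parse import urlencode     # Python 3

FINDING_ENDPOINT = "https://svcs.ebay.com/services/search/FindingService/v1"

def build_finding_url(app_id, keywords, page=1):
    params = {
        "OPERATION-NAME": "findItemsByKeywords",
        "SERVICE-VERSION": "1.0.0",
        "SECURITY-APPNAME": app_id,
        "RESPONSE-DATA-FORMAT": "JSON",
        "keywords": keywords,
        "paginationInput.pageNumber": page,  # paginate instead of scraping "next" links
    }
    return FINDING_ENDPOINT + "?" + urlencode(params)
```

Fetching each page of results then becomes a single GET request with no browser or Selenium involved.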