I hired a freelancer to build this spider for eBay. I'm just learning Python and needed it done quickly, so that was our solution. I've asked him to figure out what's causing the problem, and he has tried, but he doesn't know what it is. He says it runs fine on his computer, yet it crashes on every computer I've tried, on both Ubuntu and Windows, across different machines and internet connections.
It's built around Python 2.7, and I've made sure I have the latest versions of Scrapy, Selenium, and bs4.
I'm running the Scrapy spider against eBay. When I run it, it gets through roughly 500-600 items and then closes the spider.
The program parses the pages, extracts the data, and saves it to a CSV file without any problems. It also moves from one page to the next just fine.
Sometimes it shuts down like this:
2017-07-19 15:27:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET ***link***> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-07-19 15:27:21 [scrapy.core.scraper] ERROR: Error downloading <GET ***link***>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
Other times it shuts down like this, with no stated reason:
2017-07-19 13:20:01 [scrapy.core.engine] INFO: Closing spider (finished)
The starting link I give it has 50,000 items, so having it stop at around 500 cuts it well short.
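In case it helps, here is a minimal sketch of the Scrapy settings that, as I understand it, govern retries and request pacing. The values below are my own guesses for a settings.py, not something taken from the freelancer's project:

    # settings.py -- hypothetical values, not from the original project
    RETRY_ENABLED = True
    RETRY_TIMES = 5              # retry each failed request up to 5 times (Scrapy's default is 2)
    DOWNLOAD_DELAY = 1.0         # pause between requests so eBay is less likely to drop the connection
    CONCURRENT_REQUESTS = 4      # fewer parallel connections than the default of 16
    AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to the server's response times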
Here is part of the spider code:
    import scrapy
    import time
    from urlparse import urljoin

    from selenium import webdriver
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait as wait
    from selenium.webdriver.support.ui import Select
    from bs4 import BeautifulSoup

    from ebay.items import EbayItem

    url = raw_input("ENTER THE URL TO SCRAPE : ")
    co = 1
    ch = 1

    class EbayspiderSpider(scrapy.Spider):
        name = "ebayspider"
        # start_urls = ['http://www.ebay.com/sch/hfinney/m.html?item=132127244893&rt=nc&_trksid=p2047675.l2562']
        start_urls = [str(url)]

        def __init__(self):
            self.driver = webdriver.Chrome()

        def __del__(self):
            self.driver.quit()

        def parse(self, response):
            global ch, co
            try:
                # Restart Chrome every five pages.
                if ch > 5:
                    self.driver.quit()
                    self.driver = webdriver.Chrome()
                    ch = 1
                    co = 1
                # List-view layout: queue a request for every item link on the page.
                for attr in response.xpath('//*[@id="ListViewInner"]/li'):
                    item = EbayItem()
                    linkse = 'a.vip ::attr(href)'
                    link = attr.css(linkse).extract_first()
                    yield scrapy.Request(urljoin(response.url, link),
                                         callback=self.parse_link, meta={'item': item})
                # Follow the "next page" arrow, if there is one.
                next_page = response.css('.gspr.next ::attr(href)').extract_first()
                if next_page:
                    ch += 1
                    yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)
            except:
                # Fallback for the alternative page layout.
                if ch > 5:
                    self.driver.quit()
                    self.driver = webdriver.Chrome()
                    ch = 1
                    co = 1
                for attr in response.css('.li.nol'):
                    item = EbayItem()
                    link = attr.css('.v4lnk ::attr(href)').extract_first()
                    yield scrapy.Request(urljoin(response.url, link),
                                         callback=self.parse_link, meta={'item': item})
                next_page = response.css('td.next a ::attr(href)').extract_first()
                if next_page:
                    ch += 1
                    yield scrapy.Request(urljoin(response.url, next_page), callback=self.parse)
Any ideas about this problem are welcome.
When the spider shuts down, is there a way to restart it from where it stopped?
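From what I've read in the Scrapy docs, job persistence is supposed to do exactly that: pointing the crawler at a JOBDIR directory persists the scheduler queue and the seen-request fingerprints, so a stopped crawl can pick up where it left off. A minimal sketch (the directory name below is arbitrary, my own choice):

    # Inside the spider class; tells Scrapy where to persist crawl state.
    custom_settings = {'JOBDIR': 'crawls/ebayspider-1'}

The equivalent from the command line would be: scrapy crawl ebayspider -s JOBDIR=crawls/ebayspider-1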
I'm not the best with Python, but I've learned a lot trying to solve this, and I'll try anything.
Thanks for your help!
Answer 0 (score: 0)
I think you can get that information from the eBay API instead of scraping the pages.
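For example, the ebaysdk package wraps eBay's Finding API. A minimal sketch, assuming you have registered for an eBay developer App ID (the keywords and the fields printed below are illustrative, not taken from the question):

    from ebaysdk.finding import Connection

    # 'YOUR_APP_ID' is a placeholder for a real eBay developer App ID.
    api = Connection(appid='YOUR_APP_ID', config_file=None)
    response = api.execute('findItemsAdvanced', {'keywords': 'vintage camera'})
    for result in response.reply.searchResult.item:
        print result.title, result.sellingStatus.currentPrice.value

Going through the API avoids scraping HTML (and Selenium) entirely, so layout changes and dropped connections like the ones in your log stop being a factor.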