Question

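I have a script that runs Selenium locally as a web service, so that I can use its responses (rendered by Selenium) inside my spider.

This is the web service where Selenium runs locally: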

from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])

        self.driver.get(url)

        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)

This is my Scrapy spider, which uses that response to parse the title from each web page.

import scrapy
from urllib.parse import quote
from scrapy.crawler import CrawlerProcess

class StackSpider(scrapy.Spider):
    name = 'stackoverflow'
    url = 'https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50'
    base = 'https://stackoverflow.com'

    def start_requests(self):
        link = 'http://127.0.0.1:5000/?url={}'.format(quote(self.url))
        yield scrapy.Request(link,callback=self.parse)

    def parse(self, response):
        for item in response.css(".summary .question-hyperlink::attr(href)").getall():
            nlink = self.base + item
            link = 'http://127.0.0.1:5000/?url={}'.format(quote(nlink))
            yield scrapy.Request(link,callback=self.parse_info,dont_filter=True)

    def parse_info(self, response):
        item = response.css('h1[itemprop="name"] > a::text').get()
        yield {"title":item}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(StackSpider)
    c.start()

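The problem is that the above script gives me the same title several times, then another title several times, and so on.

What change should I make so that the script works the right way?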

Answer 1

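I ran both scripts together, and they ran as expected. So, my findings:

The downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError error cannot be resolved without permission from the server (here, eBay).
Scrapy's log: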

2019-05-25 07:28:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 72,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 64,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8,
 'downloader/request_bytes': 55523,
 'downloader/request_count': 81,
 'downloader/request_method_count/GET': 81,
 'downloader/response_bytes': 2448476,
 'downloader/response_count': 9,
 'downloader/response_status_count/200': 9,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2019, 5, 25, 1, 58, 41, 234183),
 'item_scraped_count': 8,
 'log_count/DEBUG': 90,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 9,
 'retry/count': 72,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 64,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 8,
 'scheduler/dequeued': 81,
 'scheduler/dequeued/memory': 81,
 'scheduler/enqueued': 131,
 'scheduler/enqueued/memory': 131,
 'start_time': datetime.datetime(2019, 5, 25, 1, 56, 57, 751009)}
2019-05-25 07:28:41 [scrapy.core.engine] INFO: Spider closed (shutdown)

You can see that only 8 items were scraped. These are just logos and other unrestricted resources.
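To make those numbers easier to read, here is a minimal sketch that works out how many requests failed. The values are copied from the stats dump above; the `summarize` helper is a hypothetical name used only for illustration, not part of Scrapy.

```python
# Values copied from the Scrapy stats dump in this answer.
stats = {
    "downloader/exception_count": 72,
    "downloader/request_count": 81,
    "downloader/response_count": 9,
    "item_scraped_count": 8,
    "retry/count": 72,
}


def summarize(stats):
    """Hypothetical helper: derive failure figures from a Scrapy stats dict."""
    requests = stats["downloader/request_count"]
    responses = stats["downloader/response_count"]
    failed = requests - responses  # requests that never produced a response
    return {"failed_requests": failed, "failure_rate": failed / requests}


if __name__ == "__main__":
    summary = summarize(stats)
    print("failed requests:", summary["failed_requests"])
    print("failure rate: {:.0%}".format(summary["failure_rate"]))
```

The number of failed requests matches both `downloader/exception_count` and `retry/count`, so every request that did not return a response ended in a connection error and a retry.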

Server log:

s:// .ebaystatic.com http:// .ebay.com https://*.ebay.com". Either the 'unsafe-inline' keyword, a hash ('sha256-40GZDfucnPVwbvI/Q1ivGUuJtX8krq8jy3tWNrA/n58='), or a nonce ('nonce-...') is required to enable inline execution. ", source: https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=323815597324&t=0&tid=10&category=169291&seller=wardrobe-ltd&excSoj=1&excTrk=1&lsite=0&ittenable=false&domain=ebay.com&descgauge=1&cspheader=1&oneClk=1&secureDesc=1 (1)

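eBay does not allow you to scrape it.

So how can the task be completed?

Before every scrape, check the target site's /robots.txt. For eBay it is http://www.ebay.com/robots.txt, where you can see that almost everything is disallowed.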

User-agent: *
Disallow: /*rt=nc
Disallow: /b/LH_
Disallow: /brw/
Disallow: /clp/
Disallow: /clt/store/
Disallow: /csc/
Disallow: /ctg/
Disallow: /ctm/
Disallow: /dsc/
Disallow: /edc/
Disallow: /feed/
Disallow: /gsr/
Disallow: /gwc/
Disallow: /hcp/
Disallow: /itc/
Disallow: /lit/
Disallow: /lst/ng/
Disallow: /lvx/
Disallow: /mbf/
Disallow: /mla/
Disallow: /mlt/
Disallow: /myb/
Disallow: /mys/
Disallow: /prp/
Disallow: /rcm/
Disallow: /sch/%7C
Disallow: /sch/*LH_
Disallow: /sch/aop/
Disallow: /sch/ctg/
Disallow: /sl/node
Disallow: /sme/
Disallow: /soc/
Disallow: /talk/
Disallow: /tickets/
Disallow: /today/
Disallow: /trylater/
Disallow: /urw/write-review/
Disallow: /vsp/
Disallow: /ws/
Disallow: /sch/*modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /b/modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /itm/_nkw
Disallow: /itm/?fits
Disallow: /itm/&fits
Disallow: /cta/
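The robots.txt check can also be done in code. Below is a minimal sketch using Python's standard `urllib.robotparser`. It parses a few rules copied from the excerpt above instead of fetching the live file, so it runs offline. The `is_allowed` helper and the `/help/home` URL are assumptions made up for illustration. As far as I know, the standard-library parser matches plain path prefixes and does not interpret `*` inside a path as a wildcard, so only prefix rules are used here.

```python
from urllib.robotparser import RobotFileParser

# A few prefix rules copied from the robots.txt excerpt in this answer;
# the real file is much longer.
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /sch/aop/
Disallow: /ws/
Disallow: /feed/
"""


def is_allowed(url, robots_text=ROBOTS_EXCERPT, user_agent="*"):
    """Hypothetical helper: True if robots_text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = (
        "http://www.ebay.com/ws/eBayISAPI.dll",  # falls under "Disallow: /ws/"
        "http://www.ebay.com/help/home",         # illustrative path with no matching rule
    )
    for url in urls:
        print(url, "->", "allowed" if is_allowed(url) else "disallowed")
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()` to download the file before calling `can_fetch()`. Inside a Scrapy project, setting `ROBOTSTXT_OBEY = True` makes Scrapy perform a comparable check itself.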
So go to https://developer.ebay.com/api-docs/developer/static/developer-landing.html and check their documentation. The site has simpler sample code for getting the item data you need without scraping.

Can't make my script process the locally created server response in the right way
