Pressing a confirm button with Splash and Scrapy

Date: 2018-09-04 12:18:10

Tags: python scrapy scrapy-splash

I am trying to press a confirm button with Splash and then crawl the site with scrapy.CrawlSpider. Executing the Lua script in the Splash browser UI gives the expected result (the button is pressed). But when I use SplashRequest, I get the following error:

2018-09-04 14:58:18 [scrapy.core.scraper] ERROR: Error downloading <GET https://auto.ru/moskva/cars/used/?sort=fresh_relevance_1-desc&page=99 via http://127.0.0.1:8050/execute>
Traceback (most recent call last):
  File "/home/kpalyanichka/anaconda3/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
ValueError: invalid hostname: 'Mozilla

My code is below.

The Splash Lua script:

function main(splash, args)
    splash:init_cookies(splash.args.cookies)
    -- autoload jQuery so the page script below can use $()
    local jquery_url = "http://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"
    assert(splash:autoload(jquery_url))
    -- route every request through the proxy passed in via args
    local host = args.proxy_host
    local port = args.proxy_port
    splash:on_request(function(request)
        request:set_proxy{host, port}
    end)
    splash:go{"https://auto.ru"}
    splash:wait(5)
    -- click the confirm button if it is present
    -- (note: a jQuery selector never returns null, so check .length instead)
    splash:runjs([[
      if ($("#confirm-button").length === 0) {
        console.log("button not found")
      } else {
        $("#confirm-button")[0].click()
      }
    ]])
    assert(splash:wait(0.5))

    return {
        html = splash:html()
    }
end

The Scrapy callbacks:

def parse_splash(self, response):
    print(response)

def parse_test(self, response):
    proxy_host = '125.141.200.45' 
    proxy_port = '80'  
    yield SplashRequest(
        response.url,
        callback=self.parse_splash,
        endpoint='execute',
        args={
            "url": response.url,
            "lua_source": btn_click_sc,  # the Lua script above
            "proxy_host": proxy_host,
            "proxy_port": proxy_port,
            "timeout": 60,
            "wait": 15,
            "render_all": 1,
        },
        cache_args=['lua_source'],
        # headers=h,
        priority=10,
        dont_filter=True,
    )

And the settings, taken from the library example:

SPLASH_URL = 'http://127.0.0.1:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
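(Not part of the question, but for completeness: the scrapy-splash README also recommends a Splash-aware dupe filter and cache storage alongside the middlewares above, since without them Scrapy computes request fingerprints for Splash requests incorrectly. A sketch of those extra settings, with class paths taken from the README:)

```python
# Extra settings recommended by the scrapy-splash README (not shown in the
# question); they make Scrapy fingerprint Splash requests correctly.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```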

Also, I don't see any Splash log output in the terminal while the crawler is running.

With python requests I can execute this Splash script successfully.
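For comparison, a minimal sketch of what such a direct call to the Splash /execute HTTP API might look like. The payload keys mirror the args passed to SplashRequest above; the helper name `build_payload` and the trimmed-down Lua script are illustrative assumptions, and actually sending the request requires a running Splash instance:

```python
import json

# Assumed local Splash instance, matching SPLASH_URL in the settings above
SPLASH_URL = "http://127.0.0.1:8050/execute"

# Trimmed-down stand-in for the full Lua script from the question
lua_source = """
function main(splash, args)
    splash:go{args.url}
    splash:wait(5)
    return {html = splash:html()}
end
"""

def build_payload(url, proxy_host, proxy_port):
    """Build the JSON body the /execute endpoint expects (hypothetical helper)."""
    return {
        "lua_source": lua_source,
        "url": url,
        "proxy_host": proxy_host,
        "proxy_port": proxy_port,
        "timeout": 60,
        "wait": 15,
    }

payload = build_payload("https://auto.ru", "125.141.200.45", "80")

# Sending it needs a running Splash container, e.g.:
#   import requests
#   resp = requests.post(SPLASH_URL, json=payload)
#   html = resp.json()["html"]
print(json.dumps(payload)[:60])
```

If this direct call works but SplashRequest does not, the difference usually lies in what scrapy-splash adds to the request on the Scrapy side (headers, middlewares), which is why comparing the two payloads can be informative.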

0 answers:

There are no answers yet.