I am trying to scrape a website whose content is generated by JavaScript code.
I have installed Scrapy and Splash.
Splash is running with this command:
sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles:/etc/splash/proxy-profiles scrapinghub/splash
2015-08-21 07:21:06+0000 [-] Log opened.
2015-08-21 07:21:06.483344 [-] Splash version: 1.7
2015-08-21 07:21:06.490230 [-] Qt 4.8.1, PyQt 4.9.1, WebKit 534.34, sip 4.13.2, Twisted 15.2.1, Lua 5.2
2015-08-21 07:21:06.490505 [-] Open files limit: 524288
2015-08-21 07:21:06.490745 [-] Open files limit increased from 524288 to 1048576
2015-08-21 07:21:06.699607 [-] Xvfb is started: ['Xvfb', ':1087', '-screen', '0', '1024x768x24']
2015-08-21 07:21:06.808450 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2015-08-21 07:21:06.929580 [-] verbosity=1
2015-08-21 07:21:06.929964 [-] slots=50
2015-08-21 07:21:06.930484 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Proxy Server: enabled
2015-08-21 07:21:06.931420 [-] Site starting on 8050
2015-08-21 07:21:06.931640 [-] Starting factory <twisted.web.server.Site instance at 0x1b5b3f8>
2015-08-21 07:21:06.938232 [-] SplashProxyServerFactory starting on 8051
2015-08-21 07:21:06.938468 [-] Starting factory <splash.proxy_server.SplashProxyServerFactory instance at 0x1b5bcf8>
When I try to fetch the page source, render.html returns "Javascript is not enabled. Please enable JavaScript in your browser".
import scrapy

class xxxxxSpider(scrapy.Spider):
    start_urls = ["xxxxx"]
    name = "sahibinden"

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5, 'proxy': 'xxxxx'}
                }
            })

    def parse(self, response):
        with open("result.txt", "a") as myfile:
            myfile.write(str(response.css('body').extract()))
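To narrow this down, the same page can be requested from Splash directly, without Scrapy in between. The following is only a minimal sketch, assuming Splash listens on port 8050 as in the docker command above. `build_render_url` is a hypothetical helper name, and the target URL is a placeholder. It relies on render.html accepting `url`, `wait` and `proxy` as GET query parameters.

```python
from urllib.parse import urlencode


def build_render_url(splash_url, target_url, wait=0.5, proxy=None):
    """Build a GET URL for Splash's render.html endpoint (hypothetical helper)."""
    params = {"url": target_url, "wait": wait}
    if proxy:
        # Same value that the spider passes in 'args' (placeholder here).
        params["proxy"] = proxy
    return splash_url.rstrip("/") + "/render.html?" + urlencode(params)


# Placeholder target; open the printed URL in a browser or with curl.
print(build_render_url("http://localhost:8050/", "http://example.com/"))
```

If the page fetched this way also shows the "Javascript is not enabled" message, the spider code is probably not the cause.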
All the settings are in place:
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
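I scraped the website successfully once. After that I started getting the "Javascript is not enabled in your browser" error.
In case it helps, this is the Splash output when I render the page: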
2015-08-21 08:06:09.838076 [-] "172.17.42.1" - - [21/Aug/2015:08:06:09 +0000] "POST /render.html HTTP/1.1" 200 4048 "-" "Scrapy/1.0.3.post1+g83a06ed (+http://scrapy.org)"
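I can't figure out what the problem is. Any help?
More info:
I deleted the virtual machine, so the IP address changed, and then I tried again. The first request succeeded, but the second request returned nothing. I think the website is blocking my IP address.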
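Since the first request succeeds and later ones do not, it may be easier to detect the failure explicitly than to inspect result.txt by hand. This is a rough sketch, assuming the failing response contains the message quoted above. `BLOCK_MARKER` and `looks_blocked` are hypothetical names, and the marker wording is a guess that should be adjusted to what the site actually returns.

```python
# Assumed wording of the site's message; adjust to the real response body.
BLOCK_MARKER = "javascript is not enabled"


def looks_blocked(html):
    """Return True if the rendered HTML looks like the 'enable JavaScript' page."""
    return BLOCK_MARKER in html.lower()


sample = "<body>Javascript is not enabled. Please enable JavaScript in your browser</body>"
print(looks_blocked(sample))
```

Inside `parse`, something like `looks_blocked(response.text)` could be used to log the failing URL instead of appending the body to the file.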