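I'm working through a tutorial on Scrapy Splash in VS Code. I can't parse the JavaScript on http://quotes.toscrape.com/js/. I have Splash running on localhost port 8050, which I pulled from Docker using the following command: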
docker run -p 8050:8050 scrapinghub/splash --disable-private-mode
scrapy-splash is installed in the root of the Scrapy project.
settings.py is:
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
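For reference, one way to check the Splash side independently of Scrapy is to call its `render.html` HTTP endpoint directly. The following is a minimal sketch using only the standard library; the helper names `build_render_url` and `fetch_rendered_html` are made up for illustration, and the `wait` value of 0.5 seconds is an assumption, not something taken from the tutorial.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # same value as SPLASH_URL in settings.py


def build_render_url(page_url, wait=0.5, splash_url=SPLASH_URL):
    """Build the Splash render.html URL for page_url (hypothetical helper)."""
    # 'url' and 'wait' are query arguments of Splash's render.html endpoint.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5, timeout=30):
    """Ask Splash to render page_url and return the HTML as text.

    Needs the Splash container to be running and reachable at SPLASH_URL.
    """
    with urlopen(build_render_url(page_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example (performs a network request, so it is left commented out):
# html = fetch_rendered_html("http://quotes.toscrape.com/js/")
# print('class="quote"' in html)
```

If the commented-out call returns HTML that contains the quote markup, Splash itself is probably rendering the page and the problem is more likely on the Scrapy side.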
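I have stopped and restarted the Docker image, both with and without private browsing enabled. I added a wait to the start request, and still got nothing. I have built the spider from scratch and also copied it from the examples.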
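The spider code in this post does not show the wait, so here is a rough sketch of what adding one might look like. `SplashRequest` accepts an `args` dict that is passed on to Splash, and `wait` is a `render.html` argument. The helper name `splash_request_kwargs` and the 0.5 second default are assumptions for illustration only.

```python
def splash_request_kwargs(url, wait_seconds=0.5):
    """Keyword arguments for scrapy_splash.SplashRequest (hypothetical helper).

    'wait' asks Splash to pause after the page loads so that
    JavaScript has time to run before the HTML is returned.
    """
    return {
        "url": url,
        "endpoint": "render.html",
        "args": {"wait": wait_seconds},
    }


# Inside the spider's start_requests (requires scrapy_splash to be installed):
#     yield SplashRequest(callback=self.parse, **splash_request_kwargs(url))
```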
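The code works fine on the plain HTML page, but when I use the JS version via SplashRequest I get nothing. This is the "hello world" of scraping, and I would really like to know where or what I am doing wrong. I suspect it is something silly and obvious, but I can't see it. I am using VS Code, so maybe something in my setup is causing this, but I am using a venv.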
import scrapy
from scrape_douglas.scrape_douglas.items import QuoteItem
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class DougrSpider(scrapy.Spider):
    name = 'dougr'
    allowed_domains = ['toscrape.com']
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = {'author': quote.css("small.author::text").extract_first(),
                    'text': quote.css("span.text::text").extract_first(),
                    'tag': quote.css("a.tag::text").extract()}
            yield item
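To tell "the selectors are wrong" apart from "the page was never rendered", it may help to count the quote containers in whatever HTML the callback actually receives. This is a minimal standard-library sketch; `QuoteDivCounter` and `count_quote_divs` are hypothetical names, and it assumes the rendered page marks each quote with a `div` whose class list includes `quote`, as the `div.quote` selector above implies.

```python
from html.parser import HTMLParser


class QuoteDivCounter(HTMLParser):
    """Counts <div> start tags whose class attribute includes 'quote'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.count += 1


def count_quote_divs(html):
    """Return how many div.quote elements appear in the given HTML text."""
    parser = QuoteDivCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Possible use inside parse():
#     self.logger.info("div.quote count: %d", count_quote_divs(response.text))
```

A count of zero would suggest the response is the unrendered page, where the markup has not yet been written by the script.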
Here is the log file:
2019-08-14 15:08:28+0000 [-] Log opened.
2019-08-14 15:08:28.371179 [-] Splash version: 3.3.1
2019-08-14 15:08:28.371458 [-] Qt 5.9.1, PyQt 5.9.2,
WebKit 602.1,sip 4.19.4, Twisted 18.9.0, Lua 5.2
2019-08-14 15:08:28.371539 [-] Python 3.5.2
(default, Nov 12 2018,13:43:14) [GCC 5.4.0 20160609]
2019-08-14 15:08:28.371611 [-] Open files limit: 1048576
2019-08-14 15:08:28.371658 [-] Can't bump open files limit
2019-08-14 15:08:28.474871 [-] Xvfb is started:
['Xvfb', ':920282986', '-screen', '0',
'1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set,
defaulting to '/tmp/runtime-root'
2019-08-14 15:08:28.538650 [-] proxy profiles support is enabled,
proxy profiles path: /etc/splash/proxy-profiles
2019-08-14 15:08:28.538921 [-] memory cache: enabled,
private mode: disabled, js cross-domain access: disabled
2019-08-14 15:08:28.632635 [-] verbosity=1, slots=20,
argument_cache_max_entries=500, max-timeout=90.0
2019-08-14 15:08:28.633557 [-] Web UI: enabled,
Lua: enabled (sandbox: enabled)
2019-08-14 15:08:28.633904 [-] Site starting on 8050
2019-08-14 15:08:28.633998 [-] Starting factory
<twisted.web.server.Site object at 0x7f75cf214cc0>
2019-08-14 15:08:28.634273 [-] Server listening on
http://0.0.0.0:8050
2019-08-14 15:52:58.379703 [-] "xxxx.xx.x.x" - -
[14/Aug/2019:15:52:57 +0000]
"GET / HTTP/1.1" 200 7679 "-"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0"
2019-08-14 15:52:58.415976 [-] "xxx.xx.x.x" - -
[14/Aug/2019:15:52:57 +0000]
"GET /_ui/style.css HTTP/1.1" 200 2591 "http://localhost:8050/"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0"