Question

I don't understand why the following doesn't work. I know there are related answers, but they didn't help me.

$ scrapy shell "http://edition.cnn.com"

The page contains an h2 tag with the text "CNN Money". Why doesn't the following work?

>>> response.xpath('//h2[contains(string(), "CNN Money")]')
[]

I also tried text():

>>> response.xpath('//h2[contains(text(), "CNN Money")]')
[]

Answer 1

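The XPath expression you are using is not the problem. The problem is that the page content is delivered dynamically, e.g. by JavaScript. Check for yourself: search for CNN Money in the page source. You won't find any hits, so the h2 element does not exist in the HTML that Scrapy downloads. You need to render the page and parse the rendered output. For this I suggest using Splash together with the scrapy-splash library.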
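The "search the page source" check can be sketched in a few lines of Python. The RAW_HTML string and the appears_in_source helper below are hypothetical stand-ins, not CNN's real markup: they only illustrate a page where the heading text is assembled by a script and therefore never appears in the unrendered source.

```python
# Hypothetical stand-in for the raw HTML that Scrapy downloads: the heading
# is built by JavaScript at load time, so no <h2> with that text exists yet.
RAW_HTML = """
<html><body>
<div id="content"></div>
<script>
  var title = ["CNN", "Money"].join(" ");
  document.getElementById("content").innerHTML = "<h2>" + title + "</h2>";
</script>
</body></html>
"""


def appears_in_source(html: str, needle: str) -> bool:
    """Return True if needle occurs anywhere in the unrendered page source."""
    return needle in html


# The text is only produced when the script runs, so the raw source has no match.
print(appears_in_source(RAW_HTML, "CNN Money"))
```

Inside the Scrapy shell, the equivalent check would be `"CNN Money" in response.text`. If that is False, no XPath expression against the raw response can match the heading.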

Edit

Run Splash with this command:

docker run -d -p 8050:8050 --restart=always scrapinghub/splash --max-timeout 3600

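This raises the maximum timeout that Splash allows for a request. (See the Splash documentation for other options on running it in production.) You also need to raise the timeout field in the args parameter passed to SplashRequest, e.g.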

yield scrapy_splash.SplashRequest(url, self.parse, endpoint='render.json', args={'timeout': 3600})
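For SplashRequest to be routed through the Splash container, the Scrapy project also needs the scrapy-splash middlewares enabled. The following is a sketch of the settings.py entries as given in the scrapy-splash README. The SPLASH_URL value is an assumption that Splash is reachable on localhost at port 8050, matching the docker command above; adjust it to your setup.

```python
# settings.py (sketch; values follow the scrapy-splash README)

# Assumption: Splash runs locally on the port published by the docker command.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep compression handling after it.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Deduplicate repeated Splash arguments to save bandwidth and disk space.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make request fingerprinting aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these settings in place, the same XPath queries from the question can be run against the rendered response in the spider callback.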

Scraping nodes with specific text using Scrapy and XPath
