Question

Page source:

<html>
<title>Example Web</title>
<script>

$(document).ready(function(){
    document.getElementById('output').value = "Hi There"
}
)

</script>

<body>
<div id='output'></div>
</body>
</html>

As expected, the page DOM after loading will be:

<html>
<title>Example Web</title>
<script>

$(document).ready(function(){
    document.getElementById('output').value = "Hi There"
}
)

</script>

<body>
<div id='output'>Hi There</div>
</body>
</html>

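It seems that when Scrapy crawls a site, the response is the page source, not the page DOM. How can I make Scrapy request the page DOM so that I can extract the "Hi There" string from the body?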

Answer 1

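You cannot make Scrapy request the page DOM instead of the page source, because Scrapy is not a browser and therefore cannot execute JavaScript. It simply builds an element tree from the response it receives.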
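To illustrate the point, here is a minimal sketch that parses the page source from the question without a browser. It uses Python's standard-library `html.parser` as a stand-in for Scrapy's selectors (an assumption made to keep the example dependency-free; `OutputDivText` and `extract_output_text` are hypothetical names). Because the script is never executed, the `output` div is still empty.

```python
from html.parser import HTMLParser

# The page source from the question, as a non-browser client receives it.
PAGE_SOURCE = """<html>
<title>Example Web</title>
<script>
$(document).ready(function(){
    document.getElementById('output').value = "Hi There"
})
</script>
<body>
<div id='output'></div>
</body>
</html>"""


class OutputDivText(HTMLParser):
    """Collects the text found inside <div id="output">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "output":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script contents also arrive here, but only text inside the div is kept.
        if self.inside:
            self.text += data


def extract_output_text(html):
    """Return the text of the output div, parsed statically (no JavaScript)."""
    parser = OutputDivText()
    parser.feed(html)
    parser.close()
    return parser.text


# The parser only sees the markup that was sent, so the div has no text.
static_text = extract_output_text(PAGE_SOURCE)
```

Feeding the same function the rendered markup (`<div id='output'>Hi There</div>`) would return the string, which is the difference between source and DOM that the question runs into.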

For reference, see the Google Group discussion on Scrapy supporting JavaScript (https://groups.google.com/forum/#!topic/scrapy-users/tOVH-X7H3DI) and another StackOverflow discussion on the same topic.

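However, you may consider using the ScrapyJS middleware from ScrapingHub, which sends requests through the Splash rendering service so that the response contains the JavaScript-rendered page.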
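As a rough sketch of what enabling that middleware might look like: ScrapyJS was later renamed scrapy-splash, and the fragment below follows the settings that project documents. The Splash address is an assumption (a Splash instance running locally on port 8050), and the exact values should be checked against the version you install.

```python
# settings.py fragment (a sketch; assumes the scrapy-splash package is installed)

# Assumed address of a running Splash rendering service.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep response compression handling in order.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments on disk-queued requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these settings in place, a spider would yield `scrapy_splash.SplashRequest(url, callback, args={"wait": 0.5})` instead of a plain `scrapy.Request`, and the callback would receive the rendered HTML rather than the raw source.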

Scrapy: load 'ready-ed' DOM instead of Source
