Scrapy response incomplete

Time: 2016-02-12 19:25:34

Tags: python web-scraping scrapy web-crawler

I tried to crawl the following URL using Scrapy: http://www.walgreens.com/search/results.jsp?Ntt=bounty+paper+towel

but the returned response is incomplete. When I run

scrapy shell the_url_above

then

view(response)

the webpage doesn't load completely. So my questions are:

  1. What is the cause of this problem? (Why did I get an incomplete response rather than a 404?)
  2. What are some potential ways to handle it?

1 answer:

Answer 0 (score: 4)

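The page's data appears to be loaded with JavaScript. If you inspect the page (e.g., with the Firebug network tab), you will see that once the base page has loaded, the products are fetched via JavaScript with a POST request to http://www.walgreens.com/svc/products/search whose body is: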

{"p":"1",  # seems to be page number
"s":"15",  # page size
"sort":"relevance",
"view":"allView",
"geoTargetEnabled":false,
"q":"bounty paper towel",  # search query
"requestType":"search",
"deviceType":"desktop"}

You can send this request with Scrapy:

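import json
from scrapy import Request
# Inside a spider callback; `payload` is the JSON above as a Python dict (false -> False, comments dropped)
yield Request('http://www.walgreens.com/svc/products/search',
              method='POST',
              body=json.dumps(payload))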

You should receive a JSON object full of product data.

You can even view the response in your browser via this link: http://www.walgreens.com/svc/products/search?p=1&s=15&sort=relevance&view=allView&geoTargetEnabled=false&q=bounty%20paper%20towel&requestType=search&deviceType=desktop
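The browser link is just the same parameters moved into the query string, so it can be rebuilt for other search terms. A minimal sketch using only the standard library (the `build_search_url` helper and its defaults are hypothetical names, not part of any site API):

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "http://www.walgreens.com/svc/products/search"


def build_search_url(query, page=1, page_size=15):
    """Build the GET form of the search request for a given query."""
    params = {
        "p": page,            # seems to be the page number
        "s": page_size,       # page size
        "sort": "relevance",
        "view": "allView",
        "geoTargetEnabled": "false",
        "q": query,           # search query
        "requestType": "search",
        "deviceType": "desktop",
    }
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return SEARCH_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url("bounty paper towel"))
```

For the query "bounty paper towel" this reproduces the link above, which makes it easy to check the endpoint by hand before wiring it into a spider.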