Question

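I am scraping a page that has 36 hrefs under the same <div> XPath, but when I try to extract them, even in the Scrapy shell, I only get the same 12 hrefs every time, and they are in the wrong order.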

I am doing it this way: response.xpath('/html/body/div[1]/div[2]/section/div/div[3]/div[2]/div/div[2]//div//article//div[1]//a[re:test(@href, "pd")]//@href').getall()

From this page: https://www.lowes.com/pl/Bottom-freezer-refrigerators-Refrigerators-Appliances/4294789499?offset=36

Answer 1

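It seems that part of the HTML is loaded dynamically, so Scrapy cannot see it. The data itself is present in the HTML, inside a JSON structure. You can try to get it like this: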

import json

# Get the text of the script that contains the data
json_data = response.xpath('//script[contains(text(), "__PRELOADED_STATE__")]/text()').extract_first()
# Load everything after the assignment into a Python dictionary
dict_data = json.loads(json_data.split('window.__PRELOADED_STATE__ =')[-1])
items = dict_data['itemList']
print(len(items))  # prints 36 in my case
# Go through the items and get the product URLs
for item in items:
    product_url = item['product']['pdURL']
    ...
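
If the script text has anything after the JSON object (a trailing semicolon or more JavaScript, for example), a plain json.loads on the split result would fail. The following is a minimal sketch of one way to tolerate that, using json.JSONDecoder().raw_decode from the standard library. The helper name extract_preloaded_state and the sample script string are made up for illustration; only the keys itemList, product and pdURL come from the snippet above, and the real page may be structured differently.

```python
import json

MARKER = "window.__PRELOADED_STATE__ ="


def extract_preloaded_state(script_text):
    """Return the JSON value assigned to window.__PRELOADED_STATE__.

    Hypothetical helper: script_text is the text of the <script> element.
    """
    _, found, tail = script_text.partition(MARKER)
    if not found:
        raise ValueError("marker not found in script text")
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it (a trailing ';' or further JavaScript).
    data, _ = json.JSONDecoder().raw_decode(tail.lstrip())
    return data


# Made-up sample standing in for the real script contents.
sample_script = (
    'window.__PRELOADED_STATE__ = {"itemList": ['
    '{"product": {"pdURL": "/pd/example-a/1"}}, '
    '{"product": {"pdURL": "/pd/example-b/2"}}]};'
)

state = extract_preloaded_state(sample_script)
product_urls = [item["product"]["pdURL"] for item in state["itemList"]]
print(product_urls)
```

In a spider, the script_text argument would be the string returned by the response.xpath(...) call above, after checking that it is not None.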
