I am trying to scrape all product names from https://www.walmart.com/search/?query=ps3&cat_id=0 using the Scrapy Python library.
This is my parse function:
def parseWalmart(self, response):
    print("INSIDE PARSE WALMART")
    # Each product tile is a div whose data-tl-id starts with "ProductTileListView-"
    for product in response.xpath('//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]'):
        print(product)
        product_name = product.xpath('.//div[contains(@class,"search-result-product-title listview")]//a//span//text()').extract()
        product_page = product.xpath('.//div[contains(@class,"search-result-product-title listview")]//a/@href').extract()
        # The title is split across several text nodes, so join them
        product_name = " ".join(product_name)
        print(product_name)
        print("-------------------------------------")
And this is my Scrapy request:
yield scrapy.Request(url=i, callback=self.parseWalmart, headers = {"User-Agent":"Mozilla/5.0"})
However, I am only able to scrape 4 products, while the page actually lists 12. I don't understand why. These are the 4 products I scraped:
<Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-0">'>
ABLEGRID Wireless Bluetooth Game Controller for Sony PS3 Black
-------------------------------------
<Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-1">'>
Arsenal Gaming PS3 Wired Controller, Black
-------------------------------------
<Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-2">'>
Refurbished Sony PlayStation 3 Slim 320 GB Charcoal Black Console
-------------------------------------
<Selector xpath='//div[@id="searchProductResult"]/div[@class="search-result-listview-items"]//div[starts-with(@data-tl-id,"ProductTileListView-")]' data='<div data-tl-id="ProductTileListView-3">'>
Sonic's Ultimate Genesis Collection ( PS3 )
-------------------------------------
Answer 0 (score: 0)
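That is because the initial DOM contains only 4 divs whose data-tl-id starts with "ProductTileListView-". The remaining tiles are presumably added later by JavaScript, which Scrapy does not execute. However, you can find the information for all products in a script embedded in the page.
This is how I get all the product information: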
import re
import json
# Capture the JSON array that sits between "items": and ,"secondaryItems"
data = re.findall("\"items\":(.+?),\"secondaryItems\"", response.body.decode("utf-8"), re.S)
products_json = json.loads(data[0])
len(products_json)  # returns 20
Note that the products array starts right after "items": and ends just before "secondaryItems".
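The extraction above can be wrapped in a small helper that does not crash when the markers are missing, for example if the page layout changes. This is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the real response body, and the function name extract_items is hypothetical.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the page source.
# The real page embeds a much larger JSON blob inside a <script> tag.
SAMPLE_HTML = (
    '<script>{"items":['
    '{"title": "Watch Dogs (<mark>PS3</mark>)", "productPageUrl": "/ip/Watch-Dogs-PS3/23422902"},'
    '{"title": "Placeholder product", "productPageUrl": "/ip/placeholder/1"}'
    '],"secondaryItems":[]}</script>'
)

# Same pattern as in the answer: everything between "items": and ,"secondaryItems"
ITEMS_PATTERN = re.compile(r'"items":(.+?),"secondaryItems"', re.S)


def extract_items(html):
    """Return the embedded product list, or an empty list if it cannot be found."""
    match = ITEMS_PATTERN.search(html)
    if match is None:
        return []
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was not valid JSON (e.g. the lazy match stopped too early)
        return []


items = extract_items(SAMPLE_HTML)
print(len(items))
```

Inside a spider you would presumably call it as extract_items(response.text) in the callback.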
The structure of one product:
{
"productId": "2H53I08Z1K78",
"usItemId": "23422902",
"productType": "REGULAR",
"title": "Watch Dogs (<mark>PS3</mark>)",
....
"imageUrl": "https://i5.walmartimages.com/asr/70aecbb1-5dbf-4a64-a86d-134a8fc7edee_2.59805d79db07665c20cc4e4fadc35743.jpeg?odnHeight=180&odnWidth=180&odnBg=ffffff",
"productPageUrl": "/ip/Watch-Dogs-PS3/23422902",
"upc": "0000888834804",
}
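Since the title contains <mark> highlighting tags and productPageUrl is a relative path, you will probably want to clean both before using them. A minimal sketch, assuming the dict has the fields shown above; clean_product and BASE_URL are hypothetical names introduced for illustration:

```python
import re
from urllib.parse import urljoin

# Assumed site root used to resolve the relative productPageUrl
BASE_URL = "https://www.walmart.com"


def clean_product(product):
    """Return the product name without <mark> tags and an absolute page URL."""
    # Remove the <mark>...</mark> search-highlight tags but keep their text
    name = re.sub(r"</?mark>", "", product.get("title", ""))
    url = urljoin(BASE_URL, product.get("productPageUrl", ""))
    return {"name": name, "url": url}


# Fields copied from the sample product above
sample = {
    "title": "Watch Dogs (<mark>PS3</mark>)",
    "productPageUrl": "/ip/Watch-Dogs-PS3/23422902",
}
cleaned = clean_product(sample)
print(cleaned["name"])
print(cleaned["url"])
```

In a Scrapy callback, response.urljoin(product["productPageUrl"]) should give the same absolute URL without hard-coding the site root.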