Question

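I'd like to scrape a URL that looks like this: https://steamcommunity.com/market/search?appid=730#p1_popular_desc

Since the end of the URL is dynamic, I build a list of URLs in parse and then loop over it to issue the requests.

The problem is that Scrapy cuts the URL off after appid=730, so every URL looks the same. If I switch to dont_filter=True, I can see it looping over page 1 again and again. I can't figure out what's wrong :(

The "x" in the code will be made dynamic later (it needs to come from start_url); I don't think it's relevant to the problem.

It seems Scrapy always crawls the referer URL rather than the URL I give it. The URL is not supposed to end at 730.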

Debug messages:

...

2019-03-28 23:44:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://steamcommunity.com/market/search?appid=730> (referer: None)

2019-03-28 23:44:37 [scrapy.core.engine] DEBUG: Crawled (200) **<GET https://steamcommunity.com/market/search?appid=730#p7_popular_desc> (referer: https://steamcommunity.com/market/search?appid=730)**

...

2019-03-28 23:44:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/market/search?appid=730>
{'item_count': u'7,899',
 'item_name': u'Prisma Case',
 'item_price': u'$2.79 USD',
 'item_subtext': u'Counter-Strike: Global Offensive'}
2019-03-28 23:44:37 [scrapy.core.scraper] DEBUG: **Scraped from <200 https://steamcommunity.com/market/search?appid=730>**
{'item_count': u'192,519',
 'item_name': u'Danger Zone Case',
 'item_price': u'$0.30 USD',
 'item_subtext': u'Counter-Strike: Global Offensive'}

allowed_domains = ['steamcommunity.com/market']
start_urls = ['https://steamcommunity.com/market/search?appid=730']

def parse(self, response):
    x = 15 
    steam_xpath = [u'//steamcommunity.com/market/search?appid=730#p'+str(i)+'_popular_desc' for i in range(1, x)]
    for link in steam_xpath:
        yield Request(response.urljoin(link), self.parse_steam, dont_filter=True)

def parse_steam(self, response):
    xitem_name = response.xpath('//span[@class="market_listing_item_name"]/text()').extract()
    xitem_price = response.xpath('//span[@class="normal_price"]/text()').extract()
    xitem_subtext = response.xpath('//span[@class="market_listing_game_name"]/text()').extract()
    xitem_count = response.xpath('//span[@class="market_listing_num_listings_qty"]/text()').extract()
    for item in zip(xitem_name, xitem_price, xitem_subtext, xitem_count):
        new_item = SteammarketItem()
        new_item['item_name'] = item[0]
        new_item['item_price'] = item[1]
        new_item['item_subtext'] = item[2]
        new_item['item_count'] = item[3]
        yield new_item

Expected: 150 results, 10 for each URL in the loop.

Actual: the first URL's 10 results, repeated for every request in the loop.

Answer 1

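The URL in the address bar changes as you describe, but if you check the requests in the Network tab of your browser's developer tools, you will see that the request that returns new items looks like this: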

https://steamcommunity.com/market/search/render/?query=&start=0&count=10&search_descriptions=0&sort_column=popular&sort_dir=desc&appid=730
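To page through results with this endpoint, one option is to generate one URL per page by varying the start parameter. The following is a minimal sketch, not a confirmed description of the API: it assumes that start is an item offset that advances by count for each page (inferred from the parameter names), and build_page_urls is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "https://steamcommunity.com/market/search/render/"


def build_page_urls(pages, page_size=10, appid=730):
    """Return one render URL per page.

    Assumption: `start` is an item offset, so page n begins at n * page_size.
    """
    urls = []
    for page in range(pages):
        params = {
            "query": "",
            "start": page * page_size,
            "count": page_size,
            "search_descriptions": 0,
            "sort_column": "popular",
            "sort_dir": "desc",
            "appid": appid,
        }
        # urlencode keeps the insertion order of the dict.
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = build_page_urls(15)
```

In a spider, each of these URLs could be passed to its own Request from parse. Because the query strings differ, the requests are distinct and dont_filter=True should no longer be needed.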

This JSON contains the page HTML in the field results_html; if you want to extract the data with XPath, you can create a selector from this value.

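import json

import scrapy

def parse(self, response):
    # The response body is JSON; the item markup is in 'results_html'.
    data = json.loads(response.text)
    sel = scrapy.Selector(text=data['results_html'])
    # Query sel like a normal response; '//value' is a placeholder XPath.
    value = sel.xpath('//value').get()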

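Reading the response for that URL, you may also notice a tip saying that you can add the parameter &norender=1 to the URL and skip the HTML entirely. So it is up to you to choose whichever suits you best.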
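If you go the norender route, the parameter can be appended without hand-editing the query string. This is only a sketch using the standard library; with_norender is a hypothetical helper name, and the shape of the JSON returned with norender=1 is not shown in this thread, so inspect the response before relying on any field names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_norender(url):
    """Return `url` with norender=1 appended to its query string."""
    parts = urlsplit(url)
    # keep_blank_values preserves empty parameters such as "query=".
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("norender", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


render_url = (
    "https://steamcommunity.com/market/search/render/"
    "?query=&start=0&count=10&search_descriptions=0"
    "&sort_column=popular&sort_dir=desc&appid=730"
)
json_only_url = with_norender(render_url)
```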

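Many websites do this, so you have to pay close attention to the actual requests and not always trust what the address bar shows. I'd even suggest not trusting what the Inspector shows and always checking the source (right-click > View Page Source).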
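As an illustration of why the address bar is misleading in this case: the part of a URL after # is a fragment, which the browser keeps on the client side and does not send to the server. The sketch below uses the standard library to show that the page URLs from the question all reduce to the same request URL once the fragment is removed. As far as I know, Scrapy's default duplicate filter also ignores the fragment, which would be consistent with the filtered requests described in the question.

```python
from urllib.parse import urldefrag

# The URLs the question generates: they differ only in the fragment.
page_urls = [
    "https://steamcommunity.com/market/search?appid=730#p%d_popular_desc" % i
    for i in range(1, 4)
]

# urldefrag splits off the fragment; .url is what is actually requested.
requested = {urldefrag(u).url for u in page_urls}

for u in sorted(requested):
    print(u)
```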
