Scrapy: iterating over selectors yields n duplicate items, one for each selector found on the page

Time: 2016-07-29 12:12:46

Tags: python scrapy generator

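I have a working scraper that I built to collect information from a review site. The problem is that when I scrape a business page containing multiple reviews and try to yield one item per review, I get only the first item, n times (where n is the number of reviews the selector found).

I have read a lot about generators, and I am sure this happens because I am not thinking about the problem correctly. Here is a simplified snippet. Note that my real crawler is more complex and uses callbacks and so on, but this code reproduces the behavior I am describing.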

from scrapy import Spider
from scrapy.selector import Selector
from yelp.items import ReviewItem

class CategorySpider(Spider):
    name = "yelp_search_"
    allowed_domains = ["yelp.com"]

    start_urls = ["http://www.yelp.com/biz/j-crew-arden"]

    def parse(self, response):
        sel = Selector(response)

        # There are 9 particular reviews on this page
        reviews_info = sel.xpath('//div[contains(@class, "review review--with-sidebar") and @itemprop="review"]')
        for reviewSelector in reviews_info:
            # If I print the extracted review selector here, I can confirm that only the first review selector is being used.
            # In other words, I expect extract_first() to extract the one and only result within reviewSelector.
            # Note: if I just call extract(), the item property is populated with a list of results from all 9 reviewSelectors,
            # i.e. a list of 9 usernames given to me 9 times.
            reviewitem = ReviewItem()
            reviewitem["username"] = reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()
            reviewitem["userprofileurl"] = reviewSelector.xpath('//*[@class="user-display-name"]/@href').extract_first()
            reviewitem["userlocation"] = reviewSelector.xpath('//*[contains(@class, "user-location responsive-hidden-small")]/text()').extract_first().strip()
            reviewitem["reviewtext"] = reviewSelector.xpath('//*[@itemprop="description"]/@content').extract_first()
            reviewitem["reviewrating"] = reviewSelector.xpath('//*[@itemprop="ratingValue"]/@content').extract_first()
            reviewitem["reviewdate"] = reviewSelector.xpath('//*[@itemprop="datePublished"]/@content').extract_first()
            reviewitem["reviewvotesuseful"] = reviewSelector.xpath('//a[@rel="useful"]/span[@class="count"]/text()').extract_first()
            yield reviewitem

This particular code gives me 9 scraped results, but all of them come from the first reviewSelector.

What am I doing wrong here?

1 Answer:

Answer 0 (score: 1)

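Once you have a "sub-selector" such as reviewSelector, you need to prefix the XPath with . to make it relative to that sub-selector. Without the dot, a path starting with // searches from the document root, so every iteration matches the first review on the page.

That is, this: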

reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()

should be:

reviewSelector.xpath('.//*[@itemprop="author"]/@content').extract_first()