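I have a working scraper that I built to collect information from a review site. The problem I'm running into is that when I fetch a business page containing multiple reviews and try to yield items, I only get the first item n times (where n is the number of reviews found by the selector).
I've read a lot about generators, and I'm sure this comes down to me not thinking about the problem correctly. Here is a simplified snippet. Bear in mind that my real crawler is more complex and uses callbacks and so on, but this code reproduces the behavior I'm talking about.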
from scrapy import Spider
from scrapy.selector import Selector
from yelp.items import ReviewItem


class CategorySpider(Spider):
    name = "yelp_search_"
    allowed_domains = ["yelp.com"]
    start_urls = ["http://www.yelp.com/biz/j-crew-arden"]

    def parse(self, response):
        sel = Selector(response)
        # There are 9 reviews on this page
        reviews_info = sel.xpath('//div[contains(@class, "review review--with-sidebar") and @itemprop="review"]')
        for reviewSelector in reviews_info:
            # If I print the extracted review selector here, I can confirm that only the first review selector is being used.
            # In other words, I expect extract_first() to extract the one and only result within the reviewSelector.
            # Note: if I just call extract(), the item property is populated with a list of values from all 9 reviews,
            # i.e. a list of 9 usernames given to me 9 times.
            reviewitem = ReviewItem()
            reviewitem["username"] = reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()
            reviewitem["userprofileurl"] = reviewSelector.xpath('//*[@class="user-display-name"]/@href').extract_first()
            reviewitem["userlocation"] = reviewSelector.xpath('//*[contains(@class, "user-location responsive-hidden-small")]/text()').extract_first().strip()
            reviewitem["reviewtext"] = reviewSelector.xpath('//*[@itemprop="description"]/@content').extract_first()
            reviewitem["reviewrating"] = reviewSelector.xpath('//*[@itemprop="ratingValue"]/@content').extract_first()
            reviewitem["reviewdate"] = reviewSelector.xpath('//*[@itemprop="datePublished"]/@content').extract_first()
            reviewitem["reviewvotesuseful"] = reviewSelector.xpath('//a[@rel="useful"]/span[@class="count"]/text()').extract_first()
            yield reviewitem
This particular code gives me 9 scraped results, but every one of them contains the data from the first reviewSelector.
What am I doing wrong here?
Answer 0 (score: 1)
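Once you have the "sub-selector" reviewSelector, you need to put a . at the start of the XPath so that it is evaluated relative to that sub-selector. An expression that begins with // is evaluated against the whole document, no matter which selector you call it on, so every iteration matches the first review on the page.
That is, this: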
reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()
should be:
reviewSelector.xpath('.//*[@itemprop="author"]/@content').extract_first()