Scraping items using Scrapy

Time: 2017-04-15 07:11:38

Tags: python xpath scrapy web-crawler scrapy-spider

I wrote the following spider to scrape patient reviews from the webmd website:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class MySpider(BaseSpider):
    name = "webmd"
    allowed_domains = ["webmd.com"]
    start_urls = ["http://www.webmd.com/drugs/drugreview-92884-Boniva"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        title = titles.select("//p[contains(@class, 'comment') and contains(@style, 'display:none')]/text()").extract()
        print(title)

Running this code gives the desired output, but with many duplicates: the same comment is repeated at least 10 times. Help me fix this.

2 Answers:

Answer 0 (score: 3)

You can rewrite your spider like this:

import scrapy

# Your Items 
class ReviewItem(scrapy.Item):
    review = scrapy.Field()


class WebmdSpider(scrapy.Spider):
    name = "webmd"
    allowed_domains = ["webmd.com"]
    start_urls = ['http://www.webmd.com/drugs/drugreview-92884-Boniva']

    def parse(self, response):
        titles = response.xpath('//p[contains(@id, "Full")]')
        for title in titles:
            item = ReviewItem()
            item['review'] = title.xpath('text()').extract_first()
            yield item

        # Follow the "Next" page link, if there is one, and keep parsing
        next_page = response.xpath('(//a[contains(., "Next")])[1]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

This selects only the full customer reviews, without duplicates, and stores them in Scrapy Items. Note: instead of HtmlXPathSelector you can use the more convenient response.xpath shortcut. Also, I changed the deprecated scrapy.BaseSpider to scrapy.Spider.

To save the reviews in CSV format, just use Scrapy Feed exports and type in the console: scrapy crawl webmd -o reviews.csv
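If you'd rather not pass -o on every run, the same export can be configured in the project's settings.py instead (a minimal sketch; FEED_URI and FEED_FORMAT were the relevant settings in Scrapy 1.x, while newer releases use the FEEDS dict):

```python
# settings.py -- equivalent to "scrapy crawl webmd -o reviews.csv"
# (Scrapy 1.x style; in Scrapy >= 2.1 prefer the FEEDS dict shown below)
FEED_URI = 'reviews.csv'
FEED_FORMAT = 'csv'

# Scrapy >= 2.1 equivalent:
# FEEDS = {'reviews.csv': {'format': 'csv'}}
```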

Answer 1 (score: 2)

You can use sets to get unique reviews. As you probably know, the selector returns its results as a list, so if you pass that list through a set you will be left with only unique results. So:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select("//p")
    title = set(titles.select("//p[contains(@class, 'comment') and contains(@style, 'display:none')]/text()").extract())
    print(title)  # this will contain only unique results
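One caveat worth noting: a set discards the original order of the reviews. If you need to deduplicate while keeping the order in which the reviews appeared on the page, dict.fromkeys is a common alternative (insertion order is guaranteed in Python 3.7+; the sample strings below are made up for illustration):

```python
# Deduplicate extracted strings while preserving first-seen order.
reviews = ["Great drug.", "Caused nausea.", "Great drug.", "Great drug."]

# dict.fromkeys keeps only the first occurrence of each key, in order
unique_ordered = list(dict.fromkeys(reviews))
print(unique_ordered)  # → ['Great drug.', 'Caused nausea.']
```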