Scrapy outputs all results to a single CSV row

Time: 2014-10-24 17:21:22

Tags: python scrapy

This is similar to this, but the answer there doesn't work for me. It is actually a follow-up to my earlier question csv output woes. With dreyescat's help I was able to get my CrawlSpider to output to CSV. However, it now prints only two columns (corresponding to my two fields) and a single row (dumping all results into the respective columns). I recreated the example dreyescat gave me from Hacker News, and it works perfectly; that is what I am trying to replicate.

Here is my code (copied almost verbatim from that Hacker News example):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.items import TargetsItem

class MySpider(CrawlSpider):   
    name = 'reuters'
    allowed_domains = ['blogs.reuters.com']
    start_urls = [
        'http://blogs.reuters.com/us/'
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=('blogs.reuters.com', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract()
        item['link'] = response.xpath('//h2/a/@href').extract()
        return item

The (edited) output from the console looks like this:

2014-10-24 13:04:04-0400 [reuters] DEBUG: Scraped from <200 http://blogs.reuters.com/hugo-dixon/> {'link': [u'//blogs.reuters.com/hugo-dixon/2014/10/20/markets-right-to-worry-about-euro-zone/', u'//blogs.reuters.com/hugo-dixon/2014/10/13/italy-has-no-good-plan-b/', u'//blogs.reuters.com/hugo-dixon/2014/10/06/how-to-manage-a-corporate-crisis/',
'title': [u'Markets right to worry about euro zone', u'Italy has no good Plan B', u'How to manage a corporate crisis']}

But I want it to look like the output that dreyescat's example gave me:

2014-10-24 13:14:54-0400 [hackernews] DEBUG: Scraped from <200 https://news.ycombinator.com/item?id=8502433> {'comment': [u"I get it - Java people want to work in Java. However, this tool seems only targeted at the M in the MVC paradigm. You still need to write your views and controllers in Objective-C. Unless your app has a large number of very complex model objects, it's probably quicker to just retype your model classes in Objective-C. Of course if your app does have a lot of very complex model objects (as Google probably does) and you want to always have them in sync across platforms without having to retype anything then this makes a ton of sense. But for the majority of apps, it does not."], 'title': [u'Google j2objc, a Java to iOS Objective-C translation tool and runtime']}

I suspect it has something to do with my xpath, but at this point I honestly don't know what I'm doing wrong. Hopefully someone can help. Many thanks!

1 Answer:

Answer 0 (score: 0)

Your xpaths seem to target elements on the homepage itself. But that is not quite how the code works; let me try to explain.

rules = (
    Rule(LinkExtractor(allow_domains=('blogs.reuters.com', )), callback='parse_item'),
)

The code block above defines which kinds of links are useful (i.e. need further processing). The spider then collects all links within the specified domain, opens each page, and passes that individual page to the parse_item function. Therefore, the xpaths in parse_item should really target the page that opens when you click one of the blogs.reuters.com/... links.
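This also explains the single-row CSV: each call to parse_item returns one item, and if its fields hold lists of every match on the page, the whole crawl collapses into very few rows with list-valued cells. A rough sketch of the effect using the stdlib csv module (the titles and links below are made up; Scrapy's own CSV exporter joins multivalued fields in a similar comma-separated way):

```python
import csv
import io

# One item whose fields are LISTS of every match on the page --
# this is what index-page xpaths like '//h2/a/text()' produce.
item = {
    "title": ["Title A", "Title B"],  # hypothetical titles
    "link": ["/post/a", "/post/b"],   # hypothetical links
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "link"])
writer.writeheader()
# The lists get joined into a single cell, so all results land in one row.
writer.writerow({k: ",".join(v) for k, v in item.items()})
print(buf.getvalue())
```

Returning one item per article page (with scalar fields) is what produces one CSV row per article instead.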

In this case, the links on the homepage lead to individual articles. I just checked: the xpath //h2/text() captures the article's title.
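You can sanity-check this kind of xpath before wiring it into the spider (scrapy shell is the usual tool). The idea, sketched with stdlib parsing on made-up markup standing in for the two kinds of pages the spider sees:

```python
import xml.etree.ElementTree as ET

# Made-up markup, NOT the real Reuters pages.
index_page = "<html><body><h2><a href='/a'>A</a></h2><h2><a href='/b'>B</a></h2></body></html>"
article_page = "<html><body><h2>A single article title</h2></body></html>"

def h2_texts(markup):
    # Rough stand-in for response.xpath('//h2//text()').extract()
    return [t for h2 in ET.fromstring(markup).iter("h2") for t in h2.itertext()]

print(h2_texts(index_page))    # every headline on the index page
print(h2_texts(article_page))  # just the one title on an article page
```

On the index page the same expression matches every headline at once, which is exactly the list-valued result the asker is seeing; on an article page it matches only the one title.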

So perhaps you should change your parse_item function to:

def parse_item(self, response):
    item = TargetsItem()
    item['title'] = response.xpath('//h2/text()').extract()
    item['link'] = response.xpath('<insert an xpath to obtain the link from the news post>').extract()
    return item

Keep in mind that parse_item receives every link in the domain blogs.reuters.com. You have to write xpaths that make sense for each linked page.

I couldn't find a link to the article within the page itself. In that case you can use the URL:

    item['link'] = response.url  # or something; read the manual
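Putting the two suggestions together, here is a runnable sketch of the corrected callback. FakeResponse is only a stand-in so the snippet runs without a live crawl; in the real spider you'd keep parse_item as a method and return a TargetsItem as before. The URL and markup are hypothetical:

```python
import xml.etree.ElementTree as ET

class FakeResponse:
    """Minimal stand-in for scrapy's Response, just for this sketch."""
    def __init__(self, url, body):
        self.url = url
        self._doc = ET.fromstring(body)

    def h2_texts(self):
        # Stand-in for response.xpath('//h2/text()').extract()
        return [h2.text for h2 in self._doc.iter("h2") if h2.text]

def parse_item(response):
    # One item per article page: title scraped from the page, link from the URL.
    return {
        "title": response.h2_texts(),
        "link": response.url,
    }

resp = FakeResponse(
    "http://blogs.reuters.com/hugo-dixon/2014/10/20/example/",  # hypothetical article URL
    "<html><body><h2>Markets right to worry</h2></body></html>",
)
print(parse_item(resp))
```

Because every crawled article page now yields its own item with a scalar link, the CSV export comes out as one row per article rather than one row holding everything.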