Scrapy writes only one line of output to the JSON file

Asked: 2014-05-03 09:40:56

Tags: python json scrapy

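Okay, so I'm new to programming in general, and to using Scrapy for this purpose in particular. I wrote a spider to get data from pins on pinterest.com. The problem is that I used to get data from all the pins on the page I scrape, but now I only get the data of the first pin.

I think the problem is in the pipeline or in the spider itself. Something changed after I added "strip" to the spider to get rid of whitespace, but when I changed it back I got the same output, only with the whitespace included. Here is the spider: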

from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = ["http://www.pinterest.com/llbean/pins/"]

    def parse(self, response):
        hxs = Selector(response)
        item = PinterestItem()
        items = []
        item ["pin_link"] = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()[0].strip()
        item ["repin_count"] = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
        item ["like_count"] = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
        item ["board_name"] = hxs.xpath("//div[@class='creditTitle']/text()").extract()[0].strip()
        items.append(item)
        return items

And here is my pipeline:

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonLinesItemExporter

class JsonLinesExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonLinesItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

When I run the command "scrapy crawl pinterest", this is the output I get in the JSON file:

"pin_link": "/pin/94716398388365841/", "board_name": "Outdoor Fun", "like_count": "14", "repin_count": "94"}

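This is exactly the output I want, but I only get it for one pin, not for all the pins on the page. I have spent a lot of time reading similar questions, but I couldn't find one with the same problem. Any ideas about what is wrong? Thanks in advance!

EDIT: Oh, I guess it's because of the [0] before the strip function? Sorry, I just realized that might be the problem...

EDIT: Well, that wasn't the problem. I'm pretty sure it has something to do with the strip function, but I can't seem to use it correctly to get multiple pins as output. Could the solution be part of this question?: Scrapy: Why extracted strings are in this format? I see some overlap, but I don't know how to apply it.

EDIT: Okay, so when I modify the spider like this: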

from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = ["http://www.pinterest.com/llbean/pins/"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath("//div[@class='pinWrapper']")
        items = []
        for site in sites:
            item = PinterestItem()
            item ["pin_link"] = site.select("//div[@class='pinHolder']/a/@href").extract()[0].strip()
            item ["repin_count"] = site.select("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
            item ["like_count"] = site.select("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
            item ["board_name"] = site.select("//div[@class='creditTitle']/text()").extract()[0].strip()
            items.append(item)
        return items

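it does give me several lines of output, but apparently all with the same information. It produces one item for each pin on the page, but every item has the same content: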

{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
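
One likely reason every row is identical, offered as a sketch rather than a confirmed diagnosis: an XPath that starts with `//` searches the whole document even when it is called on a sub-selector, so each pass through the loop finds the first pin again. Prefixing the path with a dot (`.//`) makes it relative to the current node. The example below uses the standard library's ElementTree instead of Scrapy so that it runs on its own; the markup and its values are made up for illustration and are not real Pinterest HTML.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics two pins; not real Pinterest HTML.
PAGE = """
<html><body>
  <div class="pinWrapper">
    <div class="pinHolder"><a href="/pin/1/">first</a></div>
    <div class="creditTitle"> Outdoor Fun </div>
  </div>
  <div class="pinWrapper">
    <div class="pinHolder"><a href="/pin/2/">second</a></div>
    <div class="creditTitle"> Take Me Fishing </div>
  </div>
</body></html>
"""

def parse_pins(markup):
    """Return one dict per pin wrapper found in the markup."""
    root = ET.fromstring(markup.strip())
    pins = []
    # Each wrapper is one pin. The queries inside the loop start with "."
    # so they only look inside the current wrapper, not the whole page.
    for wrapper in root.findall(".//div[@class='pinWrapper']"):
        link = wrapper.find(".//div[@class='pinHolder']/a").get("href")
        board = wrapper.find(".//div[@class='creditTitle']").text
        pins.append({"pin_link": link.strip(), "board_name": board.strip()})
    return pins

pins = parse_pins(PAGE)
for pin in pins:
    print(pin)
```

In the Scrapy spider, the corresponding change would be to call `site.xpath(".//div[@class='pinHolder']/a/@href")` (and likewise for the other fields) inside the loop.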

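1 Answer:

Answer 0 (score: 3)

I haven't used Scrapy, so this is a wild guess.

Your selectors are pulling back multiple results. You then take the first value from each list (using the index [0]), create a single PinterestItem called item, and append it to the items list before returning. Nothing seems to loop over all the possible results returned by the selectors.

So, pull out all the results, then iterate over them to build the items list: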

def parse(self, response):
    hxs = Selector(response)
    pin_links = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()
    repin_counts = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()
    like_counts = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()
    board_names = hxs.xpath("//div[@class='creditTitle']/text()").extract()

    items = []
    for pin_link, repin_count, like_count, board_name in zip(pin_links, repin_counts, like_counts, board_names):
        item = PinterestItem()
        item["pin_link"] = pin_link.strip()
        item["repin_count"] = repin_count.strip()
        item["like_count"] = like_count.strip()
        item["board_name"] = board_name.strip()
        items.append(item)
    return items
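
To see what the `zip` step does in isolation, here is a small sketch with plain lists standing in for the values that `extract()` would return; the sample strings and the `build_items` helper are invented for illustration. One caveat: `zip` stops at the shortest list, so if a pin is missing a field, later values can end up paired with the wrong pin or dropped.

```python
# Invented sample data standing in for the four extract() results.
pin_links = [" /pin/1/ ", "/pin/2/", "/pin/3/"]
repin_counts = ["94", " 21 ", "7"]
like_counts = ["14", "3"]  # one value short on purpose
board_names = ["Outdoor Fun", "Take Me Fishing", "Camping"]

def build_items(links, repins, likes, boards):
    """Combine four parallel lists into one dict per pin."""
    items = []
    # zip yields one tuple per position and stops at the shortest list.
    for link, repin, like, board in zip(links, repins, likes, boards):
        items.append({
            "pin_link": link.strip(),
            "repin_count": repin.strip(),
            "like_count": like.strip(),
            "board_name": board.strip(),
        })
    return items

items = build_items(pin_links, repin_counts, like_counts, board_names)
print(len(items))  # the third pin is dropped because like_counts is shorter
```

If the lists can differ in length, looping over each pin's container element and querying inside it avoids this misalignment.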