Scrapy - Handling items with pipelines

Date: 2016-09-29 19:01:34

Tags: python scrapy

I am running Scrapy from a Python script.

I was told that in Scrapy, responses are handled in parse() and then processed further in pipelines.py.

This is how my framework is set up so far:

Python script

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def script(self):
    process = CrawlerProcess(get_project_settings())
    response = process.crawl('pitchfork_albums', domain='pitchfork.com')
    process.start()  # the script will block here until the crawling is finished

Spider

class PitchforkAlbums(scrapy.Spider):
    name = "pitchfork_albums"
    allowed_domains = ["pitchfork.com"]
    #creates objects for each URL listed here
    start_urls = [
                    "http://pitchfork.com/reviews/best/albums/?page=1",
                    "http://pitchfork.com/reviews/best/albums/?page=2",
                    "http://pitchfork.com/reviews/best/albums/?page=3"                   
    ]
    def parse(self, response):

        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
            item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()

        yield item

items.py

class PitchforkItem(scrapy.Item):

    artist = scrapy.Field()
    album = scrapy.Field()

settings.py

ITEM_PIPELINES = {
   'blogs.pipelines.PitchforkPipeline': 300,
}

pipelines.py

import json

class PitchforkPipeline(object):

    def __init__(self):
        self.file = open('tracks.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        for i in item:
            return i['album'][0]

If I just return item in pipelines.py, I get data like this (one item per html response page):

{'album': [u'Sirens',
           u'I Had a Dream That You Were Mine',
           u'Sunergy',
           u'Skeleton Tree',
           u'My Woman',
           u'JEFFERY',
           u'Blonde / Endless',
           u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
           u'HEAVN',
           u'Blank Face LP',
           u'blackSUMMERS\u2019night',
           u'Wildflower',
           u'Freetown Sound',
           u'Trans Day of Revenge',
           u'Puberty 2',
           u'Light Upon the Lake',
           u'iiiDrops',
           u'Teens of Denial',
           u'Coloring Book',
           u'A Moon Shaped Pool',
           u'The Colour in Anything',
           u'Paradise',
           u'HOPELESSNESS',
           u'Lemonade'],
 'artist': [u'Nicolas Jaar',
            u'Hamilton Leithauser',
            u'Rostam',
            u'Kaitlyn Aurelia Smith',
            u'Suzanne Ciani',
            u'Nick Cave & the Bad Seeds',
            u'Angel Olsen',
            u'Young Thug',
            u'Frank Ocean',
            u'Elza Soares',
            u'Jamila Woods',
            u'Schoolboy Q',
            u'Maxwell',
            u'The Avalanches',
            u'Blood Orange',
            u'G.L.O.S.S.',
            u'Mitski',
            u'Whitney',
            u'Joey Purp',
            u'Car Seat Headrest',
            u'Chance the Rapper',
            u'Radiohead',
            u'James Blake',
            u'White Lung',
            u'ANOHNI',
            u'Beyonc\xe9']}

What I would like to do in pipelines.py is to get an individual item for each song, like this:

[u'Sirens']

Please help?

1 Answer:

Answer 0 (score: 3)

I suggest you build well-structured items in the spider. In the Scrapy framework's workflow, spiders are used to build well-formed items, e.g. by parsing the HTML and populating item instances, while pipelines are used to operate on items, e.g. filtering items and storing them.
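To illustrate that division of labour, here is a minimal filtering-pipeline sketch; the class name and the rule of dropping items without an album title are assumptions for the example, not part of the question's project:

from scrapy.exceptions import DropItem

class RequireAlbumPipeline(object):
    """Filtering pipeline: discard items that carry no album title."""

    def process_item(self, item, spider):
        if not item.get('album'):
            # DropItem stops the item from reaching later pipelines
            raise DropItem('Missing album title: %r' % item)
        return item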

For your application, if I understand correctly, each item should be an entry describing one album. So when parsing the HTML, you had better build up one item per album instead of cramming everything into a single item.

So, in the parse function of spider.py, you should:

  1. Put the yield item statement inside the for loop, not outside it. This way, one item is generated for each album.
  2. Mind relative XPath selectors in Scrapy. If you want a relative XPath selector that covers self and descendants, use .// instead of //; to refer to self (the current node), use ./ instead of /.
  3. Ideally, the album title should be a scalar and the album artists a list, so try extract_first to make the album title a scalar.

    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('./ul[@class="artist-list"]/li/text()').extract_first()
            item['album'] = sel.xpath('./h2[@class="title"]/text()').extract()
            yield item
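Once the spider yields one item per album like this, each call to process_item in the pipeline receives a single album, so the for i in item loop in the question's PitchforkPipeline is no longer needed. A possible sketch of the adjusted pipeline, kept close to the question's original code rather than offered as a verified drop-in:

import json

class PitchforkPipeline(object):

    def __init__(self):
        self.file = open('tracks.jl', 'wb')

    def process_item(self, item, spider):
        # one JSON line per album, e.g. {"artist": "Nicolas Jaar", "album": ["Sirens"]}
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item  # return the item so later pipelines can still use it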
    
Hope this helps.