I am running Scrapy from a Python script.

I was told that in Scrapy, responses are handled by the built-in parse() and can be processed further in pipelines.py.

So far my framework is set up as follows:

Python script
def script(self):
    process = CrawlerProcess(get_project_settings())
    response = process.crawl('pitchfork_albums', domain='pitchfork.com')
    process.start()  # the script will block here until the crawling is finished
Spider
class PitchforkAlbums(scrapy.Spider):
    name = "pitchfork_albums"
    allowed_domains = ["pitchfork.com"]
    # creates objects for each URL listed here
    start_urls = [
        "http://pitchfork.com/reviews/best/albums/?page=1",
        "http://pitchfork.com/reviews/best/albums/?page=2",
        "http://pitchfork.com/reviews/best/albums/?page=3"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
            item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()
            yield item
items.py
class PitchforkItem(scrapy.Item):
    artist = scrapy.Field()
    album = scrapy.Field()
settings.py
ITEM_PIPELINES = {
    'blogs.pipelines.PitchforkPipeline': 300,
}
pipelines.py
class PitchforkPipeline(object):
    def __init__(self):
        self.file = open('tracks.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        for i in item:
            return i['album'][0]
If I just return item in pipelines.py, I get data like this (one item per html response page):
{'album': [u'Sirens',
u'I Had a Dream That You Were Mine',
u'Sunergy',
u'Skeleton Tree',
u'My Woman',
u'JEFFERY',
u'Blonde / Endless',
u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
u'HEAVN',
u'Blank Face LP',
u'blackSUMMERS\u2019night',
u'Wildflower',
u'Freetown Sound',
u'Trans Day of Revenge',
u'Puberty 2',
u'Light Upon the Lake',
u'iiiDrops',
u'Teens of Denial',
u'Coloring Book',
u'A Moon Shaped Pool',
u'The Colour in Anything',
u'Paradise',
u'HOPELESSNESS',
u'Lemonade'],
'artist': [u'Nicolas Jaar',
u'Hamilton Leithauser',
u'Rostam',
u'Kaitlyn Aurelia Smith',
u'Suzanne Ciani',
u'Nick Cave & the Bad Seeds',
u'Angel Olsen',
u'Young Thug',
u'Frank Ocean',
u'Elza Soares',
u'Jamila Woods',
u'Schoolboy Q',
u'Maxwell',
u'The Avalanches',
u'Blood Orange',
u'G.L.O.S.S.',
u'Mitski',
u'Whitney',
u'Joey Purp',
u'Car Seat Headrest',
u'Chance the Rapper',
u'Radiohead',
u'James Blake',
u'White Lung',
u'ANOHNI',
u'Beyonc\xe9']}
What I would like to do in pipelines.py is to get an individual item for each song, like this:

[u'Sirens']

Can anyone help?
Answer 0 (score: 3)
I suggest you build well-structured items in the spider. In the Scrapy workflow, spiders are used to build well-formed items, e.g. parsing the HTML and populating item instances, while pipelines are used to operate on items, e.g. filtering or storing them.

For your application, if I understand correctly, each item should be an entry describing one album. So when parsing the HTML, you had better build one item per album, rather than cramming everything into a single item.
So in spider.py, in the parse function, you should:

1. Put the yield item statement inside the for loop, not outside it. This way, each album generates its own item.
2. Make the inner selectors relative to the current selection: use .// instead of //, and ./ instead of /, so they search within the current album node rather than the whole document.
3. Ideally the album title should be a scalar and the album artists a list, so try extract_first() to make the album title a scalar.
def parse(self, response):
    for sel in response.xpath('//div[@class="album-artist"]'):
        item = PitchforkItem()
        item['artist'] = sel.xpath('./ul[@class="artist-list"]/li/text()').extract()
        item['album'] = sel.xpath('./h2[@class="title"]/text()').extract_first()
        yield item
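To see the loop-placement and relative-path fixes in isolation, here is a standalone sketch using only the standard library. xml.etree.ElementTree stands in for Scrapy's selectors (its XPath support is more limited, but enough for this shape of query), and the HTML fragment is made up to mirror the class names above:

```python
# Standalone sketch of the per-album parsing logic.
# xml.etree.ElementTree replaces Scrapy selectors; the fragment is invented.
import xml.etree.ElementTree as ET

HTML = """
<root>
  <div class="album-artist">
    <ul class="artist-list"><li>Nicolas Jaar</li></ul>
    <h2 class="title">Sirens</h2>
  </div>
  <div class="album-artist">
    <ul class="artist-list"><li>Hamilton Leithauser</li><li>Rostam</li></ul>
    <h2 class="title">I Had a Dream That You Were Mine</h2>
  </div>
</root>
"""

def parse(tree):
    # One item per album: build and yield the dict INSIDE the loop,
    # and keep every inner path relative ('./') to the current album node.
    for sel in tree.findall(".//div[@class='album-artist']"):
        artists = [li.text for li in sel.findall("./ul[@class='artist-list']/li")]
        album = sel.find("./h2[@class='title']").text  # scalar, like extract_first()
        yield {'artist': artists, 'album': album}

items = list(parse(ET.fromstring(HTML)))
# items[0] -> {'artist': ['Nicolas Jaar'], 'album': 'Sirens'}
```

If the yield sat after the loop instead, only one dict (the last album) would come out; with absolute paths ('//...'), every album would collect all artists on the page, which is exactly the symptom in the question.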
Hope this helps.
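P.S. On the storage side: once the spider yields one item per album, the pipeline from the question can be tightened up. A minimal sketch (keeping the tracks.jl file name from the question, and using Scrapy's standard open_spider/close_spider pipeline hooks; this is one common pattern, not the only one):

```python
# Minimal JSON-lines pipeline sketch: one serialized item per line.
import json

class PitchforkPipeline(object):
    def open_spider(self, spider):
        # Open in text mode; json.dumps produces str, not bytes.
        self.file = open('tracks.jl', 'w')

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item instances and plain dicts alike.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item  # always return the item so later pipelines still see it

    def close_spider(self, spider):
        self.file.close()
```

Note that process_item should return the item (or raise DropItem), not a field value as in the question's version, otherwise downstream pipelines receive a bare string instead of the item.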