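I have an item pipeline that takes a URL from the item and downloads it. The problem is that I have another pipeline in which I check this file manually and add some information about it. I really need to do this before the file is downloaded.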
import os
import urllib2

from scrapy.exceptions import DropItem


class VideoCommentPipeline(object):
    def process_item(self, item, spider):
        # Open the video in VLC in the background, then ask for a comment.
        os.system("vlc -vvv %s > /dev/null 2>&1 &" % item['file'])
        item['comment'] = raw_input('Your comment:')
        return item


class VideoDownloadPipeline(object):
    def process_item(self, item, spider):
        video_basename = item['file'].split('/')[-1]
        # VIDEOS_DIR is defined elsewhere in the project.
        new_filename = os.path.join(VIDEOS_DIR, video_basename)
        downloaded = False
        # Try the download up to five times.
        for i in range(5):
            try:
                video = urllib2.urlopen(item['file']).read()
                downloaded = True
                break
            except Exception:
                continue
        if not downloaded:
            raise DropItem("Couldn't download file from %s" % item)
        f = open(new_filename, 'wb')
        f.write(video)
        f.close()
        item['file'] = video_basename
        return item
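But now I always have to wait before I can review the next item, because the previous item's file has not finished downloading. I would rather check all the items first and then download them all. How can I do that?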
Answer 0 (score: 3)
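Scrapy provides a media pipeline that can be used for your purpose here. It is not well documented, but it exists and can be used, at least in recent Scrapy versions. To understand how it works you need to read the code, which is quite intuitive IMO. You can look at the image pipeline interface to see how a media pipeline works.
To check each video before downloading it, you could write something like this (you will need to match it to your item field names):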
import os

from scrapy.contrib.pipeline.media import MediaPipeline
from scrapy.http import Request


class VideoPipeline(MediaPipeline):
    VIDEOS_DIR = "/stack/scrapy/video/video/store"

    def get_media_requests(self, item, info):
        """
        Evaluate the file and, if you like it, download it.
        """
        os.system("vlc -vvv %s > /dev/null 2>&1 &" % item['video_url'][0])
        your_opinion = raw_input("how does it look?")
        item["comment"] = your_opinion
        if your_opinion == "hot":
            # Issue a request to download the video.
            return Request(item["video_url"][0], meta={"item": item})

    def media_downloaded(self, response, request, info):
        """
        The file has been downloaded and is available as response.body; save it.
        """
        item = response.meta.get("item")
        video = response.body
        video_basename = item['title'][0]
        new_filename = os.path.join(self.VIDEOS_DIR, video_basename)
        f = open(new_filename, 'wb')
        f.write(video)
        f.close()
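If you literally want "review everything first, then download everything", a different and simpler technique than the media pipeline is to queue the URLs in process_item and do the downloads in close_spider, which Scrapy calls on a pipeline once the spider has finished. The following is only a minimal sketch of that idea, not something taken from Scrapy itself: the class name DeferredDownloadPipeline, the fetch_url helper, the injectable fetch argument and the default 'videos' directory are all made up for illustration, and it assumes the item keeps its URL in item['file'] as in the question.

```python
import os

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as used in the question


def fetch_url(url):
    """Return the body of `url` as bytes (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


class DeferredDownloadPipeline(object):
    """Hypothetical pipeline: queue URLs during the crawl, download at the end."""

    def __init__(self, videos_dir='videos', fetch=fetch_url):
        # 'videos' is an assumed default; point it at your own VIDEOS_DIR.
        self.videos_dir = videos_dir
        self.fetch = fetch
        self.pending = []  # (url, target_path) pairs waiting to be downloaded

    def process_item(self, item, spider):
        # Nothing is downloaded here, so the manual review is never held up.
        url = item['file']
        basename = url.split('/')[-1]
        self.pending.append((url, os.path.join(self.videos_dir, basename)))
        item['file'] = basename
        return item

    def close_spider(self, spider):
        # Runs once, after every item has gone through process_item.
        for url, path in self.pending:
            try:
                data = self.fetch(url)
            except IOError:
                continue  # skip URLs that could not be fetched
            with open(path, 'wb') as f:
                f.write(data)
        self.pending = []
```

To try it you would list it in ITEM_PIPELINES after the commenting pipeline. The trade-off is that an item has already been passed on by the time its download runs, so a failed download can no longer drop the item the way the DropItem in the question does.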