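I know this is probably a newbie question, but I can't figure out how to take an actual href link that points to an mp3 file, follow that link, and download the mp3 (or any file, for that matter). I have gone through the documentation and various Stack Overflow questions, but I can't seem to work it out.
Here is my code setup: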
settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'audio_files'
items.py
from scrapy.item import Item, Field
class Mp3projectItem(Item):
    title = Field()
    mp3_link = Field()
    file_urls = Field()
    # calculated fields
    files = Field()
    # Log fields
    url = Field()
    date = Field()
spider.py
import scrapy
from mp3_project.items import Mp3projectItem
class Mp3pipeSpider(scrapy.Spider):
    name = 'mp3pipe'
    allowed_domains = ['<thewebsite>.com']
    start_urls = ['https://<thewebsite>.com/foo/bar/']

    def parse(self, response):
        item = Mp3projectItem()
        item['title'] = response.xpath("//*[@class='spam-title']/a//text()").extract_first()
        item['mp3_link'] = response.xpath("//*[@class='spam-content']//a/@href").extract_first()
        item['url'] = response.url
        return item
        for url in response.xpath("//*[@class='spam-content']//a/@href").extract_first():
            # Could I have used: for url in item['mp3_link']?
            yield Request(url, callback=self.parse_item)
            # In the scrapy shell the response brings back the absolute URL, so no need
            # for urlparse.urljoin(response.url, url).
            # This also throws up SyntaxError: 'return' with argument inside generator.
            # This is obviously wrong and feels like overkill given the pipeline,
            # but I don't know where to put the file_urls = Field() because the mp3 file is
            # in an embedded link.

    def parse_item(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
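I'm almost certain that I don't even need to open and write the file myself, since that is what the pipeline is designed for. But I don't know where in the spider to populate the file_urls field so that the pipeline works. Any help would be greatly appreciated.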
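For illustration, here is a minimal sketch of one way the file_urls field could be filled in. It assumes FilesPipeline is enabled and FILES_STORE is set as in settings.py above, and that a plain dict is an acceptable item. The helper name build_item is hypothetical and is not part of Scrapy or of the code above. As far as I understand it, FilesPipeline reads a list of absolute URLs from the file_urls key and records its download results under files.

```python
from urllib.parse import urljoin


def build_item(page_url, title, hrefs):
    """Build a dict item for FilesPipeline (hypothetical helper, for illustration).

    'file_urls' must be a list of absolute URLs; the pipeline downloads each
    one and stores the results under the 'files' key of the item.
    """
    return {
        "title": title,
        "url": page_url,
        # Always a list, even for a single link. urljoin leaves hrefs that
        # are already absolute unchanged and resolves relative ones.
        "file_urls": [urljoin(page_url, href) for href in hrefs if href],
    }


# Inside the spider, parse() would then yield the item instead of returning it
# and requesting the mp3 by hand (sketch only, selectors copied from above):
#
#     def parse(self, response):
#         title = response.xpath("//*[@class='spam-title']/a//text()").extract_first()
#         hrefs = response.xpath("//*[@class='spam-content']//a/@href").extract()
#         yield build_item(response.url, title, hrefs)
```

With this shape, parse_item and the manual open/write would presumably no longer be needed, because the pipeline does the downloading once the item is yielded.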