Question

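I know this is surely a newbie question, but I can't work out how to take the actual href link that points to an mp3 file, follow that link, and download the mp3 file (or any file, for that matter). I have gone through the documentation and various Stack Overflow questions, but I can't seem to figure it out.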

Here is my code setup:

settings.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'

ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'audio_files'

items.py

from scrapy.item import Item, Field


class Mp3projectItem(Item):
    title = Field()
    mp3_link = Field()
    file_urls = Field()

    # calculated fields
    files = Field()
    # Log fields
    url = Field()
    date = Field()

spider.py

import scrapy
from scrapy import Request

from mp3_project.items import Mp3projectItem


class Mp3pipeSpider(scrapy.Spider):
    name = 'mp3pipe'
    allowed_domains = ['<thewebsite>.com']
    start_urls = ['https://<thewebsite>.com/foo/bar/']

    def parse(self, response):
        item = Mp3projectItem()
        item['title'] = response.xpath("//*[@class='spam-title']/a//text()").extract_first()
        item['mp3_link'] = response.xpath("//*[@class='spam-content']//a/@href").extract_first()
        item['url'] = response.url
        return item

        for url in response.xpath("//*[@class='spam-content']//a/@href").extract_first():
            # Could I have used: for url in item['mp3_link']?
            yield Request(url, callback=self.parse_item)
            # in scrapy shell the response brings back the absolute url, so no need
            # for urlparse.urljoin(response.url, url)
            # this also throws SyntaxError: 'return' with argument inside generator


# this is obviously wrong and feels like overkill given that I'm using the pipeline,
# but I don't know where to set file_urls because the mp3 file is
# in an embedded link
def parse_item(self, response):
    filename = response.url.split("/")[-1]
    with open(filename, 'wb') as f:
        f.write(response.body)
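
The SyntaxError mentioned in the comments comes from mixing `return item` and `yield` in the same function. Under Python 2 that combination is rejected outright; under Python 3 it is accepted, but the `return` simply ends the generator and the item is never emitted. The following is a minimal, Scrapy-free sketch of that behaviour; the function names are hypothetical and exist only for illustration.

```python
def parse_with_return_and_yield():
    # The `yield` further down makes this whole function a generator.
    item = {"title": "example"}
    return item  # in a generator this only stops iteration; the item is lost
    yield "never reached"


def parse_with_yield_only():
    # Yielding everything keeps both the item and any follow-up work.
    item = {"title": "example"}
    yield item
    yield "placeholder for a follow-up request"


print(list(parse_with_return_and_yield()))
print(list(parse_with_yield_only()))
```

In a spider callback this suggests using `yield item` instead of `return item` whenever the same callback also yields requests.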

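I'm almost certain I don't even need to open and write the file myself, since the pipeline is designed for exactly that purpose. But I don't know where in the spider to set the file_urls field so that the pipeline works. Any help would be greatly appreciated.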

Using ITEM_PIPELINES to download media files from embedded href links on a page

0 answers: