Problems downloading files with Scrapy

Asked: 2021-07-12 04:30:12

Tags: python web-scraping scrapy

I'm trying to extract data from the table of active bids shown on this site. I'm new to Scrapy and a bit stuck on why no files are being downloaded. I can output the file URLs, but I still can't download the files from the listed URLs. I can't figure out what I'm missing or what needs to change. Any help would be greatly appreciated!

Thanks!

Here is the code I have so far:

This is my spider:

import scrapy
import urllib.parse

from government.items import GovernmentItem


class AlabamaSpider(scrapy.Spider):
    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']

    def start_requests(self):
        url = 'https://purchasing.alabama.gov/active-statewide-contracts/'

        yield scrapy.Request(url=url, callback=self.parse)

    
    def parse(self, response):
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):

            yield {
                'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
                'Bid File': row.xpath('td[@class="col-sm-1"]/a//@href').extract_first(),
                'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
                'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
            }
   
    def parse_item(self, response):
        file_url = response.xpath('td[@class="col-sm-1"]/a//@href').get()
        #file_url = response.urljoin(file_url)
        item = GovernmentItem()
        item['file_urls'] = [file_url]
        yield item
   

This is items.py:

from scrapy.item import Item, Field
import scrapy
    
class GovernmentItem(Item):
    file_urls = Field()
    files = Field()

This is my settings.py:

BOT_NAME = 'government'

SPIDER_MODULES = ['government.spiders']
NEWSPIDER_MODULE = 'government.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
ITEM_PIPELINES = {
    'government.pipelines.GovernmentPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 1,
   }

FILES_STORE = '/home/ken/Desktop/Projects/scrapy/government'
FILES_URL_FIELD = 'field_urls'
FILES_RESULT_FIELD = 'files'
MEDIA_ALLOW_REDIRECTS = True
DOWNLOAD_DELAY = 1

2 Answers:

Answer 0 (score: 1)

There are a few problems with your code:

  1. You never call the `parse_item` function.
  2. `file_url = response.xpath('td[@class="col-sm-1"]/a//@href').get()` will return nothing; you forgot the leading `//`.
  3. You need to download each file separately, so fetch all the download links with `getall()` and process them one by one.

Corrected code:

    def parse_all_items(self, response):
        all_urls = response.xpath('//td[@class="col-sm-1"]/a//@href').getall()
        base_url = 'https://purchasing.alabama.gov'
        for url in all_urls:
            item = GovernmentItem()
            item['file_urls'] = [base_url + url]
            yield item

It will download all the files. Just make sure you remember to call the function.
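That delegation step is easy to miss, so here is a minimal sketch of the pattern with plain generators (no Scrapy required; `SpiderSketch` and the sample href are illustrative, not from the question):

```python
# Plain-generator model of the wiring the answer describes: parse() must
# explicitly hand work to parse_all_items(), otherwise that function never
# runs and no file items are produced.
class SpiderSketch:
    def parse_all_items(self, hrefs):
        base_url = 'https://purchasing.alabama.gov'
        for url in hrefs:
            # Stands in for GovernmentItem with a plain dict.
            yield {'file_urls': [base_url + url]}

    def parse(self, hrefs):
        # The explicit call the answer says not to forget.
        yield from self.parse_all_items(hrefs)

items = list(SpiderSketch().parse(['/files/contract.pdf']))
# items -> [{'file_urls': ['https://purchasing.alabama.gov/files/contract.pdf']}]
```

In a real spider the same `yield from self.parse_all_items(...)` (or a `scrapy.Request` with `callback=self.parse_all_items`) is what actually connects the function to the crawl.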

Alternative solution: use the parse function you already have:

def parse(self, response):
    base_url = 'https://purchasing.alabama.gov'
    for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):
        url = row.xpath('td[@class="col-sm-1"]/a//@href').extract_first()
        yield {
            'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
            'Bid File': url,
            'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
            'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
            'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
            'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
        }
        if url:
            item = GovernmentItem()
            item['file_urls'] = [base_url + url]
            yield item
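As a side note (my suggestion, not something this answer requires): joining with `base_url + url` breaks if an href is already absolute or lacks a leading slash. The standard-library `urljoin`, which Scrapy's `response.urljoin` wraps, handles both cases:

```python
from urllib.parse import urljoin

base_url = 'https://purchasing.alabama.gov'

# Relative href: resolved against the base.
full = urljoin(base_url, '/files/contract.pdf')
# full -> 'https://purchasing.alabama.gov/files/contract.pdf'

# Already-absolute href: returned unchanged instead of being mangled.
absolute = urljoin(base_url, 'https://example.com/doc.pdf')
# absolute -> 'https://example.com/doc.pdf'
```

Inside a spider callback, `response.urljoin(url)` resolves the same way relative to `response.url`, so the hard-coded `base_url` isn't needed at all.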

Answer 1 (score: 0)

Did you add

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/path/to/valid/dir'

to settings.py?
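One more thing worth checking (my observation, not part of the snippet above): the settings.py in the question registers both pipelines with priority 1. Scrapy orders pipelines by these values (0-1000, lower runs first), so giving each a distinct number makes the execution order explicit, e.g.:

```python
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,       # download files first
    'government.pipelines.GovernmentPipeline': 300,  # then custom processing
}
```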
