Question

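I am using Scrapy 1.4.0 with Python 3.6.3.

I am trying to read the CSV file created by "-o items.csv" from within the spider's "close" method and then write its contents to MySQL. However, it only reads what was in the CSV file before the current run. Is there a way to close the CSV file, or some other way to force the read in "close" to pick up the updates made during "parse"?

Source code: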

import glob
import csv
import os
import MySQLdb as sql

from scrapy import Spider
from scrapy.http import Request

def product_info(response, value):
    return response.xpath('//th[text()="' + value +'"]/following-sibling::td/text()').extract_first()

class Books2Spider(Spider):
    name = 'books2'
    allowed_domains = ['books.toscrape.com']
    start_urls = ('http://books.toscrape.com//',)

    def parse(self,response):
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url,callback=self.parse_book)                

    def parse_book(self, response):
        title = response.xpath('//h1/text()').extract_first()           
        rating = response.xpath('//*[contains(@class,"star-rating")]/@class').extract_first()
        rating = rating.replace('star-rating ','')
        upc = product_info(response,'UPC')
        product_type = product_info(response,'Product Type')

        yield {
        'title' : title,
        'rating': rating,
        'upc' : upc,
        'product_type': product_type
        }

    def close(self, reason):
        csv_file = max(glob.iglob('*.csv'),key=os.path.getctime)

        fr = open(csv_file, 'r')
        csv.reader(fr)
        fr.close()

        mydb = sql.connect(host='localhost',user='root',
        passwd='password',db='books_db')
        print(csv_file)
        cursor = mydb.cursor()

        csv_data = csv.reader(open(csv_file,'r'))

        row_count = 0
        for row in csv_data:
            if row_count != 0:
                cursor.execute('INSERT IGNORE INTO books_table(title, rating, upc, product_type) VALUES("{}", "{}", "{}", "{}")'.format(row[0],row[1],row[2],row[3]))
            row_count += 1

        mydb.commit()
        cursor.close()

Answer 1

I think there are two existing approaches that can work around your problem.

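1. Use the item pipeline provided by Scrapy. Implement your own pipeline with a process_item method that handles each item yielded by parse_book and stores the scraped items in MySQL.
2. Export the CSV file when running the spider by adding the -o items.csv option, then read and store the export in a separate script.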
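
For the first option, here is a minimal sketch of what such a pipeline might look like. It is not tested against your database. The class name MySQLStorePipeline is made up for illustration, and the connection details and the books_table columns are simply copied from your question. It uses the open_spider, process_item and close_spider hooks that Scrapy calls on item pipelines, and passes the values as query parameters instead of formatting them into the SQL string.

```python
class MySQLStorePipeline:
    """Hypothetical pipeline: inserts each scraped book as it is yielded."""

    # Parameterized query; MySQLdb uses %s placeholders for every value.
    INSERT_SQL = (
        "INSERT IGNORE INTO books_table (title, rating, upc, product_type) "
        "VALUES (%s, %s, %s, %s)"
    )

    def open_spider(self, spider):
        # Imported here so the module can be loaded without the driver installed.
        import MySQLdb

        # Connection details are assumptions taken from the question.
        self.conn = MySQLdb.connect(
            host='localhost', user='root', passwd='password', db='books_db'
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Called once per item yielded by parse_book.
        self.cursor.execute(
            self.INSERT_SQL,
            (
                item.get('title'),
                item.get('rating'),
                item.get('upc'),
                item.get('product_type'),
            ),
        )
        # Return the item so later pipelines and the feed exporter still see it.
        return item

    def close_spider(self, spider):
        # Commit once at the end, then release the cursor and connection.
        self.conn.commit()
        self.cursor.close()
        self.conn.close()
```

To enable it you would add the class to the ITEM_PIPELINES setting, for example {'myproject.pipelines.MySQLStorePipeline': 300}, where the module path is a placeholder for wherever you put the class. The spider then no longer needs to read the CSV in close at all, and -o items.csv can still be used alongside it.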
