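How do I save scraped data to a database?

Time: 2019-04-01 11:43:35

Tags: python-3.x scrapy mysql-workbench

I am trying to save scraped data to a database, but I am stuck.

First I save the scraped data to a CSV file, then I use the glob library to find the newest CSV and upload its contents to the database.

I am not sure what I am doing wrong here; the code and the error are below. I have already created the table yahoo_data in the database, with the same column names as the CSV. My code and its output: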

import scrapy
from scrapy.http import Request
import MySQLdb
import os
import csv
import glob

class YahooScrapperSpider(scrapy.Spider):
    name = 'yahoo_scrapper'
    allowed_domains = ['in.news.yahoo.com']
    start_urls = ['http://in.news.yahoo.com/']

    def parse(self, response):
        news_url=response.xpath('//*[@class="Mb(5px)"]/a/@href').extract()
        for url in news_url:
            absolute_url=response.urljoin(url)
            yield Request (absolute_url,callback=self.parse_text)

    def parse_text(self,response):
        Title=response.xpath('//meta[contains(@name,"twitter:title")]/@content').extract_first()
        # response.xpath('//*[@name="twitter:title"]/@content').extract_first(),this also works
        Article= response.xpath('//*[@class="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm"]/text()').extract()
        yield {'Title':Title,
               'Article':Article}

    def close(self, reason):
        csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)
        mydb = MySQLdb.connect(host='localhost',
                               user='root',
                               passwd='prasun',
                               db='books')
        cursor = mydb.cursor()
        csv_data = csv.reader(csv_file)

        row_count = 0
        for row in csv_data:
            if row_count != 0:
                cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
            row_count += 1

        mydb.commit()
        cursor.close()

I get this error:

ana. It should be directed not to disrespect the Sikh community and hurt its sentiments by passing such arbitrary and uncalled for orders," said Badal.', 'The SAD president also "brought it to the notice of the Haryana chief minister that Article 25 of the constitution safeguarded the rights of all citizens to profess and practices the tenets of their faith."', '"Keeping these facts in view I request you to direct the Haryana Public Service Commission to rescind its notification and allow Sikhs as well as candidates belonging to other religions to sport symbols of their faith during all examinations," said Badal. (ANI)']}
2019-04-01 16:49:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:49:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (25 items) in: items.csv
2019-04-01 16:49:41 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method YahooScrapperSpider.close of <YahooScrapperSpider 'yahoo_scrapper' at 0x2c60f07bac8>>
Traceback (most recent call last):
  File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\MySQLdb\cursors.py", line 201, in execute
    query = query % args
TypeError: not enough arguments for format string

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\prasun.j\Desktop\scrapping\scrapping\spiders\yahoo_scrapper.py", line 44, in close
    cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)
  File "C:\Users\prasun.j\AppData\Local\Continuum\anaconda3\lib\site-packages\MySQLdb\cursors.py", line 203, in execute
    raise ProgrammingError(str(m))
MySQLdb._exceptions.ProgrammingError: not enough arguments for format string
2019-04-01 16:49:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7985,
 'downloader/request_count': 27,
 'downloader/request_method_count/GET': 27,
 'downloader/response_bytes': 2148049,
 'downloader/response_count': 27,
 'downloader/response_status_count/200': 26,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 1, 11, 19, 41, 350717),
 'item_scraped_count': 25,
 'log_count/DEBUG': 53,
 'log_count/ERROR': 1,
 'log_count/INFO': 8,
 'request_depth_max': 1,
 'response_received_count': 26,
 'scheduler/dequeued': 27,
 'scheduler/dequeued/memory': 27,
 'scheduler/enqueued': 27,
 'scheduler/enqueued/memory': 27,
 'start_time': datetime.datetime(2019, 4, 1, 11, 19, 36, 743594)}
2019-04-01 16:49:41 [scrapy.core.engine] INFO: Spider closed (finished)

2 answers:

Answer 0 (score: 0)

This error

MySQLdb._exceptions.ProgrammingError: not enough arguments for format string

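appears to be raised because the row you pass contains fewer values than the two %s placeholders in the query.

Try printing each row before the insert to see what it actually contains.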
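Printing the rows is likely to show one-character lists rather than two-field records. In the question's close method, csv_file is a file name (a string), and csv.reader iterates over whatever object it is given; iterating a string yields single characters. Here is a minimal sketch of the difference, using a temporary file as a stand-in for items.csv (the file contents are invented for illustration):

```python
import csv
import os
import tempfile

# Stand-in for the feed export; the rows are made up for this demo.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "items.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Article"])
    writer.writerow(["Some title", "Some article text"])

# Passing a file name: csv.reader walks the string one character at a time,
# so every "row" is a one-element list and cannot fill two placeholders.
rows_from_path = list(csv.reader("items.csv"))

# Passing an open file object: csv.reader walks the file line by line.
with open(csv_path, newline="", encoding="utf-8") as f:
    rows_from_file = list(csv.reader(f))

print(rows_from_path[:3])
print(rows_from_file)
```

If that is what the printed rows look like, opening the file first (with open(csv_file, newline='') as f: and then csv.reader(f)) should give rows with both the Title and Article fields.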

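In any case, if you want to save scraped data to a database, I suggest writing a simple item pipeline that exports the items straight to the database, without going through a CSV file.

For more on item pipelines, see http://doc.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline

You can find a useful example in "Writing items to a MySQL database in Scrapy".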
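As a rough illustration of the pipeline approach, here is a minimal sketch. It assumes the yahoo_data table and connection settings from the question. The class name MySQLStorePipeline, the helper item_to_row, and the choice to join the Article fragments with spaces are all hypothetical, and the password is a placeholder.

```python
def item_to_row(item):
    """Turn a scraped item into the (Title, Article) tuple the INSERT expects."""
    article = item.get("Article") or []
    # The spider yields Article as a list of text fragments; join them
    # into one string so it fits a single column (an assumption).
    if isinstance(article, (list, tuple)):
        article = " ".join(part.strip() for part in article)
    return (item.get("Title"), article)


class MySQLStorePipeline:
    """Hypothetical pipeline that writes each item to MySQL as it is scraped."""

    insert_sql = "INSERT IGNORE INTO yahoo_data (Title, Article) VALUES (%s, %s)"

    def open_spider(self, spider):
        # Imported here so the module still loads where MySQLdb is not installed.
        import MySQLdb
        # Placeholder credentials: replace with your own.
        self.conn = MySQLdb.connect(host="localhost", user="root",
                                    passwd="your_password", db="books")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameters go in as one tuple matching the two %s placeholders.
        self.cursor.execute(self.insert_sql, item_to_row(item))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.cursor.close()
        self.conn.close()
```

To use something like this, the class would be registered in the project's ITEM_PIPELINES setting, for example {'scrapping.pipelines.MySQLStorePipeline': 300}. The module path is a guess based on the project name in the traceback.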

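Answer 1 (score: 0)

It looks like you are passing a list where the arguments need to be given separately, delimited by commas.

Try adding an asterisk to the row variable, changing: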

cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', row)

to:

cursor.execute('INSERT IGNORE INTO yahoo_data (Title,Article) VALUES(%s, %s)', *row)
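Before relying on the unpacked form, it may be worth checking how the driver's execute receives its parameters. Under the Python DB-API (PEP 249), cursor.execute(operation, parameters) takes the parameters as one sequence, so unpacking a two-element row with * passes an extra positional argument. The sketch below uses the standard-library sqlite3 module purely as a stand-in for MySQLdb, so it runs without a server. Note that sqlite3 uses ? placeholders where MySQLdb uses %s, and the table is created only for this demo. MySQLdb's execute is expected to behave the same way, but that is an assumption to verify against your installed version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE yahoo_data (Title TEXT, Article TEXT)")

sql = "INSERT INTO yahoo_data (Title, Article) VALUES (?, ?)"
row = ["Some title", "Some article text"]  # invented sample row

# Parameters passed as a single sequence: accepted.
cursor.execute(sql, row)

# Parameters unpacked with *: execute() receives one argument too many.
try:
    cursor.execute(sql, *row)
    unpack_error = None
except TypeError as exc:
    unpack_error = exc

conn.commit()
stored = cursor.execute("SELECT Title, Article FROM yahoo_data").fetchall()
print(stored)
print(type(unpack_error).__name__)
```

If the rows printed from the CSV turn out to have a single element, the number of values is the problem rather than how they are passed, and the fix belongs where the file is read.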