Scrapy Pipeline SQL syntax error

Date: 2017-04-13 16:40:45

Tags: python scrapy scrapy-pipeline

I have a spider that pulls URLs from a MySQL database and uses them as start_urls to scrape, which in turn scrapes any number of new links from the pages it visits. When I set the pipeline to INSERT both the start_url and the newly scraped URL into a new database, or when I set the pipeline to UPDATE the existing database with the newly scraped URL using the start_url as the WHERE condition, I get an SQL syntax error.

When I insert only one or the other, I get no error.

Here is spider.py:

import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request

from youtubephase2.items import Youtubephase2Item

class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT resultURL FROM SearchResults;')
        rows = cursor.fetchall()

        for row in rows:
            if row:
                yield Request(row[0], self.parse, meta=dict(start_url=row[0]))
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            item['newurl'] = sel.xpath('@href').extract()
            item['start_url'] = response.meta['start_url']
            yield item

Here is pipeline.py, showing all three self.cursor.execute statements:

import MySQLdb
import MySQLdb.cursors
import hashlib
from scrapy import log
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
from youtubephase2.items import Youtubephase2Item

class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:

            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s VALUES (%s, %s)""",(item['newurl'], item['start_url']))
            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s""",(item['newurl'], item['start_url']))
            self.cursor.execute("""INSERT INTO TestResults (NewURL, StartURL) VALUES (%s, %s)""",(item['newurl'], item['start_url']))
            self.conn.commit()


        except MySQLdb.Error, e:
            log.msg("Error %d: %s" % (e.args[0], e.args[1]))

        return item

The top SQL execute statement returns this error:

2017-04-13 18:29:34 [scrapy.core.scraper] ERROR: Error processing {'newurl': [u'http://www.tagband.co.uk/'],
 'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/scraping/youtubephase2/youtubephase2/pipelines.py", line 18, in process_item
    self.cursor.execute("""UPDATE SearchResults SET AffiliateURL = %s WHERE ResultURL = %s VALUES (%s, %s)""",(item['affiliateurl'], item['start_url']))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 159, in execute
    query = query % db.literal(args)
TypeError: not enough arguments for format string

The middle SQL execute statement returns this error:

2017-04-13 18:33:18 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ') WHERE ResultURL = 'https://www.youtube.com/watch?v=UqguztfQPho'' at line 1
2017-04-13 18:33:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho>
{'newurl': [u'http://www.tagband.co.uk/'],
 'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}

The last SQL execute statement returns the same error as the middle one, even though it is an INSERT into a new database. It seems to be adding extra single quotes. The last statement works when I insert only one of the two items into the database.

2017-04-13 18:36:40 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'https://www.youtube.com/watch?v=UqguztfQPho')' at line 1
2017-04-13 18:36:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho>
{'newurl': [u'http://www.tagband.co.uk/'],
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}
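The stray quotes in both 1064 errors point at the shape of the parameter, not the SQL itself: item['newurl'] is a list, because xpath('@href').extract() always returns a list even for a single match, and the driver quotes a list differently from a plain string. A stdlib-only illustration (the URL is a stand-in value):

```python
# xpath('@href').extract() returns a list even when only one href matches
extracted = ['http://www.tagband.co.uk/']

# a %s placeholder expects a scalar string; either of these yields one
assert extracted[0] == 'http://www.tagband.co.uk/'
assert ''.join(extracted) == 'http://www.tagband.co.uk/'
```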

Sorry for the long post. I'm trying to be thorough.

1 Answer:

Answer 0 (score: 0)

I figured it out. The problem was that I was passing a list to the MySQL execute statement in the pipeline.

I created a pipeline that converts the list to a string with "".join(item['newurl']) and returns the item before it reaches the MySQL pipeline.
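A minimal sketch of such a conversion pipeline (the class name is illustrative, not from the original post):

```python
class ListToStringPipeline(object):
    """Flatten the one-element list that xpath(...).extract() produces
    into a plain string, so the MySQL pipeline receives a scalar per
    %s placeholder."""

    def process_item(self, item, spider):
        # only join when the field is actually a list; leave strings alone
        if isinstance(item.get('newurl'), list):
            item['newurl'] = ''.join(item['newurl'])
        return item
```

For it to run first, it would be registered in ITEM_PIPELINES with a lower order number than MySQLStorePipeline.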

There is probably a better way, such as changing the item['newurl'] = sel.xpath('@href').extract() line in spider.py to extract the first item in the list or convert it to text, but the pipeline worked for me.