I'm trying to implement this pipeline in my spider. After installing the necessary dependencies I can run the spider without any errors, but for some reason it doesn't write to my database.
I'm fairly sure something goes wrong while connecting to the database: when I enter a wrong password, I still don't get any error.
When the spider has scraped all the data, it takes a few minutes before it starts dumping the stats.
2017-08-31 13:17:12 [scrapy] INFO: Closing spider (finished)
2017-08-31 13:17:12 [scrapy] INFO: Stored csv feed (27 items) in: test.csv
2017-08-31 13:24:46 [scrapy] INFO: Dumping Scrapy stats:
Pipeline:
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy import log
SETTINGS = {}
SETTINGS['DB_HOST'] = 'mysql.domain.com'
SETTINGS['DB_USER'] = 'username'
SETTINGS['DB_PASSWD'] = 'password'
SETTINGS['DB_PORT'] = 3306
SETTINGS['DB_DB'] = 'database_name'
class MySQLPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def __init__(self, stats):
        print "init"
        # Instantiate the DB connection pool
        self.dbpool = adbapi.ConnectionPool(
            'MySQLdb',
            host=SETTINGS['DB_HOST'],
            user=SETTINGS['DB_USER'],
            passwd=SETTINGS['DB_PASSWD'],
            port=SETTINGS['DB_PORT'],
            db=SETTINGS['DB_DB'],
            charset='utf8',
            use_unicode=True,
            cursorclass=MySQLdb.cursors.DictCursor
        )
        self.stats = stats
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        """Cleanup function, called after crawling has finished to close open
        objects.
        Close ConnectionPool."""
        print "close"
        self.dbpool.close()

    def process_item(self, item, spider):
        print "process"
        query = self.dbpool.runInteraction(self._insert_record, item)
        query.addErrback(self._handle_error)
        return item

    def _insert_record(self, tx, item):
        print "insert"
        result = tx.execute(
            " INSERT INTO matches(type,home,away,home_score,away_score) VALUES (soccer,"+item["home"]+","+item["away"]+","+item["score"].explode("-")[0]+","+item["score"].explode("-")[1]+")"
        )
        if result > 0:
            self.stats.inc_value('database/items_added')

    def _handle_error(self, e):
        print "error"
        log.err(e)
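(Side note: get_project_settings is imported but never used, and the connection details sit in a hardcoded module-level dict. A sketch of reading the same keys from the project settings via the crawler instead, so they could live in settings.py rather than the pipeline module:)

    @classmethod
    def from_crawler(cls, crawler):
        # pass both the stats collector and the settings object into the pipeline
        return cls(crawler.stats, crawler.settings)

    def __init__(self, stats, settings):
        self.dbpool = adbapi.ConnectionPool(
            'MySQLdb',
            host=settings.get('DB_HOST'),      # same keys as the dict above, defined in settings.py
            user=settings.get('DB_USER'),
            passwd=settings.get('DB_PASSWD'),
            port=settings.getint('DB_PORT'),
            db=settings.get('DB_DB'),
            charset='utf8',
            use_unicode=True,
            cursorclass=MySQLdb.cursors.DictCursor
        )
        self.stats = stats
        dispatcher.connect(self.spider_closed, signals.spider_closed)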
Spider:
import scrapy
import dateparser
from crawling.items import KNVBItem
class KNVBspider(scrapy.Spider):
    name = "knvb"
    start_urls = [
        'http://www.knvb.nl/competities/eredivisie/uitslagen',
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawling.pipelines.MySQLPipeline': 301,
        }
    }

    def parse(self, response):
        # www.knvb.nl/competities/eredivisie/uitslagen
        for row in response.xpath('//div[@class="table"]'):
            for div in row.xpath('./div[@class="row"]'):
                match = KNVBItem()
                match['home'] = div.xpath('./div[@class="value home"]/div[@class="team"]/text()').extract_first()
                match['away'] = div.xpath('./div[@class="value away"]/div[@class="team"]/text()').extract_first()
                match['score'] = div.xpath('./div[@class="value center"]/text()').extract_first()
                match['date'] = dateparser.parse(div.xpath('./preceding-sibling::div[@class="header"]/span/span/text()').extract_first(), languages=['nl']).strftime("%d-%m-%Y")
                yield match
If there is a better pipeline for achieving what I'm after, suggestions are welcome too. Thanks!
Update: via the link provided in the accepted answer I eventually ended up with this function (which solved my problem):
    def process_item(self, item, spider):
        print "process"
        query = self.dbpool.runInteraction(self._insert_record, item)
        query.addErrback(self._handle_error)
        query.addBoth(lambda _: item)
        return query
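(The addBoth call makes the item, rather than the insert's return value, the Deferred's final result, so the next pipeline stage and the feed exporter still receive the item once the insert has completed, whether it succeeded or failed.)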
Answer 0 (score: 0)
If you can already see the insert showing up in your output, that's a good sign. I rewrote the insert function this way:
    def _insert_record(self, tx, item):
        print "insert"
        raw_sql = "INSERT INTO matches(type,home,away,home_score,away_score) VALUES ('%s', '%s', '%s', '%s', '%s')"
        # split() (Python's equivalent of explode()) breaks the "home-away" score string in two
        sql = raw_sql % ('soccer', item['home'], item['away'], item['score'].split('-')[0], item['score'].split('-')[1])
        print sql
        result = tx.execute(sql)
        if result > 0:
            self.stats.inc_value('database/items_added')
This lets you debug the SQL you are actually running. In your version the values were not wrapped in ', which is a syntax error in MySQL.
I'm not sure about your last values (the scores), so I treated them as strings.
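As a further sketch (my own, not part of the original answer): MySQLdb can do the quoting and escaping for you if you pass the values separately instead of formatting them into the string yourself:

    def _insert_record(self, tx, item):
        # %s placeholders are filled in by the driver, which also quotes/escapes each value
        home_score, away_score = item['score'].split('-')
        tx.execute(
            "INSERT INTO matches(type, home, away, home_score, away_score) "
            "VALUES (%s, %s, %s, %s, %s)",
            ('soccer', item['home'], item['away'], home_score, away_score)
        )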
Answer 1 (score: 0)
Have a look at this for how to use adbapi with MySQL for saving scraped items. Note the difference between their process_item implementation and yours: where you return the item immediately, they return the Deferred object produced by the runInteraction call, which hands the item back once the interaction has completed. I think this is why your _insert_record never gets called.