I get an error when trying to update a column in the db inside the set_data_update function of my pipelines file. I am trying to use the get_data function to return the url and price and, for each URL returned, call the set_data_update function, where I move the existing new_price into old_price and then store the newly scraped price as new_price. The call to set_data_update from get_data always seems to run twice. It should run only once, because at the moment I have only one row in the database, for the second URL - "https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10".
I am also seeing this traceback error:
sqlite3.OperationalError: unrecognized token: ":"
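For reference, my understanding is that sqlite3 named placeholders are written as :name with no space between the colon and the identifier, and that every placeholder in the statement needs a matching key in the parameter dict. A minimal standalone sketch of that syntax (the table contents and values here are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
curr = conn.cursor()
curr.execute("CREATE TABLE price_monitor (url TEXT, old_price TEXT, new_price TEXT)")
curr.execute("INSERT INTO price_monitor VALUES (:url, :old_price, :new_price)",
             {'url': 'https://example.com/item', 'old_price': '', 'new_price': '$49.99'})
# ":old_price" must not be written as ": old_price" -- the stray space makes
# sqlite3 raise OperationalError: unrecognized token: ":"
curr.execute("UPDATE price_monitor SET old_price = :old_price, new_price = :new_price "
             "WHERE url = :url",
             {'old_price': '$49.99', 'new_price': '$37.99', 'url': 'https://example.com/item'})
conn.commit()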
products.json
{
"itemdata": [
{ "url": "https://www.amazon.com/dp/B07GWKT87L/?`coliid=I36XKNB8MLE3&colid=KRASGH7290D0&psc=0&ref_=lv_ov_lig_dp_it#customerReview",`
"title": "coffee_maker_black_and_decker",
"name": "Cobi Maguire",
"email": "cobi@noemail.com"
},
{ "url": "https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10",
"title": "coffee_maker_hamilton_beach",
"name": "Ryan Murphy",
"email": "ryan@noemail.com"
}
]
}
Error traceback:

(price_monitor) C:\Users\hassy\Documents\python_venv\price_monitor\price_monitor>scrapy crawl price_monitor
2019-06-15 17:00:10 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: price_monitor)
2019-06-15 17:00:10 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17134-SP0
2019-06-15 17:00:10 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'price_monitor', 'NEWSPIDER_MODULE': 'price_monitor.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['price_monitor.spiders'], 'USER_AGENT': 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
2019-06-15 17:00:10 [scrapy.extensions.telnet] INFO: Telnet Password: 3c0578dfed20521c
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-06-15 17:00:10 [scrapy.middleware] INFO: Enabled item pipelines:
['price_monitor.pipelines.PriceMonitorPipeline']
2019-06-15 17:00:10 [scrapy.core.engine] INFO: Spider opened
2019-06-15 17:00:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-15 17:00:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-15 17:00:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2019-06-15 17:00:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/BLACK-DECKER-CM4202S-Programmable-Coffeemaker/dp/B07GWKT87L> from <GET https://www.amazon.com/dp/B07GWKT87L/?coliid=I36XKNB8MLE3&colid=KRASGH7290D0&psc=0&ref_=lv_ov_lig_dp_it#customerReview>
2019-06-15 17:00:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB> from <GET https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10>
2019-06-15 17:00:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/BLACK-DECKER-CM4202S-Programmable-Coffeemaker/dp/B07GWKT87L> (referer: None)
2019-06-15 17:00:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB> (referer: None)
Printing rows
('https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10', '$37.99')
calling func
2019-06-15 17:00:12 [scrapy.core.scraper] ERROR: Error processing {'email': 'ryan@noemail.com',
 'name': 'Ryan Murphy',
 'price': '$49.99',
 'title': 'BLACK+DECKER CM4202S Select-A-Size Easy Dial Programmable '
          'Coffeemaker, Extra Large 80 ounce Capacity, Stainless Steel',
 'url': 'h'}
Traceback (most recent call last):
  File "c:\users\hassy\documents\python_venv\price_monitor\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 37, in process_item
    self.get_data(item)
  File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 60, in get_data
    self.set_data_update(item, url, new_price)
  File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 88, in set_data_update
    {'old_price': old_price, 'new_price': item['price']})
sqlite3.OperationalError: unrecognized token: ":"
Printing rows
('https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10', '$37.99')
calling func
2019-06-15 17:00:12 [scrapy.core.scraper] ERROR: Error processing {'email': 'ryan@noemail.com',
 'name': 'Ryan Murphy',
 'price': '$34.99',
 'title': 'Hamilton Beach 46310 Programmable Coffee Maker, 12 Cups, Black,',
 'url': 'h'}
Traceback (most recent call last):
  File "c:\users\hassy\documents\python_venv\price_monitor\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 37, in process_item
    self.get_data(item)
  File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 60, in get_data
    self.set_data_update(item, url, new_price)
  File "c:\users\hassy\documents\python_venv\price_monitor\price_monitor\pipelines.py", line 88, in set_data_update
    {'old_price': old_price, 'new_price': item['price']})
sqlite3.OperationalError: unrecognized token: ":"
2019-06-15 17:00:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-15 17:00:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1888,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 261495,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/301': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 6, 15, 21, 0, 12, 534906),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 2,
 'log_count/INFO': 9,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2019, 6, 15, 21, 0, 10, 799145)}
2019-06-15 17:00:12 [scrapy.core.engine] INFO: Spider closed (finished)
(price_monitor) C:\Users\hassy\Documents\python_venv\price_monitor\price_monitor>
pipelines.py
import sqlite3


class PriceMonitorPipeline(object):

    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = sqlite3.connect("price_monitor.db")
        self.curr = self.conn.cursor()

    def process_item(self, item, spider):
        # self.store_data(item)
        print("printing items")
        print(item['title'])
        print(item['price'])
        self.get_data(item)
        return item

    def get_data(self, item):
        """ Check if the row already exists for this url """
        rows = 0
        url = ''
        new_price = ''
        self.rows = rows
        self.url = url
        self.new_price = new_price
        self.curr.execute("""select url, new_price from price_monitor WHERE url =:url""",
                          {'url': item['url']})
        rows = self.curr.fetchone()
        print("Printing rows")
        print(rows)
        rows_url = rows[0]
        new_price = rows[1]
        if rows is not None:
            for item['url'] in rows_url:
                print("calling func")
                self.set_data_update(item, url, new_price)
        else:
            pass

    def set_data_update(self, item, url, new_price):
        url = 'https://www.amazon.com/Hamilton-Beach-46310-Programmable-Coffee/dp/B07684BPLB/ref=sr_1_10?keywords=coffee+maker&qid=1559098604&s=home-garden&sr=1-10'
        old_price = new_price
        price = item['price']
        print("printing old price")
        print(old_price)
        print("New Price".format(item['price']))
        self.curr.execute("""update price_monitor SET old_price=: old_price, new_price=: new_price
                          WHERE url=: url""",
                          {'old_price': old_price, 'new_price': price})
        self.conn.commit()
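For comparison, this is a minimal sketch of how I believe the parameterized UPDATE is meant to look, with no space after each colon and with url included in the parameter dict (illustrative only, not my current code):

    def set_data_update(self, item, url, new_price):
        # old_price takes the previously stored price, new_price the freshly
        # scraped one; the row is selected by the url passed in from get_data.
        old_price = new_price
        self.curr.execute(
            """UPDATE price_monitor
               SET old_price = :old_price, new_price = :new_price
               WHERE url = :url""",
            {'old_price': old_price, 'new_price': item['price'], 'url': url})
        self.conn.commit()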
items.py
import scrapy
class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    name = scrapy.Field()
    email = scrapy.Field()
spider
import scrapy
import json
import sys
from ..items import AmazonItem


class MySpider(scrapy.Spider):
    name = 'price_monitor'
    newlist = []
    start_urls = []
    itemdatalist = []

    with open('C:\\Users\\hassy\\Documents\\python_venv\\price_monitor\\price_monitor\\products.json') as f:
        data = json.load(f)
        itemdatalist = data['itemdata']
        for item in itemdatalist:
            start_urls.append(item['url'])

    def start_requests(self):
        for item in MySpider.start_urls:
            yield scrapy.Request(url=item, callback=self.parse)

    def parse(self, response):
        for url in MySpider.start_urls:
            scrapeitem = AmazonItem()
            title = response.css('span#productTitle::text').extract_first()
            title = title.strip()
            price = response.css('span#priceblock_ourprice::text').extract_first()
            scrapeitem['title'] = title
            scrapeitem['price'] = price
            for item in MySpider.data['itemdata']:
                url = item['url']
                name = item['name']
                email = item['email']
                scrapeitem['url'] = url
                scrapeitem['name'] = name
                scrapeitem['email'] = email
            yield scrapeitem
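As a side note on the duplicated items, one way the spider could be restructured (a sketch only, assuming each products.json entry should map to exactly one request and one item) is to carry the product entry along in the request's meta instead of re-looping over start_urls and itemdata inside parse:

import scrapy
import json
from ..items import AmazonItem


class MySpider(scrapy.Spider):
    name = 'price_monitor'

    def start_requests(self):
        # Same products.json path as above; each entry becomes one request
        with open('C:\\Users\\hassy\\Documents\\python_venv\\price_monitor\\price_monitor\\products.json') as f:
            data = json.load(f)
        for entry in data['itemdata']:
            # Attach the matching products.json entry to the request
            yield scrapy.Request(url=entry['url'], callback=self.parse,
                                 meta={'product': entry})

    def parse(self, response):
        entry = response.meta['product']
        item = AmazonItem()
        item['title'] = (response.css('span#productTitle::text').extract_first() or '').strip()
        item['price'] = response.css('span#priceblock_ourprice::text').extract_first()
        item['url'] = entry['url']
        item['name'] = entry['name']
        item['email'] = entry['email']
        yield item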