I am writing a Scrapy crawler and I want it to send the scraped data to a database, but I can't get it to work, maybe because of the pipeline. This is my spider:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore"
    start_urls = [
        'https://example.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # Go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

        # Don't know if this has to go here
        if not s.xpath('//*[@id="logo"]/a/img'):
            yield Request(url=response.url, dont_filter=True)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        # Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }

    custom_settings = {
        "DOWNLOAD_DELAY": 5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2
    }
I want it to send the data to a database, so in pipelines.py I have:
import pymysql
from scrapy.exceptions import DropItem
from scrapy.http import Request

class to_mysql(object):

    def __init__(self):
        self.connection = pymysql.connect("***", "***", "***", "***", charset="utf8", use_unicode=True)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO _b (book_isbn) VALUES (%s)", (item['book_isbn'].encode('utf-8')))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
and in settings.py:

ITEM_PIPELINES = {
    'bookstore.pipelines.BookstorePipeline': 300,
    'bookstore.pipelines.to_mysql': 300,
}
If I activate the «to_mysql» pipeline in settings.py, it doesn't work and returns this traceback:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/***/scrapy/bookstore/bookstore/pipelines.py", line 27, in process_item
self.cursor.execute("INSERT INTO _b (book_isbn) VALUES (%s)", (item['book_isbn'].encode('utf-8')))
AttributeError: 'list' object has no attribute 'encode'
2017-07-09 16:19:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/book/?id=9788416495412> (referer: https://example.com/materias/?novedades=LC&p)
2017-07-09 16:19:48 [scrapy.core.scraper] ERROR: Error processing {'book_isbn': [u'<li>Editorial: <a href="/search/avanzada/?go=1&editorial=Galaxia%20Gutenberg">Galaxia Gutenberg</a></li>', u'<li>P\xe1ginas: 325</li>', u'<li>A\xf1o: 2017</li>', u'<li>Precio: 21.90 \u20ac</li>', u'<li>Traductor: Pablo Moreno</li>', u'<li>EAN: 9788416495412</li>']}
Any idea why this is happening?
Answer 0 (score: 1):
This is because you are returning a list value in the book_isbn field: .extract() returns a list, and a list cannot be encoded into the SQL query. You either have to serialize that value, or, if you don't actually want a list, use extract_first() instead.
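For example, here is a minimal sketch of both options, reusing the placeholder credentials, the _b table and the selectors from the question unchanged (none of this is verified against the real site):

import pymysql

class to_mysql(object):

    def __init__(self):
        self.connection = pymysql.connect("***", "***", "***", "***", charset="utf8", use_unicode=True)
        self.cursor = self.connection.cursor()

    # Option 1: keep the list coming from the spider and serialize it here.
    def process_item(self, item, spider):
        # Join the list of <li> fragments into a single string before encoding it.
        isbn_value = u''.join(item['book_isbn'])
        self.cursor.execute("INSERT INTO _b (book_isbn) VALUES (%s)",
                            (isbn_value.encode('utf-8'),))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

Or, in the spider:

    # Option 2: make the spider yield a single string instead of a list.
    def parse_following_urls(self, response):
        for each_book in response.css('div#main'):
            yield {
                # extract_first() returns one string (or None) rather than a list
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract_first(),
            }

Note that extract_first() only returns the first li (the editorial line in your sample item), so if the EAN is what you actually need you will still want a more specific selector, and you should guard against inserting None.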