I am starting to learn Scrapy. I want to use item loaders and write some data to MySQL. The code below works perfectly fine when I use TakeFirst() as the output processor in items.py. However, I need to get all entries into MySQL, not just the first one. When I use MapCompose() instead, I get the following MySQL-related error:
Error 1241: Operand should contain 1 column(s)
How can I modify the code to write all entries to MySQL?
test_crawlspider.py:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import TestItem
from scrapy.loader import ItemLoader


class TestCrawlSpider(CrawlSpider):
    name = "test_crawl"
    allowed_domains = ["www.immobiliare.it"]
    start_urls = [
        "http://www.immobiliare.it/Roma/case_in_vendita-Roma.html?criterio=rilevanza"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="no-decoration button next_page_act"]',)), callback="parse_start_url", follow=True),
    )

    handle_httpstatus_list = [302]

    def parse_start_url(self, response):
        l = ItemLoader(item=TestItem(), response=response)
        l.add_xpath('price', '//*/div[1]/div[1]/div[4]/strong/text()')
        l.add_xpath('rooms', '//*/div[1]/div[1]/div[7]/div[1]/span[4]/text()')
        return l.load_item()
items.py:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join


class TestItem(scrapy.Item):
    price = scrapy.Field(
        output_processor=TakeFirst(),
    )
    rooms = scrapy.Field(
        output_processor=TakeFirst(),
    )
pipelines.py:
import sys
import MySQLdb
import hashlib
from scrapy.http import Request
from tutorial.items import TestItem


class MySQLPipeline(object):

    def __init__(self):
        self.conn = MySQLdb.connect(user='XXX', passwd='YYY', host='localhost', db='ZZZ')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        print item
        try:
            self.cursor.execute("INSERT INTO test_table (price, rooms) VALUES (%s, %s)",
                                (item['price'], item['rooms']))
            self.conn.commit()
        except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])
        return item
Answer 0 (score: 2)
You need to pass string values to MySQL. That is why TakeFirst() works: it takes the list the loader collects and returns only its first element. This is the normal procedure, because the loader usually collects something like ['myvalue'], where only the first element matters.
Now, if you want to store a whole list in the database, say ['a', 'b', 'c'], you need to define how to serialize it into a string, for example:
'a;b;c'  # join the list elements with ';' -> ';'.join(['a', 'b', 'c'])
This is something you need to define yourself, because later, when you query the database, you will have to deserialize it accordingly:
'a;b;c'.split(';') -> ['a', 'b', 'c']
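A minimal round-trip sketch of that serialization (plain Python, no Scrapy required; the sample values are made up, and the ';' separator only works if it never occurs inside the values):

```python
# Serialize a list of scraped values into a single string for one DB column.
prices = ['250.000', '300.000', '180.000']
serialized = ';'.join(prices)  # '250.000;300.000;180.000'

# Later, after querying the column back, split it to recover the list.
deserialized = serialized.split(';')
assert deserialized == prices
```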
To use my example, you could use something like this in your items:
class TestItem(scrapy.Item):
    price = scrapy.Field(
        output_processor=Join(';'),
    )
    rooms = scrapy.Field(
        output_processor=Join(';'),
    )
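Once the fields are joined into single strings, the pipeline's parameterized INSERT works again, because database drivers can only bind scalar values, not Python lists. A quick sketch of both cases, written in Python 3 with the stdlib sqlite3 module as a stand-in for MySQLdb (which needs a running server); the sample values are made up:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE test_table (price TEXT, rooms TEXT)")

# Binding a raw list fails, analogous to MySQL's
# "Error 1241: Operand should contain 1 column(s)".
try:
    cur.execute("INSERT INTO test_table (price, rooms) VALUES (?, ?)",
                (['250.000', '300.000'], '3'))
except sqlite3.InterfaceError as e:
    print("list binding failed:", e)

# Joining the list into one string first makes the insert succeed.
cur.execute("INSERT INTO test_table (price, rooms) VALUES (?, ?)",
            (';'.join(['250.000', '300.000']), '3'))
conn.commit()
print(cur.execute("SELECT price FROM test_table").fetchall())
# prints [('250.000;300.000',)]
```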
Answer 1 (score: 1)
You need to create one item per entry in the list, like this:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TestCrawlSpider(CrawlSpider):
    name = "test_crawl"
    allowed_domains = ["www.immobiliare.it"]
    start_urls = [
        "http://www.immobiliare.it/Roma/case_in_vendita-Roma.html?criterio=rilevanza"
    ]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="no-decoration button next_page_act"]',)), callback="parse_start_url", follow=True),
    )

    handle_httpstatus_list = [302]

    def parse_start_url(self, response):
        for selector in response.css('div.content'):
            l = ItemLoader(item=TestItem(), selector=selector)
            l.add_css('price', '.price::text')
            l.add_css('rooms', '.bottom::text, .bottom span::text', re=r'.*locali.*')
            yield l.load_item()
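With one item yielded per listing, each field holds a single string again, so the original INSERT in the pipeline simply runs once per row. A rough Python 3 sketch of the resulting inserts, again with stdlib sqlite3 standing in for MySQLdb (the sample items are made up to mimic what this spider might yield):

```python
import sqlite3

# Hypothetical items, one dict per listing, as the spider above would yield them.
items = [
    {'price': '250.000', 'rooms': '3 locali'},
    {'price': '300.000', 'rooms': '4 locali'},
]

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE test_table (price TEXT, rooms TEXT)")

# One parameterized insert per item, mirroring process_item() in the pipeline.
for item in items:
    cur.execute("INSERT INTO test_table (price, rooms) VALUES (?, ?)",
                (item['price'], item['rooms']))
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM test_table").fetchone()[0])  # prints 2
```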
I changed some of the selectors so you can explore other possibilities (while learning Scrapy), but this may not be exactly the information you want to extract.