I am trying to scrape a website with the following setup:
I have a MySQL table containing movie names and their release years. The Scrapy spider fetches both values in its start_requests method and issues a search request; the search_in_filmweb
callback then parses the response and checks which result has the same release year as the one from the database.
Suppose my database contains a row like this:
movie_name: Death in Venice; release_year: 1971
The spider sends a request to http://www.filmweb.pl/search?q=Death+in+Venice and then picks the correct result by matching the release year.
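For reference, that search URL can be built by URL-encoding the movie name (a minimal sketch using Python 3's urllib.parse; the spider below uses the Python 2 equivalent, urllib.quote_plus):

```python
from urllib.parse import quote_plus  # Python 2: urllib.quote_plus

# Example value, as in the database row above
movie_name = "Death in Venice"

# quote_plus replaces spaces with '+' as required by query strings
url = "http://www.filmweb.pl/search?q=" + quote_plus(movie_name)
print(url)  # http://www.filmweb.pl/search?q=Death+in+Venice
```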
The spider I wrote works fine, but only for one specific record from the database (as a BaseSpider). When I try to issue requests in bulk for all rows from the database, I get this error:
2014-03-07 18:01:19+0100 [single] DEBUG: Crawled (200) <GET http://www.filmweb.pl/search?q=Death+in+Venice> (referer: None)
2014-03-07 18:01:19+0100 [single] ERROR: Spider error processing <GET http://www.filmweb.pl/search?q=Death+in+Venice>
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Library/Python/2.7/site-packages/twisted/internet/task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "/Library/Python/2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
    for x in result:
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/mikolajroszkowski/Desktop/python/scrapy_projects/filmweb_moviecus/filmweb_moviecus/spiders/single.py", line 37, in search_in_filmweb
    yield Request("http://www.filmweb.pl"+item['link_from_search'][0], meta={'item': item}, callback=self.parse)
exceptions.IndexError: list index out of range
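The IndexError itself comes from indexing an empty list: extract() returns a list, and when the XPath matches nothing that list is empty, so [0] raises. A minimal standalone sketch (not the spider code) of the failure and a guard:

```python
# Stands in for sel.xpath(...).extract() when no search result
# matches the release year -- the list is empty.
links = []

try:
    url = "http://www.filmweb.pl" + links[0]  # IndexError on empty list
except IndexError:
    url = None  # guard: skip this movie instead of crashing the spider

print(url)  # None
```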
The spider code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from filmweb_moviecus.items import FilmwebItem
from scrapy.http import Request
import MySQLdb
import urllib

class MySpider(Spider):
    name = 'single'
    allowed_domains = ['filmweb.pl']

    def start_requests(self):
        item = FilmwebItem()
        conn = MySQLdb.connect(unix_socket='/Applications/MAMP/tmp/mysql/mysql.sock', user='root', passwd='root', db='filmypodobne', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM filmy_app_movies")
        rows = cursor.fetchall()
        for row in rows:
            item['movie_name'] = urllib.quote_plus(row[1])
            item['id_db'] = row[0]
            item['db_year'] = row[3]
            #print row[1]
            #print self.db_year
            yield Request("http://www.filmweb.pl/search?q="+item['movie_name'], meta={'item': item}, callback=self.search_in_filmweb)

    def search_in_filmweb(self, response):
        sel = Selector(response)
        item = response.request.meta['item']
        item['link_from_search'] = sel.xpath('//a[following-sibling::span[contains(.,"%s")]]/@href' % item['db_year']).extract()
        yield Request("http://www.filmweb.pl"+item['link_from_search'][0], meta={'item': item}, callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        item = response.request.meta['item']
        item['tytul_pl'] = sel.xpath('//div[@class="filmTitle"]/div/h1/a/@title').extract()
        item['tytul_obcy'] = sel.xpath('//div[@class="filmTitle"]/h2/text()').extract()
        item['czas_trwania'] = sel.xpath('//div[@class="filmTime"]/text()').extract()
        yield Request("http://www.filmweb.pl"+item['link_from_search'][0]+'/descs', meta={'item': item}, callback=self.parse_opis)

    def parse_opis(self, response):
        sel = Selector(response)
        item = response.request.meta['item']
        item['opis'] = sel.xpath('//div[@class="pageBox"]//p[@class="text"][1]/text()').extract()
        return item
Answer 0 (score: 0)
Silly mistake: the FilmwebItem was created once outside the loop, so every yielded request shared (and overwrote) the same item. It should be:
def start_requests(self):
    conn = MySQLdb.connect(unix_socket='/Applications/MAMP/tmp/mysql/mysql.sock', user='root', passwd='root', db='filmypodobne', host='localhost', charset="utf8", use_unicode=True)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM filmy_app_movies")
    rows = cursor.fetchall()
    for row in rows:
        item = FilmwebItem()
        item['movie_name'] = urllib.quote_plus(row[1])
        item['id_db'] = row[0]
        item['db_year'] = row[3]
        yield Request("http://www.filmweb.pl/search?q="+item['movie_name'], meta={'item': item}, callback=self.search_in_filmweb)
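The root cause can be illustrated with a plain dict standing in for FilmwebItem (a hedged sketch, not the actual Scrapy objects): when the item is created once outside the loop, every queued request's meta references the same object, so later iterations overwrite earlier data.

```python
# Buggy pattern: one shared object mutated on every iteration.
shared = {}
snapshots = []
for name in ["Death in Venice", "Solaris"]:
    shared['movie_name'] = name   # mutates the single shared dict
    snapshots.append(shared)      # every entry references the same dict

print(snapshots[0]['movie_name'])  # Solaris -- first movie's data was lost

# Fixed pattern: a fresh object per iteration, as in the answer above.
fixed = []
for name in ["Death in Venice", "Solaris"]:
    item = {'movie_name': name}   # new dict each time through the loop
    fixed.append(item)

print(fixed[0]['movie_name'])      # Death in Venice
```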