I am using a Scrapy BaseSpider to collect data from a website. The scraper starts on a product listing page, follows the "next page" link, collects certain data from each page, and stores it in a CSV file. The spider runs without errors, but it only collects data from page 1, page 2, and the last page (page 36). After several hours of tinkering with the code I cannot figure out why. My spider is shown below. Any suggestions?
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from longs.items import LongsItem
from scrapy.utils.response import get_base_url
import urlparse

class LongsComSpider(BaseSpider):
    name = "longs"
    allowed_domains = ["longswines.com"]
    start_urls = ["http://www.longswines.com/wines/?page=3&sortby=winery&item_type=wine"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Follow the "next page" link (the sixth anchor in the pagebox div).
        sites = hxs.select("//div[@class='pagebox']/a[6]/@href")
        for site in sites:
            relative_next_page = site.extract()
            next_page = [urlparse.urljoin(response.url, relative_next_page)]
            if not not relative_next_page:
                yield Request(next_page[0], self.parse)

        # Collect the product data from the current page.
        products = hxs.select("//div[@class='productlistitem']")
        items = []
        for product in products:
            item = LongsItem()
            item["title"] = product.select("div[1]/h2/a/text()").extract()
            item["link"] = response.url
            item["price"] = product.select("div[2]/h2/text()").extract()
            item["details"] = product.select("div[1]/p/text()").extract()
            items.append(item)
        for item in items:
            yield item
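To double-check the URL handling, here is a standalone snippet showing what the urljoin call produces. The relative href is a made-up example, since I have not pasted the site's actual pagination markup:

import urlparse

# Illustration only: how the next-page URL is derived in parse() above.
# The href value below is hypothetical; longswines.com's real markup may differ.
current_url = "http://www.longswines.com/wines/?page=3&sortby=winery&item_type=wine"
relative_next_page = "/wines/?page=4&sortby=winery&item_type=wine"  # e.g. a[6]/@href
print urlparse.urljoin(current_url, relative_next_page)
# -> http://www.longswines.com/wines/?page=4&sortby=winery&item_type=wine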
Answer 0 (score: 0)
I think you have a problem on this line:

    if not not relative_next_page:

You have two nots. "not not relative_next_page" is just a roundabout truthiness test (it evaluates to the same thing as bool(relative_next_page)), so at minimum the condition can be written more clearly.
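A minimal sketch of that pagination loop with the condition written plainly; this is the loop body from the question's parse() with the same behaviour (the request is still only yielded when an href was actually extracted), and I have also dropped the unneeded one-element list around the joined URL:

for site in sites:
    relative_next_page = site.extract()
    # Plain truthiness check -- equivalent to "not not relative_next_page".
    if relative_next_page:
        next_page = urlparse.urljoin(response.url, relative_next_page)
        yield Request(next_page, self.parse)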