I want to remove the [] brackets that Scrapy adds around all of its output. To do that, you simply add [0] to the end of the xpath statement, like so:

'a[@class="question-hyperlink"]/text()').extract()[0]

This solves the [] problem in some cases, but in others Scrapy returns the second row of output blank, so the moment it reaches that row with [0] in place, it fails with:

IndexError: list index out of range

How can I prevent Scrapy from producing blank rows? This seems to be a common problem, but everyone else hits it when exporting to CSV, whereas for me this Scrapy quirk shows up before any CSV export.
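To make the failure mode concrete, here is a minimal, self-contained sketch (the HTML snippet is hypothetical) of why .extract()[0] raises on a row where the XPath matches nothing:

from scrapy.selector import Selector

# Hypothetical row with no matching <a> element
sel = Selector(text='<div class="summary"><h3></h3></div>')

matches = sel.xpath('//a[@class="question-hyperlink"]/text()').extract()
print(matches)     # [] -- the empty list that shows up as a "blank row"
print(matches[0])  # IndexError: list index out of range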
Items.py:

import scrapy
from scrapy.item import Item, Field


class QuestionItem(Item):
    title = Field()
    url = Field()


class PopularityItem(Item):
    votes = Field()
    answers = Field()
    views = Field()


class ModifiedItem(Item):
    lastModified = Field()
    modName = Field()
The spider that does not output a blank second row, so [0] can be used:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import QuestionItem


class QuestionSpider(Spider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = QuestionItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
The spider that outputs the second row blank:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import PopularityItem


class PopularitySpider(Spider):
    name = "popularity"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        popularity = response.xpath('//div[contains(@class, "question-summary narrow")]/div')
        for poppart in popularity:
            item = PopularityItem()
            item['votes'] = poppart.xpath(
                'div[contains(@class, "votes")]//span/text()').extract()  # [0]
            item['answers'] = poppart.xpath(
                'div[contains(@class, "answered")]//span/text()').extract()  # [0]
            item['views'] = poppart.xpath(
                'div[contains(@class, "views")]//span/text()').extract()  # [0]
            yield item
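With the [0] commented out, each .extract() call returns a list, which is exactly where the [] brackets in the output come from. A hypothetical stored item for one row (values invented purely for illustration) would look like:

# Hypothetical item as stored when each field holds .extract()'s list
{'votes': ['0'], 'answers': ['1'], 'views': ['25']}

and on a row where the XPath matches nothing, the field is simply [], the blank entry.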
Pipelines.py:

import logging

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class StackPipeline(object):
    def process_item(self, item, spider):
        return item


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        self.db = connection[settings['MONGODB_DB']]

    def process_item(self, item, spider):
        collection = self.db[type(item).__name__.lower()]
        logging.info(collection.insert(dict(item)))
        return item
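For context, MongoDBPipeline reads its connection details from the project settings. A minimal settings.py sketch that would wire it up might look like the following (the host, port, database name, and pipeline priority are placeholders, not taken from the original post):

# settings.py -- placeholder values, adjust to your environment
ITEM_PIPELINES = {
    'stack.pipelines.MongoDBPipeline': 300,
}

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'stack'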
Answer 0 (score: 0):
The simplest way to handle this kind of error is to catch it and deal with it on the spot (in this case, just skip past the blank row):
class PopularitySpider(Spider):
    name = "popularity"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["https://stackoverflow.com/"]

    def parse(self, response):
        popularity = response.xpath('//div[contains(@class, "question-summary narrow")]/div')
        for poppart in popularity:
            try:
                item = PopularityItem()
                item['votes'] = poppart.xpath('div[contains(@class, "votes")]//span/text()').extract()[0]
                item['answers'] = poppart.xpath('div[contains(@class, "answered")]//span/text()').extract()[0]
                item['views'] = poppart.xpath('div[contains(@class, "views")]//span/text()').extract()[0]
            except IndexError:
                continue
            yield item
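Note that yield item sits inside the for loop, after the try/except, so rows that raise IndexError are skipped while every other row is still emitted. An alternative that avoids the exception entirely, assuming a reasonably recent Scrapy, is extract_first() (aliased as .get() in newer releases), which returns the first match or a default instead of raising; the default value '0' below is my assumption for a missing count:

    def parse(self, response):
        popularity = response.xpath('//div[contains(@class, "question-summary narrow")]/div')
        for poppart in popularity:
            item = PopularityItem()
            # extract_first() returns a plain string (or the default), never a list,
            # so there is no [] wrapper and no IndexError on empty matches
            item['votes'] = poppart.xpath('div[contains(@class, "votes")]//span/text()').extract_first(default='0')
            item['answers'] = poppart.xpath('div[contains(@class, "answered")]//span/text()').extract_first(default='0')
            item['views'] = poppart.xpath('div[contains(@class, "views")]//span/text()').extract_first(default='0')
            yield item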