Scrapy response returns blank lines, so the response output can't be formatted

Date: 2018-10-10 20:08:00

Tags: python scrapy

I want to remove the [] brackets that Scrapy adds to all of its output. To do that, you just add [0] at the end of the XPath statement, like this:

'a[@class="question-hyperlink"]/text()').extract()[0]

This fixes the [] problem in some cases, but in other cases Scrapy returns the second line of output as blank, so when [0] is used it errors out the moment it hits that second line:

IndexError: list index out of range
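
For illustration (my addition, not from the original post): extract() always returns a list of every match, and when the XPath matches nothing that list is empty, so indexing it with [0] raises exactly this error. A minimal standalone demonstration:

from scrapy.selector import Selector

# Markup with no matching <a class="question-hyperlink">, so extract()
# returns an empty list.
sel = Selector(text='<div><h3>no question link here</h3></div>')
texts = sel.xpath('//a[@class="question-hyperlink"]/text()').extract()
print(texts)  # []
texts[0]      # IndexError: list index out of range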

How do I prevent Scrapy from creating blank lines? It seems to be a common problem, but everyone else runs into it when exporting to CSV, whereas for me the blank Scrapy responses appear before any CSV export.

Items.py:

import scrapy
from scrapy.item import Item, Field


class QuestionItem(Item):
    title = Field()
    url = Field()

class PopularityItem(Item):
    votes = Field()
    answers = Field()
    views = Field()


class ModifiedItem(Item):
    lastModified = Field()
    modName = Field()
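
As a side note (my addition): the Field() declarations only register the allowed keys; at runtime an Item behaves like a dict restricted to those keys:

item = QuestionItem(title='Example question', url='/questions/1/example')
print(dict(item))  # {'title': 'Example question', 'url': '/questions/1/example'}
item['bogus'] = 1  # KeyError: QuestionItem does not support field: bogus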

The spider that does not output a blank second line, so [0] works:

from scrapy import Spider
from scrapy.selector import Selector

from stack.items import QuestionItem

class QuestionSpider(Spider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = QuestionItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
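
For reference, assuming the standard project layout implied by "from stack.items import ..." above, this spider can be run and its items written straight to CSV with Scrapy's built-in feed exporter:

scrapy crawl questions -o questions.csv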

The spider that outputs the second line as blank:

from scrapy import Spider
from scrapy.selector import Selector

from stack.items import PopularityItem


class PopularitySpider(Spider):
    name = "popularity"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        popularity = response.xpath('//div[contains(@class, "question-summary narrow")]/div')

        for poppart in popularity:

            item = PopularityItem()
            item['votes'] = poppart.xpath(
                'div[contains(@class, "votes")]//span/text()').extract()#[0]
            item['answers'] = poppart.xpath(
                'div[contains(@class, "answered")]//span/text()').extract()#[0]
            item['views'] = poppart.xpath(
                'div[contains(@class, "views")]//span/text()').extract()#[0]
            yield item
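
A likely cause, though the post does not confirm it: the outer XPath ends in /div, so it selects every child div of each question summary, including ones that contain none of the votes/answers/views spans; for those nodes all three inner selectors return empty lists, which shows up as the blank line. A hedged sketch of a parse loop that skips such nodes, assuming the three stats always appear together:

        for poppart in popularity:
            votes = poppart.xpath(
                'div[contains(@class, "votes")]//span/text()').extract()
            if not votes:
                # Nothing matched: this child div is not a stats block,
                # so skip it instead of yielding a blank item.
                continue
            item = PopularityItem()
            item['votes'] = votes[0]
            item['answers'] = poppart.xpath(
                'div[contains(@class, "answered")]//span/text()').extract()[0]
            item['views'] = poppart.xpath(
                'div[contains(@class, "views")]//span/text()').extract()[0]
            yield item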

Pipelines.py

import logging

import pymongo

from scrapy.conf import settings


class StackPipeline(object):
    """Pass-through pipeline; items continue on unchanged."""

    def process_item(self, item, spider):
        return item


class MongoDBPipeline(object):
    """Insert each item into a MongoDB collection named after its item class."""

    def __init__(self):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        self.db = connection[settings['MONGODB_DB']]

    def process_item(self, item, spider):
        collection = self.db[type(item).__name__.lower()]
        logging.info(collection.insert(dict(item)))
        return item
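
For completeness (my addition): the pipeline reads three settings the post never shows. A minimal settings.py sketch with placeholder values -- the setting names come from the code above, while the values and the "stack.pipelines" module path are assumptions:

# settings.py -- placeholder values, adjust to your MongoDB instance
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'stackoverflow'

ITEM_PIPELINES = {
    'stack.pipelines.MongoDBPipeline': 300,
}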

1 Answer:

Answer (score: 0):

The simplest way to handle this kind of error is to catch it and deal with it (in this case, just skip over the blank line):

class PopularitySpider(Spider):
    name = "popularity"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["https://stackoverflow.com/"]

    def parse(self, response):
        popularity = response.xpath('//div[contains(@class, "question-summary narrow")]/div')
        for poppart in popularity:
            try:
                item = PopularityItem()
                item['votes'] = poppart.xpath('div[contains(@class, "votes")]//span/text()').extract()[0]
                item['answers'] = poppart.xpath('div[contains(@class, "answered")]//span/text()').extract()[0]
                item['views'] = poppart.xpath('div[contains(@class, "views")]//span/text()').extract()[0]
            except IndexError:
                continue
            yield item
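
An alternative worth noting (my addition, not part of the original answer): SelectorList.extract_first() returns a default value instead of raising when nothing matches, so the try/except can be dropped, assuming Scrapy >= 1.0:

            item = PopularityItem()
            item['votes'] = poppart.xpath(
                'div[contains(@class, "votes")]//span/text()').extract_first(default='')
            item['answers'] = poppart.xpath(
                'div[contains(@class, "answered")]//span/text()').extract_first(default='')
            item['views'] = poppart.xpath(
                'div[contains(@class, "views")]//span/text()').extract_first(default='')
            yield item

Unlike the try/except version, this yields items with empty fields rather than skipping the blank rows entirely, so prefer the answer above if blank rows should be dropped.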