Scrapy spider not writing to Postgres

Asked: 2017-10-23 11:29:46

Tags: python postgresql scrapy scrapy-spider

I am trying to scrape items from several pages of a website into a Postgres database. I have tried different versions of the code, but it still does not work and my database is still empty...

How can I scrape the items from the site's pages into the Postgres database? What is wrong with my code?

Here is the latest version of the code:

Myspider.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import scrapy, os, re, csv
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose
from scrapy.item import Item, Field
from AHOTU_V2.items import AhotuV2Item 

def url_lister():
    url_list = []
    page_count = 0
    while page_count < 10: 
        url = 'https://marathons.ahotu.fr/calendrier/?page=%s' %page_count
        url_list.append(url)
        page_count += 1 
    return url_list

class ListeCourse(CrawlSpider):
    name = 'ListeCAP_Marathons_ahotu' 
    start_urls = url_lister()

    deals_list_xpath='//div[@class="list-group col-sm-12 top-buffer"]/a[@class="list-group-item calendar"]' 

    item_fields = {
        'nom_course': './/dl/dd[3]/text()',
        'localisation' :'.//dl/dd[2]/span[1]/text()',
    }


    def parse_item(self, response):
        selector = Selector(response)

        # iterate over deals
        for deal in selector.xpath(self.deals_list_xpath):
            loader = ItemLoader(AhotuV2Item(), selector=deal)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()  
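(Editor's aside: the `url_lister` helper simply builds the ten calendar URLs, `page=0` through `page=9`; the same list can be written more compactly as a comprehension:)

```python
# Equivalent to the question's url_lister(): pages 0-9 of the calendar.
def url_lister():
    return ['https://marathons.ahotu.fr/calendrier/?page=%s' % page
            for page in range(10)]
```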

2 Answers:

Answer 0 (score: 0)

After several hours of searching for a solution, I realized that the spider class I was using was the wrong one, which is why the spider was not working.

MySpider.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-

from scrapy.spiders import Spider
(...)

class ListeCourse(Spider):
(...)

Answer 1 (score: 0)

I do not see any rule that calls parse_item.

Your class should use Spider instead of CrawlSpider. Change:

class ListeCourse(CrawlSpider):

to

class ListeCourse(Spider):
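(Editor's note: neither answer addresses the Postgres half of the question. In Scrapy, database writes belong in an item pipeline, not in the spider itself. The sketch below shows the pattern using the stdlib `sqlite3` module so that it runs anywhere; for Postgres you would swap in `psycopg2.connect(...)` and `%s` placeholders instead of `?`. The table and column names are assumptions based on the item fields in the question.)

```python
import sqlite3

# Minimal item-pipeline sketch. For Postgres, replace sqlite3.connect
# with psycopg2.connect(dsn) and adjust placeholder syntax; the
# open/process/close structure is the standard Scrapy pipeline shape.
class DatabasePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect(':memory:')  # assumption: demo DB
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS courses '
            '(nom_course TEXT, localisation TEXT)')

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO courses (nom_course, localisation) VALUES (?, ?)',
            (item.get('nom_course'), item.get('localisation')))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()
```

The pipeline then has to be enabled in settings.py, e.g. `ITEM_PIPELINES = {'AHOTU_V2.pipelines.DatabasePipeline': 300}` (module path assumed from the project name in the question); a pipeline that is never registered is another common reason the database stays empty.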