Scrapy forward-fills missing values

Time: 2021-01-24 11:32:46

Tags: python web-scraping scrapy

I am trying to scrape a website with a Scrapy spider. My code currently looks like this:

import scrapy
from ..items import FotocasascraperItem
import numpy as np

class FotoCasaSpider(scrapy.Spider):
    name = 'FotoCasa'
    allowed_domains = ['fotocasa.es']
    start_urls = ['https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l?latitude=40.4096&longitude=-3.6862&combinedLocationIds=724,14,28,173,0,28079,0,0,0']

    def parse(self, response):
        # Getting the items
        items = FotocasascraperItem()
        print("There are ", len(items), " items in the dictionary")

        # Looping over each property
        property_cards = response.xpath('//article//a[contains(@class,"re-Card-link")]')
    
        for card in property_cards:            
            # Url
            url = card.css('::attr(href)').get()
    
            # Entering the url
            yield response.follow(url, callback = self.parse_rent, cb_kwargs=dict(items = items))   


    def parse_rent(self, response, items):
        # Initial features
        features = response.css('.re-DetailHeader-featuresItemIcon+ span')
        feature_text = features.xpath('./text()').extract()
        feature_text = [feature.replace(" ","").replace("s","") for feature in feature_text]
        feature_value = features.xpath('./span/text()').extract()
        ## Floor
        try:
            index = feature_text.index("Planta")
            items["Floor"] = feature_value[index]
        except: 
            pass
        features = np.nan 
        feature_value = np.nan
        yield items

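The idea is for the spider to visit each listing on the first page and extract the information I need. However, some listings are missing certain features. The problem I am facing is that Scrapy automatically fills those missing values with the value from the previous listing.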
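One possible explanation, offered as a guess rather than a confirmed diagnosis: parse() creates a single item and hands the same object to every parse_rent call through cb_kwargs, so a key that is not overwritten keeps whatever the previous listing stored. The sketch below reproduces that effect in plain Python, without Scrapy. The helper scrape_floor and the sample pages are hypothetical stand-ins, and a plain dict replaces FotocasascraperItem.

```python
# Plain-Python stand-in for the spider's callbacks; no Scrapy required.
# `scrape_floor` and `pages` are hypothetical, for illustration only.

def scrape_floor(item, feature_text, feature_value):
    """Mimic parse_rent: set "Floor" only when "Planta" is present."""
    try:
        index = feature_text.index("Planta")
        item["Floor"] = feature_value[index]
    except ValueError:
        pass  # the key is left untouched, so an older value survives
    return dict(item)  # snapshot of the item at the moment it is yielded


pages = [
    (["Planta"], ["3"]),  # listing that has a floor
    (["Hab"], ["2"]),     # listing with no floor information
]

# One item shared by every callback, as in parse() above.
shared = {}
shared_results = [scrape_floor(shared, t, v) for t, v in pages]

# A new item for each callback.
fresh_results = [scrape_floor({}, t, v) for t, v in pages]

print(shared_results)  # the second entry still carries the first floor
print(fresh_results)   # the second entry has no "Floor" key
```

If this is indeed the cause, the carried-over value comes from the shared object rather than from anything Scrapy does on its own.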

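As you can see, I have tried resetting the values of features and feature_value at the start and end of the parse_rent function, but this does not seem to work. I have also tried implementing a for loop to fill in each item, but that did not work either.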
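A note on why the reset may have no effect: assigning np.nan to features and feature_value only rebinds two local names, and it does not change anything already stored in items. A minimal sketch of an alternative, assuming the goal is one independent record per listing, is to build a new item on every call and give "Floor" an explicit default. The helper build_item is hypothetical and uses a plain dict; in the spider, the rough equivalent would be instantiating FotocasascraperItem inside parse_rent instead of in parse.

```python
def build_item(feature_text, feature_value):
    """Return a new dict per listing; "Floor" is None when "Planta" is absent."""
    item = {"Floor": None}  # explicit default instead of a leftover value
    if "Planta" in feature_text:
        item["Floor"] = feature_value[feature_text.index("Planta")]
    return item


# Hypothetical feature lists for two listings.
first = build_item(["Planta", "Hab"], ["3", "2"])
second = build_item(["Hab"], ["2"])

print(first)
print(second)
```

Because each call returns its own dict, a listing without "Planta" cannot inherit a floor from an earlier listing.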

I am quite new to Scrapy. Am I missing something, or is this something Scrapy normally does?

0 answers:

No answers