I'm trying to scrape a website with a Scrapy Spider, and my code currently looks like this:
```python
import scrapy
import numpy as np

from ..items import FotocasascraperItem


class FotoCasaSpider(scrapy.Spider):
    name = 'FotoCasa'
    allowed_domains = ['fotocasa.es']
    start_urls = ['https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l?latitude=40.4096&longitude=-3.6862&combinedLocationIds=724,14,28,173,0,28079,0,0,0']

    def parse(self, response):
        # Getting the items
        items = FotocasascraperItem()
        print("There are", len(items), "items in the dictionary")
        # Looping over each property
        property_cards = response.xpath('//article//a[contains(@class,"re-Card-link")]')
        for card in property_cards:
            # Url
            url = card.css('::attr(href)').get()
            # Entering the url
            yield response.follow(url, callback=self.parse_rent, cb_kwargs=dict(items=items))

    def parse_rent(self, response, items):
        # Initial features
        features = response.css('.re-DetailHeader-featuresItemIcon+ span')
        feature_text = features.xpath('./text()').extract()
        feature_text = [feature.replace(" ", "").replace("s", "") for feature in feature_text]
        feature_value = features.xpath('./span/text()').extract()
        ## Floor
        try:
            index = feature_text.index("Planta")
            items["Floor"] = feature_value[index]
        except:
            pass
        features = np.nan
        feature_value = np.nan
        yield items
```
The idea is that the spider visits every listing on the first page and collects the information I need. However, some features are missing on some listings, and the problem I'm running into is that Scrapy automatically fills those missing values with the value from the previous listing.

As you can see, I've already tried resetting the values of `features` and `feature_value` at the beginning and end of the `parse_rent` function, but that doesn't seem to work. I've also tried implementing a for loop to fill in each item, but that didn't work either.

I'm very new to Scrapy. Am I missing something, or is this something Scrapy normally does?
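To show exactly what I mean by "filled with the previous value", here is a Scrapy-free sketch of the behavior. It assumes the item behaves like a dict, and the `scrape_shared`/`scrape_fresh` helper names and the `pages` data are made up for illustration: one item object created once and passed to every callback keeps its old keys, while a fresh item per listing does not.

```python
def scrape_shared(pages):
    """One item created up front and reused for every page,
    mirroring how `items` is built once in parse() and passed
    to every parse_rent() call via cb_kwargs."""
    item = {}
    for page in pages:
        if "floor" in page:
            item["Floor"] = page["floor"]
        # Snapshot what would be yielded for this page.
        yield dict(item)

def scrape_fresh(pages):
    """A new item per page: a missing feature simply stays missing."""
    for page in pages:
        item = {}
        if "floor" in page:
            item["Floor"] = page["floor"]
        yield item

# Second "listing" has no floor feature at all.
pages = [{"floor": "3"}, {}]
print(list(scrape_shared(pages)))  # [{'Floor': '3'}, {'Floor': '3'}] <- stale value
print(list(scrape_fresh(pages)))   # [{'Floor': '3'}, {}]
```

In other words, the second listing inherits `Floor: '3'` only in the shared case, which is exactly the symptom described above.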