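I am scraping a website with property listings, and the "Buy/Rent" label appears only on the listing page. I extract the other data from the detail pages by passing each URL from the parse method to the parse_property method, but I cannot get the offering type from the main listing page.
I tried to handle it the same way I pass the individual URLs (see the commented-out code).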
def parse(self, response):
    properties = response.xpath('//div[@class="property-information-address"]/a')
    for property in properties:
        url = property.xpath('./@href').extract_first()
        yield Request(url, callback=self.parse_property, meta={'URL': url})
    # TODO: offering
    # offering=response.xpath('//div[@class="property-status"]')
    # for of in offerings:
    #     offering=of.xpath('./a/text()').extract_first()
    #     yield Request(offering, callback=self.parse_property, meta={'Offering':offering})
    next_page = response.xpath('//div[@class="pagination"]/a/@href')[-2].extract()
    yield Request(next_page, callback=self.parse)

def parse_property(self, response):
    l = ItemLoader(item=NPMItem(), response=response)
    url = response.meta.get('URL')
    #offer=response.meta.get('Offering')
    l.add_value('URL', response.url)
    #l.add_value('Offering', response.offer)
Answer 0 (score: 1)
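You can anchor on an element higher in the DOM tree and scrape both the offering type and the link from there, so each link stays paired with its own type. Check this code example; it works: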
def parse(self, response):
    # Select each listing container, then read the link and type relative to it
    properties = response.xpath('//div[@class="property-listing"]')
    for property in properties:
        url = property.xpath('.//div[@class="property-information-address"]/a/@href').get()
        ptype = property.xpath('.//div[@class="property-status"]/a/text()').get()
        # Carry the offering type to the detail-page callback via meta
        yield response.follow(url, self.parse_property, meta={'ptype': ptype})
    next_page = response.xpath('//link[@rel="next"]/@href').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

def parse_property(self, response):
    print('======')
    print(response.meta['ptype'])
    print('======')
    # build your item here, printing is only to show content of `ptype`
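
If you want to check the pairing logic without running the spider, here is a minimal sketch that applies the same "select the container, then query relative to it" idea to a small inline fragment. It uses the standard library's xml.etree.ElementTree instead of Scrapy selectors so it runs on its own. The sample markup, the URLs, and the pair_links_with_types helper are made up for illustration; only the class names are taken from the answer above, and the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup: the class names mirror those in the answer,
# but the structure, URLs and values are invented for this example.
SAMPLE = """
<html><body>
  <div class="property-listing">
    <div class="property-status"><a>Buy</a></div>
    <div class="property-information-address"><a href="/property/1">1 Main St</a></div>
  </div>
  <div class="property-listing">
    <div class="property-status"><a>Rent</a></div>
    <div class="property-information-address"><a href="/property/2">2 Side Rd</a></div>
  </div>
</body></html>
"""


def pair_links_with_types(markup):
    """Return (url, offering_type) tuples, one per listing container."""
    root = ET.fromstring(markup.strip())
    pairs = []
    # Iterate over the containers, not over the links or the status labels
    for listing in root.findall(".//div[@class='property-listing']"):
        # Both lookups are relative to the current container,
        # so a link can never be matched with another listing's type
        link = listing.find(".//div[@class='property-information-address']/a")
        status = listing.find(".//div[@class='property-status']/a")
        url = link.get("href") if link is not None else None
        ptype = status.text.strip() if status is not None and status.text else None
        pairs.append((url, ptype))
    return pairs


pairs = pair_links_with_types(SAMPLE)
print(pairs)
```

Querying both values from the same container is what keeps them aligned; selecting all links and all status labels with two separate page-wide XPaths would only line up by position, which breaks as soon as one listing lacks a label.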