I'm trying to scrape a table with Scrapy. The table is made up of tr > th / td elements. This is the structure of the table:
<table class="project-table">
<tr>
<th>Price Per Sqft from</th>
<td>AED 880</td>
</tr>
<tr>
<th>Type</th>
<td class="project-typess">
<a href="https://dxboffplan.com/new/apartments-for-sale-dubai/">Apartments</a>
</td>
</tr>
As you can see, some td elements contain only text, while others contain an a element. This is my code so far:
def parse(self, response):
    # get the urls of each property
    urls = response.css('div.property-listing > a::attr(href)').extract()
    # for each property make a request to get the details of each property
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse_details)
    # go and get the next link for the next property
    next_page = response.css('div.property-listing > a::attr(href)').extract_first()
    # to get the details of the property we go through a life cycle
    yield scrapy.Request(url=next_page, callback=self.parse)

def parse_details(self, response):
    for item in response.css('table.project-table> tr '):
        var = DxbItem()
        var['item'] = item.css('th::text').extract()[0]
        var['value'] = item.css('td::text').extract()[0]
        # I've tried everything I know but nothing works
        if not var['value']:
            var['value'] = item.css('td>a::text').extract()[0]
        yield var
When filling var['value'] I keep getting IndexError: list index out of range. I also tried this:
for item in response.css('table.project-table> tr '):
    var = DxbItem()
    var['item'] = item.css('th::text').extract()[0]
    a = item.css('td::text').extract()[0]
    b = item.css('td>a::text').extract()[0]
    var['value'] = a + b  # to concatenate 2 lists
    yield var
Answer 0 (score: 1)
Search the documentation for the extract_first() method.
Unlike extract()[0], .extract_first() avoids the IndexError and returns None when no element matches the selection.