I am using Scrapy to scrape content from a table on a website.
Sample markup:
<tr>
<td><div>2018/2058</div></td>
<td class="address"><div>Land North of 37 and 39 Hare Lane Claygate Esher Surrey KT10 9BT</div></td>
<td class="proposal"><div>Confirmation of Compliance with Conditions: 5 (Tree Protection and Pre-Commencement Inspection) and 6 (Tree Protection) of planning permission 2017/0451.</div></td>
<td><div style="min-width:90px">Claygate Ward</div></td>
</tr>
As you can see, the text sits inside a div within each td of the tr. How can I get that text with XPath or a CSS selector?
I tried:
yield {
'application-number':response.xpath('//div[contains(concat(" ", normalize-space(@id), " "), " atWeeklyListTable ")]//td[ @class="selectorgadget_selected"]/div/text()').extract_first(),
'address': response.xpath('//div[contains(concat(" ", normalize-space(@id), " "), " atWeeklyListTable ")]//td[ @class="address selectorgadget_suggested"]/div/text()').extract_first(),
'proposal': response.xpath('//div[contains(concat(" ", normalize-space(@id), " "), " atWeeklyListTable ")]//td[ @class="proposal selectorgadget_suggested"]/div/text()').extract_first(),
}
Here is the website:
Thanks!
Answer 0: (score: 2)
first_td_text = response.xpath('//tr[1]/td[1]/div/text()').extract_first()
Update
'address': response.xpath('//td[@class="address"]/div/text()').extract_first(),
Answer 1: (score: 0)
Using the XPath from gangabass:
from scrapy.http import TextResponse

# Sample row from the question, as a single HTML string
txt = '<tr>\
<td><div>2018/2058</div></td>\
<td class="address"><div>Land North of 37 and 39 Hare Lane Claygate Esher Surrey KT10 9BT</div></td>\
<td class="proposal"><div>Confirmation of Compliance with Conditions: 5 (Tree Protection and Pre-Commencement Inspection) and 6 (Tree Protection) of planning permission 2017/0451.</div></td>\
<td><div style="min-width:90px">Claygate Ward</div></td>\
</tr>'
# Build a response by hand so the XPath can be tried without running a crawl
resp = TextResponse(url='abc', body=txt, encoding='utf-8')
# td has no index here, so every cell of the first row is returned
print(resp.xpath('//tr[1]/td/div/text()').extract())
The only change is removing the [1] from td, so every cell of the row is returned instead of just the first one.
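Answer 2: (score: 0)
You can do this easily with pandas.
import pandas as pd
tables = pd.read_html(url)
read_html returns a list of DataFrames, one for each table found on the page, so tables[0] holds the first table.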
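As a rough illustration of the pandas route, the sketch below parses an inline HTML string instead of the site URL, which is not given in the thread. The header row and its column names are assumptions added so the frame has labels, and the table is cut down to three columns. `read_html` also needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Cut-down version of the sample row; the <th> header row is an assumption.
html = """
<table>
<tr><th>Application</th><th>Address</th><th>Ward</th></tr>
<tr>
<td><div>2018/2058</div></td>
<td class="address"><div>Land North of 37 and 39 Hare Lane Claygate Esher Surrey KT10 9BT</div></td>
<td><div style="min-width:90px">Claygate Ward</div></td>
</tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
df = tables[0]

print(len(tables))
print(df)
```

This approach only sees cell text, so attributes such as the class names are lost. It fits best when the whole table is wanted as-is.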