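I'm trying to scrape data from a table using the following HTML:
Sorry for uploading it as an image; when I tried to paste the code it didn't display correctly. I'm only interested in the text associated with the highlighted classes.
I've tried using, for example, response.xpath('//table/tbody/td').extract() to return the tree, but it returns nothing. I've also tried accessing the classes, for example response.xpath('//div/div/div/div/div/div/table/tbody/tr/td[class="pricePweek"]').extract(), but again this returns nothing. Is it the line breaks that are causing the problem here?
I haven't run into this problem with Scrapy before, but I also haven't tried a table structure like this one.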
Answer 0 (score: 2)
I'm not sure which output format you prefer. Assuming you want one item per row of the data table, here is some sample code (you may need to remove the IPython console prompts):
In [10]: for tr in response.xpath('//table/tbody/tr'):
    ...:     item = dict()
    ...:     item['title'] = tr.xpath('./td[@class="title"]/text()').extract_first().strip()
    ...:     # The description cell holds a nested list, so collect every text node and join them
    ...:     item['description'] = ','.join(x.strip() for x in tr.xpath('./td[@class="description"]//text()').extract())
    ...:     item['pricePweek'] = tr.xpath('./td[@class="pricePweek"]//text()').extract_first().strip()
    ...:     item['weeks'] = tr.xpath('./td[@class="weeks"]/text()').extract_first().strip()
    ...:     item['bookFees'] = tr.xpath('./td[@class="bookFees"]/text()').extract_first().strip()
    ...:     item['total'] = tr.xpath('./td[@class="total"]/text()').extract_first().strip()
    ...:     item['sDate'] = tr.xpath('./td[@class="sDate"]/text()').extract_first().strip()
    ...:     # The booking link is the href attribute of the anchor inside the cell
    ...:     item['bookLink'] = tr.xpath('./td[@class="bookLink"]/a/@href').extract_first().strip()
    ...:     print(item)
Here is the output:
{'title': 'En-Suite (Ground Floor)', 'description': '10.5sqm,3/4 bed,En-suite Bathroom (WC, Basin and Bath),Use of ground floor communal kitchen', 'pricePweek': '£163.00', 'weeks': '50', 'bookFees': '£250.00', 'total': '£8,150.00', 'sDate': '23 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=2917&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=5386&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=3dd0f1b377330cfbad6327b728678cbd'}
{'title': 'En-Suite (Ground Floor)', 'description': '10.5sqm,3/4 bed,En-suite Bathroom (WC, Basin and Bath),Use of ground floor communal kitchen', 'pricePweek': '£163.00', 'weeks': '49', 'bookFees': '£250.00', 'total': '£7,987.00', 'sDate': '30 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=2917&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=6075&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=db85ff90cacb487ee98942d955141b09'}
{'title': 'Large Studio (Courtyard)', 'description': '22-23m,2,3/4 bed,Generous studio with same features as "Standard" but slightly larger,Dual Occupancy is available for an additional 20% of the advertised rate per week', 'pricePweek': '£223.00', 'weeks': '51', 'bookFees': '£250.00', 'total': '£11,373.00', 'sDate': '16 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=718&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=5652&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=e959ccd71b62be9211eb1dd3ad5b362c'}
{'title': 'Large Studio (Courtyard)', 'description': '22-23m,2,3/4 bed,Generous studio with same features as "Standard" but slightly larger,Dual Occupancy is available for an additional 20% of the advertised rate per week', 'pricePweek': '£223.00', 'weeks': '49', 'bookFees': '£250.00', 'total': '£10,927.00', 'sDate': '30 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=718&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=6075&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=5f798c129cfe56dead110ed5d80efa75'}
Note that some cells contain other elements, so you need to handle them appropriately. For example, the description cell contains an unordered list; here I collect its text nodes and join them with commas.
Hope this helps.
Thanks.
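The comma-joining of a nested cell can be tried on its own, without running a crawl. The following is only a minimal sketch: it uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors, the row markup is a made-up fragment modeled on the printed output (not the real page), and cell_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical row, modeled on the printed output; the real page markup may differ.
ROW = """<tr>
<td class="title">En-Suite (Ground Floor)</td>
<td class="description"><ul><li>10.5sqm</li><li>3/4 bed</li></ul></td>
<td class="pricePweek"><span>£163.00</span></td>
</tr>"""


def cell_text(row, css_class, sep=","):
    """Join all non-blank text nodes under the matching <td>; return '' if absent."""
    cell = row.find(f"./td[@class='{css_class}']")
    if cell is None:
        return ""
    parts = (text.strip() for text in cell.itertext())
    return sep.join(part for part in parts if part)


row = ET.fromstring(ROW)
# "weeks" is deliberately missing from ROW to show the fallback value.
item = {name: cell_text(row, name) for name in ("title", "description", "pricePweek", "weeks")}
print(item)
```

Returning an empty string for a missing cell avoids the AttributeError that extract_first().strip() raises when nothing matches and extract_first() returns None. In the Scrapy version, passing a default, as in extract_first(default=''), should give similar protection.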
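Answer 1 (score: 1)
Your problem is that you are verifying your XPaths in the browser and then using them in Scrapy. That may not give you an accurate picture. Consider the HTML page below: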
<html>
<body>
<table>
<tr>
<td class="name">Tarun</td>
</tr>
</table>
</body>
</html>
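If you save this HTML to a file, open it in a browser, and inspect the table, can you see the tbody that the browser added? It is not in our source, and the source is what Scrapy sees. So your XPath should not include tbody. Note also that an attribute test needs an @ in front of the attribute name. If you use the following, it should work:
price = response.xpath('//td[@class="pricePweek"]').extract()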
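Both points can be checked against the raw source without a browser. This is a minimal sketch, assuming the sample page above parses as well-formed XML; it uses Python's standard-library xml.etree.ElementTree rather than Scrapy's own selectors, so the calls differ from response.xpath even though the path logic is the same.

```python
import xml.etree.ElementTree as ET

# The sample page from above, exactly as it appears in the source (no tbody).
SOURCE = """<html>
<body>
<table>
<tr>
<td class="name">Tarun</td>
</tr>
</table>
</body>
</html>"""

root = ET.fromstring(SOURCE)

# A path that assumes a browser-inserted <tbody> matches nothing in the raw source.
with_tbody = root.findall(".//table/tbody/tr/td")

# Dropping tbody from the path matches the cell.
without_tbody = root.findall(".//table/tr/td")

# [class='name'] looks for a child element named <class>;
# [@class='name'] tests the class attribute, which is what we want.
child_predicate = root.findall(".//td[class='name']")
attr_predicate = root.findall(".//td[@class='name']")

print(len(with_tbody), len(without_tbody))
print(len(child_predicate), [td.text for td in attr_predicate])
```

Only the path without tbody and the predicate with @class find the cell. In Scrapy, viewing response.text in the shell is another way to see what the spider actually received.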