Correct XPath for table data in Scrapy

Time: 2017-09-28 15:23:44

Tags: html python-3.x xpath web-scraping scrapy

I am trying to scrape data from a table with the following HTML:

[image: screenshot of the table HTML]

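Apologies for uploading this as an image; when I tried to paste the code it would not display correctly. I am only interested in the text associated with the highlighted classes.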

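I have tried using, for example, response.xpath('//table/tbody/td').extract() to work back up the tree, but nothing is returned. I have also tried accessing the class directly, for example response.xpath('//div/div/div/div/div/div/table/tbody/tr/td[class="pricePweek"]').extract(), but this returns nothing either. Is it the line breaks that are causing the problem here?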

I have not run into this problem with Scrapy before, but I have not tried a table structure like this one either.

2 answers:

Answer 0: (score: 2)

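I'm not sure which output you prefer. Assuming your expected output is one item per row of the data table, here is some sample code (you may need to remove the IPython console prompts):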

In [10]: for tr in response.xpath('//table/tbody/tr'):
    ...:     item = dict()
    ...:     item['title'] = tr.xpath('./td[@class="title"]/text()').extract_first().strip()
    ...:     item['description'] = ','.join(x.strip() for x in tr.xpath('./td[@class="description"]//text()').extract())
    ...:     item['pricePweek'] = tr.xpath('./td[@class="pricePweek"]//text()').extract_first().strip()
    ...:     item['weeks'] = tr.xpath('./td[@class="weeks"]/text()').extract_first().strip()
    ...:     item['bookFees'] = tr.xpath('./td[@class="bookFees"]/text()').extract_first().strip()
    ...:     item['total'] = tr.xpath('./td[@class="total"]/text()').extract_first().strip()
    ...:     item['sDate'] = tr.xpath('./td[@class="sDate"]/text()').extract_first().strip()
    ...:     item['bookLink'] = tr.xpath('./td[@class="bookLink"]/a/@href').extract_first().strip()
    ...:     print(item)

Here is the output:

{'title': 'En-Suite (Ground Floor)', 'description': '10.5sqm,3/4 bed,En-suite Bathroom (WC, Basin and Bath),Use of ground floor communal kitchen', 'pricePweek': '£163.00', 'weeks': '50', 'bookFees': '£250.00', 'total': '£8,150.00', 'sDate': '23 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=2917&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=5386&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=3dd0f1b377330cfbad6327b728678cbd'}
{'title': 'En-Suite (Ground Floor)', 'description': '10.5sqm,3/4 bed,En-suite Bathroom (WC, Basin and Bath),Use of ground floor communal kitchen', 'pricePweek': '£163.00', 'weeks': '49', 'bookFees': '£250.00', 'total': '£7,987.00', 'sDate': '30 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=2917&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=6075&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=db85ff90cacb487ee98942d955141b09'}
{'title': 'Large Studio (Courtyard)', 'description': '22-23m,2,3/4 bed,Generous studio with same features as "Standard" but slightly larger,Dual Occupancy is available for an additional 20% of the advertised rate per week', 'pricePweek': '£223.00', 'weeks': '51', 'bookFees': '£250.00', 'total': '£11,373.00', 'sDate': '16 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=718&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=5652&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=e959ccd71b62be9211eb1dd3ad5b362c'}
{'title': 'Large Studio (Courtyard)', 'description': '22-23m,2,3/4 bed,Generous studio with same features as "Standard" but slightly larger,Dual Occupancy is available for an additional 20% of the advertised rate per week', 'pricePweek': '£223.00', 'weeks': '49', 'bookFees': '£250.00', 'total': '£10,927.00', 'sDate': '30 Sep 2017', 'bookLink': 'https://www.crm-students.com/crm-accommodation/application-form/?tx_wistcas_booknow%5BroomType%5D=718&tx_wistcas_booknow%5Bwait%5D=1&tx_wistcas_booknow%5BbookingPeriod%5D=6075&tx_wistcas_booknow%5Baction%5D=book0&tx_wistcas_booknow%5Bcontroller%5D=RoomType&cHash=5f798c129cfe56dead110ed5d80efa75'}

Please note that some cells contain other elements, so you need to handle them properly. For example, the description cell contains an unordered list, and here I joined its items with commas.
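One caveat with the loop above: `extract_first()` returns `None` when an XPath matches nothing, so chaining `.strip()` onto it raises `AttributeError` for any row that lacks one of the cells. Joining every text node can also let whitespace-only fragments through as empty entries. A minimal sketch of two small helpers that guard against both follows; `strip_or_default` and `join_text` are hypothetical names introduced here for illustration, not part of Scrapy, and the sample fragments are made up.

```python
def strip_or_default(value, default=""):
    """Strip a string, or return `default` when the selector matched nothing (None)."""
    return value.strip() if value is not None else default


def join_text(fragments, sep=","):
    """Strip every text fragment, drop the blank ones, and join the rest."""
    cleaned = (fragment.strip() for fragment in fragments)
    return sep.join(fragment for fragment in cleaned if fragment)


# Made-up fragments shaped like what `.extract()` might return for a cell that
# holds a list: whitespace-only text nodes sit between the <li> elements.
sample = ["\n  ", "10.5sqm", "\n  ", " 3/4 bed ", "\n"]
print(join_text(sample))
print(repr(strip_or_default(None)))
```

Inside the loop this would read, for example, item['title'] = strip_or_default(tr.xpath('./td[@class="title"]/text()').extract_first()) and item['description'] = join_text(tr.xpath('./td[@class="description"]//text()').extract()).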

Hope this helps.

Thanks

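Answer 1: (score: 1)

Your problem is that you are verifying your XPaths in the browser and then using them in Scrapy. The browser may not give you a true picture of the page. Consider the HTML page below: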

<html>
<body>
<table>
 <tr>
  <td class="name">Tarun</td>
 </tr>
</table>
</body>
</html>

If you save this HTML to a file and open it in a browser:

[image: tbody added]

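Can you see the tbody that the browser added? It is not in our source code, and the source code is what Scrapy sees. So your XPath should not contain tbody. If you use the expression below, it should work: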

price = response.xpath('//td[@class="pricePweek"]').extract()
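The effect can be reproduced without a browser or Scrapy. The sketch below parses the sample page with Python's standard-library `xml.etree.ElementTree`, which is used here only as a stand-in for Scrapy's selectors: it supports a small subset of XPath and needs well-formed markup, so treat it as an illustration of the path logic rather than of Scrapy's API. It also shows why the attribute test needs the `@`: without it, the predicate looks for a child element named class instead of the class attribute.

```python
import xml.etree.ElementTree as ET

# The sample page exactly as served: there is no <tbody> in the source.
RAW_HTML = """<html>
<body>
<table>
 <tr>
  <td class="name">Tarun</td>
 </tr>
</table>
</body>
</html>"""

root = ET.fromstring(RAW_HTML)

# Path copied from the browser's rendered DOM: matches nothing in the raw source.
with_tbody = root.findall(".//table/tbody/tr/td")

# Path without tbody, using @class to test the attribute.
without_tbody = root.findall(".//td[@class='name']")

# Without the @, the predicate looks for a child element named <class>.
missing_at = root.findall(".//td[class='name']")

print(len(with_tbody), len(without_tbody), len(missing_at))
print([td.text for td in without_tbody])
```

This prints 0 1 0 followed by ['Tarun']: only the path that drops tbody and tests @class finds the cell. In the Scrapy shell, view(response) or a look at response.text shows the markup Scrapy actually received, which is the thing to write XPaths against.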