Apologies in advance -
I have a table that I'm trying to mine with scrapy, but I can't quite figure out how to drill down into it.
Here is the table:
<table class="detail-table" border="0" cellspacing="0">
<tbody>
<tr id="trAnimalID">
...
</tr>
<tr id="trSpecies">
...
</tr>
<tr id="trBreed">
...
</tr>
<tr id="trAge">
...
<tr id="trSex">
...
</tr>
<tr id="trSize">
...
</tr>
<tr id="trColor">
...
</tr>
<tr id="trDeclawed">
...
</tr>
<tr id="trHousetrained">
...
</tr>
<tr id="trLocation">
...
</tr>
<tr id="trIntakeDate">
<td class="detail-label" align="right">
<b>Intake Date</b>
</td>
<td class="detail-value">
<span id="lblIntakeDate">3/31/2020</span>
</td>
</tr>
<tr id="trStage">
<td class="detail-label" align="right">
<b>Stage</b>
</td>
<td class="detail-value">
<span id="lblStage">Reserved</span>
</td>
</tr>
</tbody></table>
I can drill into it with this scrapy shell command:
text = response.xpath('//*[@class="detail-table"]//tr')[10].extract()
and I get back:
'<tr id="trIntakeDate">\r\n\t
<td class="detail-label" align="right">\r\n
<b>Intake Date</b>\r\n
</td>\r\n\t
<td class="detail-value">\r\n
<span id="lblIntakeDate">3/31/2020</span>\xa0\r\n
</td>\r\n
</tr>'
I can't quite figure out how to get the value of lblIntakeDate. I just need 3/31/2020. Also, I'd like to run this as a lambda, and I can't quite figure out how to get the execute function to dump a json file the way I can from the command line. Any ideas?
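For reference, one way to do the JSON-dump part without the command line is scrapy's CrawlerProcess with the FEEDS setting; this is only a minimal sketch (assuming scrapy 2.1 or newer), and the spider class, name, and URL below are placeholders rather than anything from the question:

from scrapy.crawler import CrawlerProcess
import scrapy

class AnimalDetailSpider(scrapy.Spider):  # placeholder spider, not from the question
    name = "animal_detail"
    start_urls = ["https://example.com/animal/12345"]  # placeholder URL

    def parse(self, response):
        # Grab each span's text by id and strip any surrounding whitespace/NBSP
        yield {
            "intake_date": (response.xpath('//span[@id="lblIntakeDate"]/text()').get() or "").strip(),
            "stage": (response.xpath('//span[@id="lblStage"]/text()').get() or "").strip(),
        }

process = CrawlerProcess(settings={
    "FEEDS": {"animal.json": {"format": "json"}},  # FEEDS requires scrapy >= 2.1
})
process.crawl(AnimalDetailSpider)
process.start()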
Answer 0 (score: 1)
Try this:
//table[@class='detail-table']/tbody//tr/td/span[@id='lblIntakeDate']/text()
Go to https://www.online-toolz.com/tools/xpath-tester-online.php
and remove the extra characters such as \r\n, \t, and \xa0 first.
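For example, in the scrapy shell that expression could be used like this (a minimal sketch; .get() plus .strip() is just one way to pull the text and drop any stray whitespace or \xa0):

date = response.xpath(
    "//table[@class='detail-table']/tbody//tr/td/span[@id='lblIntakeDate']/text()"
).get()
print(date.strip() if date else None)  # -> 3/31/2020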
Answer 1 (score: 0)
Try:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = ''  # target page URL goes here
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')
for i in bs.find_all('a'):  # every anchor tag on the page
    print(i.get_text())
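As written, that loop prints the text of every <a> tag; to get the value the question actually asks about, the span could be looked up by its id instead (a sketch reusing the bs object from above):

intake = bs.find('span', id='lblIntakeDate')  # target the span by its id
if intake:
    print(intake.get_text(strip=True))  # -> 3/31/2020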