<tr id="inmate_201700220865">
<td class="row ">3</td>
<td class="row "><a href="javascript:" onclick="getInmatePreview(201700220865)">View</a>
<input type="hidden" id="bookingPhoto_201700220865" value="http://bookings.example.org/201708/20170826.AA8">
<input type="hidden" id="bookingPhotoFile_201700220865" value="20170826.AA8">
<input type="hidden" id="bookingPhotoFolder_201700220865" value="201708">
<input type="hidden" id="bookingPhotoName_201700220865" value="LAST, FIRST LAST">
<input type="hidden" id="inmateID_201700220865" value="277497">
<input type="hidden" id="index_2" value="201700220865">
<input type="hidden" id="curIndex_201700220865" value="2"></td>
<td class="row ">LAST<input type="hidden" id="bookingLastName_201700220865" value="LAST"></td>
<td class="row ">FIRST<input type="hidden" id="bookingFirstName_201700220865" value="FIRST"></td>
<td class="row ">LAST<input type="hidden" id="bookingLastName_201700220865" value="LAST"></td>
<td class="row ">08/26/2017</td>
<td class="row ">41</td>
<td class="row ">M</td>
</tr>
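I am trying to extract the last six cells of text from this table row, and I am having trouble getting a nested loop to work with Beautiful Soup. I am sure there is a simpler way. For each record I only need the last name, first name, last name, and the final three cells: DOB, age, and sex. Below is my code, which returns the entire tr.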
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# beautiful soup scrape
scraped = urlopen('http://www.example.org/inmates/').read()
soup = BeautifulSoup(scraped, 'html.parser')
for item in soup.find_all('tr', {'id': re.compile('^inmate')}):
    for name in item('td', {'class': "row alt"}):
        print(item)
Thanks in advance.
Answer 0 (score: 0)
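Find all the tr tags and use the get_text() method to get their text. Then split() the text on \n and use filter to remove the empty strings. This gives you all the data you need in a single line.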
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# beautiful soup scrape
scraped = urlopen('http://www.example.org/inmates/').read()
soup = BeautifulSoup(scraped, 'html.parser')
for item in soup.find_all('tr', {'id': re.compile('^inmate')}):
    # split the row text on newlines and drop the empty strings
    data = list(filter(None, item.get_text().split('\n')))
    print(data)
Output:
['3', 'View', 'LAST', 'FIRST', 'LAST', '08/26/2017', '41', 'M']
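To see the split-and-filter step on its own, without fetching the page, here is a small sketch. The `raw_text` string is a hand-written stand-in for what get_text() might return for the row in the question, and `non_empty_lines` is a hypothetical helper name, not part of Beautiful Soup.

```python
# Hand-written sample (an assumption): roughly the text get_text() yields
# for the row in the question, with blank lines where the hidden inputs sit.
raw_text = "\n3\nView\n\n\n\n\n\n\n\nLAST\nFIRST\nLAST\n08/26/2017\n41\nM\n"


def non_empty_lines(text):
    """Split text on newlines and drop the empty strings."""
    # filter(None, ...) keeps only truthy items, so '' is discarded.
    return list(filter(None, text.split("\n")))


fields = non_empty_lines(raw_text)
print(fields)
```

Note that filter(None, ...) only drops strings that are exactly empty. If the real page indents its markup, pieces made up of spaces would survive, so stripping each piece before filtering may be needed.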
If you want to drop the first two elements, simply slice the list:
data = list(filter(None, item.get_text().split('\n')))[2:]
Output:
['LAST', 'FIRST', 'LAST', '08/26/2017', '41', 'M']
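An alternative that does not depend on where the newlines fall is to read the td cells directly and keep the last six. The sketch below is only an illustration: it parses a trimmed-down inline copy of the row from the question (most hidden inputs omitted) instead of fetching the URL, and the variable names `html` and `rows` are my own.

```python
import re

from bs4 import BeautifulSoup

# Trimmed-down copy of the row from the question (an assumption: most of the
# hidden inputs are omitted to keep the sample short).
html = """
<table>
<tr id="inmate_201700220865">
<td class="row ">3</td>
<td class="row "><a href="javascript:">View</a>
<input type="hidden" id="inmateID_201700220865" value="277497"></td>
<td class="row ">LAST<input type="hidden" id="bookingLastName_201700220865" value="LAST"></td>
<td class="row ">FIRST<input type="hidden" id="bookingFirstName_201700220865" value="FIRST"></td>
<td class="row ">LAST<input type="hidden" id="bookingLastName_201700220865" value="LAST"></td>
<td class="row ">08/26/2017</td>
<td class="row ">41</td>
<td class="row ">M</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr", id=re.compile(r"^inmate")):
    cells = tr.find_all("td")
    # Keep only the last six cells; strip=True trims surrounding whitespace.
    rows.append([td.get_text(strip=True) for td in cells[-6:]])

print(rows)
```

Taking `cells[-6:]` counts from the end of the row, so it still works if the number of leading cells changes, as long as the six wanted cells stay last.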