Scraping specific text from a nested web page

Time: 2017-08-26 18:02:13

Tags: python python-2.7 python-3.x web-scraping beautifulsoup

<tr id="inmate_201700220865">
    <td class="row ">3</td>
    <td class="row "><a href="javascript:" onclick="getInmatePreview(201700220865)">View</a>
    <input type="hidden" id="bookingPhoto_201700220865" value="http://bookings.example.org/201708/20170826.AA8">
    <input type="hidden" id="bookingPhotoFile_201700220865" value="20170826.AA8">
    <input type="hidden" id="bookingPhotoFolder_201700220865" value="201708">
    <input type="hidden" id="bookingPhotoName_201700220865" value="LAST, FIRST LAST">
    <input type="hidden" id="inmateID_201700220865" value="277497">
    <input type="hidden" id="index_2" value="201700220865">
    <input type="hidden" id="curIndex_201700220865" value="2"></td>
    <td class="row ">LAST<input type="hidden" id="bookingLastName_201700220865" value="LAST"></td>
    <td class="row ">FIRST<input type="hidden" id="bookingFirstName_201700220865" value="FIRST"></td>
    <td class="row ">LAST<input type="hidden" id="bookingLastName_201700220865" value="LAST"></td>
    <td class="row ">08/26/2017</td>
    <td class="row ">41</td>
    <td class="row ">M</td>
</tr>

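I am trying to scrape the text of the last six cells from this table row. I am having trouble running nested loops with Beautiful Soup. I am sure there is an easier way, but for the record, I only need the last name, first name, last name, and the final three cells: DOB, age, and sex. Below is my code, which returns the entire tr.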

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

#beautiful soup scrape
scraped = urlopen('http://www.example.org/inmates/').read()
soup = BeautifulSoup(scraped, 'html.parser')

for item in soup.find_all('tr',{'id' : re.compile('^inmate') }):
    for name in item ('td',{'class'  : "row alt"}):
        print (item)

Thanks in advance.

1 answer:

Answer 0 (score: 0)

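Find all the tr tags and get their text with the get_text() method. Then split() the text on \n and use filter() to remove the empty strings. This gives you all the data you need for a row in a single list.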

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

#beautiful soup scrape
scraped = urlopen('http://www.example.org/inmates/').read()
soup = BeautifulSoup(scraped, 'html.parser')

for item in soup.find_all('tr', {'id' : re.compile('^inmate')}):
    data = list(filter(None, item.get_text().split('\n')))
    print(data)

Output:

['3', 'View', 'LAST', 'FIRST', 'LAST', '08/26/2017', '41', 'M']
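
One caveat that may be worth checking: filter(None, ...) only drops empty strings. If the page's HTML is indented like the sample in the question, get_text() can also produce whitespace-only pieces, and those would survive the filter. Here is a minimal sketch on a made-up string (raw_text is hypothetical and only shaped like the text of such a row, it is not taken from the real page):

```python
# Hypothetical text, shaped like what get_text() might return for an indented row.
raw_text = "\n3\nView\n    \n    \nLAST\nFIRST\nLAST\n08/26/2017\n41\nM\n"

# filter(None, ...) removes only empty strings, so whitespace-only pieces remain.
naive = list(filter(None, raw_text.split("\n")))

# Stripping each piece and keeping the non-empty results removes those as well.
cleaned = [piece.strip() for piece in raw_text.split("\n") if piece.strip()]

print(naive)
print(cleaned)
```

If the real page has no indentation inside the rows, the two lists should come out the same and the simpler filter() version is enough.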

If you want to drop the first two elements, just slice the list:

data = list(filter(None, item.get_text().split('\n')))[2:]

Output:

['LAST', 'FIRST', 'LAST', '08/26/2017', '41', 'M']
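
As a possible alternative, you could read the text of each td separately and keep the last six cells with a negative slice, which avoids depending on where the newlines fall. This is only a sketch under some assumptions: the html string below is a trimmed, hypothetical copy of the row from the question (most hidden inputs are left out), and last_six_cells is a helper name made up for this example.

```python
import re

from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the row from the question.
html = """
<table>
<tr id="inmate_201700220865">
    <td class="row ">3</td>
    <td class="row "><a href="javascript:">View</a>
    <input type="hidden" id="inmateID_201700220865" value="277497"></td>
    <td class="row ">LAST<input type="hidden" value="LAST"></td>
    <td class="row ">FIRST<input type="hidden" value="FIRST"></td>
    <td class="row ">LAST<input type="hidden" value="LAST"></td>
    <td class="row ">08/26/2017</td>
    <td class="row ">41</td>
    <td class="row ">M</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def last_six_cells(row):
    """Return the stripped text of the last six <td> cells in a row."""
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    return cells[-6:]


# One list of six strings per matching <tr>.
rows = [last_six_cells(tr) for tr in soup.find_all("tr", id=re.compile("^inmate"))]
print(rows)
```

Because get_text(strip=True) trims each text fragment and the hidden inputs carry no text of their own, each cell comes back as a single clean string. The [-6:] slice keeps the name, date, age, and sex columns even if more leading columns are added later.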