After sending an HTTP request from Python, the response data is an HTML page that contains many ABCD blocks. Here is a fragment:
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/18/2018 21:45</td>
<td>12/18/2018 21:46</td>
<td>10</td>
<td>10</td>
<td>100.0</td>
<td><span class="label success">Success</span></td>
<td>SMS</td>
<td>
<a data-id="134717" class="btn" title="Go">View</a>
</td>
</tr>
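I need to retrieve the data-id of the latest ABCD entry (134717 in this case; the number is dynamic). Note that there are many ABCD rows with different dates, and I want the one with the most recent date.
I could use a regular expression and go through the page line by line, but I think BeautifulSoup would be a better fit.
I tried the following. It finds all the ABCD cells, but I don't know how to get the latest one: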
    soup = BeautifulSoup(data, "html.parser")
    for i in soup.select("td.truncate"):
        #print(i.text)
        if i.text == "ABCD":
            print ("Got it ", i.text)
            id1 = soup.select_one("a.data-id")
            print (id1)
            parsed_url1 = urlparse(id1)
Answer 0 (score: 2)
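For this you need the dateutil parser (from the python-dateutil package). There is apparently no way to tell in advance which <td> holds a date, so you iterate over every td in each matching tr and try to parse it as a datetime. If parsing succeeds, append the result to the list of dates for that row's ID. Once you have all the dates for each ID, take their max() to find the latest one.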
    from collections import defaultdict

    from bs4 import BeautifulSoup as BS
    from dateutil import parser as du_parser

    data = "<tr><td class=\"success\"></td><td class=\"truncate\">ABCD</td><td>12/18/2018 21:45</td><td>12/18/2018 21:46</td><td>10</td><td>10</td><td>100.0</td><td><span class=\"label success\">Success</span></td><td>SMS</td><td><a data-id=\"134717\" class=\"btn\" title=\"Go\">View</a></td></tr>"

    b1 = BS(data, "html.parser")
    # rows (tr) that contain a cell whose text is exactly "ABCD"
    tr_that_contain_our_td = [x.parent for x in b1.find_all("td", string="ABCD")]
    ids_dict = defaultdict(list)

    # iterate over matched tr's to get their dates
    for tr in tr_that_contain_our_td:
        extracted_id = tr.find("a")["data-id"]
        for td in tr.find_all("td"):
            text = td.get_text(strip=True)
            # dateutil also accepts a bare number such as "10" as a day of the
            # current month, so only try cells that look like a date
            if "/" not in text:
                continue
            try:
                ids_dict[extracted_id].append(du_parser.parse(text))
            except (ValueError, OverflowError):
                pass  # not a date, nothing to do here

    # latest date seen for each data-id
    ids_dict = {k: max(v) for k, v in ids_dict.items()}
    print(ids_dict)
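The dictionary printed above holds the latest date for each data-id, while the question asks for a single ID. A minimal sketch of that last step, assuming a dictionary shaped like ids_dict; the name latest_by_id and the sample IDs and dates are made up for illustration:

```python
from datetime import datetime

# Hypothetical stand-in for ids_dict: data-id -> latest datetime found in its row
latest_by_id = {
    "134717": datetime(2018, 12, 18, 21, 46),
    "134801": datetime(2018, 12, 22, 21, 46),
    "134755": datetime(2018, 12, 20, 21, 46),
}

# max() iterates over the keys and ranks each key by its datetime
latest_id = max(latest_by_id, key=latest_by_id.get)
print(latest_id)
```

Using the key argument avoids building and sorting a separate list of (date, id) pairs.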
Answer 1 (score: 2)
Assuming the HTML follows the same pattern:
Given:
html = ''' <tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/18/2018 21:45</td>
<td>12/18/2018 21:46</td>
<td>10</td>
<td>10</td>
<td>100.0</td>
<td><span class="label success">Success</span></td>
<td>SMS</td>
<td>
<a data-id="134717" class="btn" title="Go">View</a>
</td>
</tr>
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/20/2018 21:45</td>
<td>12/20/2018 21:46</td>
<td>99</td>
<td>99</td>
<td>999.0</td>
<td><span class="label success999">Success</span></td>
<td>SMS99</td>
<td>
<a data-id="9913471799" class="btn" title="Go">View</a>
</td>
</tr>
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/22/2018 21:45</td>
<td>12/22/2018 21:46</td>
<td>99</td>
<td>99</td>
<td>999.0</td>
<td><span class="label success999">Success</span></td>
<td>SMS99</td>
<td>
<a data-id="found the latest date" class="btn" title="Go">View</a>
</td>
</tr>
<tr>
<td class="success"></td>
<td class="truncate">ABCD</td>
<td>12/21/2018 21:45</td>
<td>12/21/2018 21:46</td>
<td>99</td>
<td>99</td>
<td>999.0</td>
<td><span class="label success999">Success</span></td>
<td>SMS99</td>
<td>
<a data-id="9913471799" class="btn" title="Go">View</a>
</td>
</tr>'''
Find the latest date:
    import datetime
    import re

    import bs4

    dates_list = []
    soup = bs4.BeautifulSoup(html, 'html.parser')
    for i in soup.select("td.truncate"):
        # first MM/DD/YYYY date in the row that holds this cell
        match = re.search(r'\d{2}/\d{2}/\d{4}', i.parent.text)
        if match is None:
            continue  # this row has no date
        # keep date objects so the sort is chronological, not alphabetical
        dates_list.append(datetime.datetime.strptime(match.group(), '%m/%d/%Y').date())

    dates_list.sort()
    most_recent = dates_list[-1].strftime('%m/%d/%Y')

    # find the row that shows the most recent date and read its data-id
    for row in soup.find_all('tr'):
        if most_recent in row.text:
            id1 = row.find("a").get('data-id')
    print(id1)
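One thing to keep in mind with this approach: dates written as MM/DD/YYYY sort alphabetically when they are compared as strings, and that matches chronological order only while every date falls in the same year. A minimal sketch of the difference, using made-up dates that straddle a year boundary (the names raw, latest_as_text and latest_as_date are illustrative):

```python
from datetime import datetime

# Hypothetical dates that cross from 2018 into 2019
raw = ["12/22/2018", "01/05/2019", "12/20/2018"]

# Strings compare character by character, so a "12/..." entry sorts last
latest_as_text = sorted(raw)[-1]

# Parsing first compares real calendar dates
latest_as_date = max(datetime.strptime(s, "%m/%d/%Y").date() for s in raw)

print(latest_as_text)                        # 12/22/2018
print(latest_as_date.strftime("%m/%d/%Y"))   # 01/05/2019
```

Comparing date objects and formatting only at the end keeps the result correct regardless of the year.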
Answer 2 (score: 0)
If the data-id numbers keep increasing, you can use max() to select the a tag with the highest data-id value.
    # compare the IDs as numbers; as strings, "99" would rank above "134717"
    recentDataID = max((x.get('data-id') for x in soup.select("a[data-id]")), key=int)
    print(recentDataID)
    # if you want to select the parent or `tr`
    mostRecentRow = soup.select_one('a[data-id="%s"]' % recentDataID).parent.parent
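Attribute values come back from BeautifulSoup as strings, so a plain max() compares them character by character rather than by numeric value. A small sketch of the difference, assuming every data-id is purely numeric; the sample values and variable names are made up:

```python
# Hypothetical data-id values, as strings, the way x.get('data-id') returns them
data_ids = ["134717", "99", "134802"]

# String comparison: "9" is greater than "1", so "99" wins
largest_as_text = max(data_ids)

# Numeric comparison: key=int ranks by value but still returns the original string
largest_as_number = max(data_ids, key=int)

print(largest_as_text)    # 99
print(largest_as_number)  # 134802
```

If some data-id values are not numeric, int() raises ValueError, so this only applies when the IDs are known to be numbers.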