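Python BeautifulSoup: how to get the data for the most recent selector match

Date: 2018-12-21 23:49:30

Tags: python beautifulsoup

After sending an HTTP request from Python, the response (data) is an HTML page that contains many ABCD blocks. Here is a snippet: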

                   <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/18/2018 21:45</td>
                        <td>12/18/2018 21:46</td>
                        <td>10</td>
                        <td>10</td>
                        <td>100.0</td>
                        <td><span class="label success">Success</span></td>
                        <td>SMS</td>
                        <td>
                            <a data-id="134717" class="btn" title="Go">View</a>
                        </td>
                    </tr>

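I need to retrieve the latest data-id for ABCD (134717 in this case; the number is dynamic). Note also that there are many ABCD entries with different dates, and I want the one with the latest date.

I could use a regular expression and go through the page line by line, but I think BeautifulSoup would be a better fit.

I tried the following. It finds every ABCD, but I don't know how to get the latest one: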

    soup = BeautifulSoup(data, "html.parser")
    for i in soup.select("td.truncate"):
        #print(i.text)
        if i.text == "ABCD":
            print ("Got it ", i.text)
            id1 = soup.select_one("a.data-id")
            print (id1)
            parsed_url1 = urlparse(id1)

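3 Answers:

Answer 0 (score: 2)

For this you need the dateutil parser. There is apparently no way to tell in advance which <td> holds a date, so you iterate over every td in the matching tr and try to parse its text as a datetime; if parsing succeeds, append the result to the list of dates for that ID. Once you have all the dates for each ID, take their maximum to find the latest date.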

from collections import defaultdict

from bs4 import BeautifulSoup as BS
from dateutil import parser as du_parser

data = "<tr><td class=\"success\"></td><td class=\"truncate\">ABCD</td><td>12/18/2018 21:45</td><td>12/18/2018 21:46</td><td>10</td><td>10</td><td>100.0</td><td><span class=\"label success\">Success</span></td><td>SMS</td><td><a data-id=\"134717\" class=\"btn\" title=\"Go\">View</a></td></tr>"
b1 = BS(data, "html.parser")

# rows (<tr>) that contain a <td> whose text is exactly "ABCD"
tr_that_contain_our_td = [x.parent for x in b1.find_all("td", string="ABCD")]

ids_dict = defaultdict(list)

# iterate over matched tr's to get their dates
for tr in tr_that_contain_our_td:
    extracted_id = tr.find("a")['data-id']

    for td in tr.find_all("td"):
        text = td.get_text(strip=True)
        # assumption: real dates contain "/"; this skips empty cells and bare
        # numbers such as "10", which dateutil would accept as a day of the month
        if "/" not in text:
            continue
        try:
            ids_dict[extracted_id].append(du_parser.parse(text))
        except (ValueError, OverflowError):
            pass  # not a date, nothing to do here

# keep only the latest date seen for each id
ids_dict = {k: max(v) for k, v in ids_dict.items()}

print(ids_dict)
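
The dictionary printed above maps each data-id to its latest date, while the question asks for a single id. One more `max` call over the dictionary gets there. A minimal sketch, assuming a dictionary shaped like `ids_dict`; the second entry and the name `latest_id` are made up for illustration:

```python
from datetime import datetime

# Hypothetical result in the same shape as ids_dict: data-id -> latest datetime.
# Only the first entry comes from the snippet in the question.
ids_dict = {
    "134717": datetime(2018, 12, 18, 21, 46),
    "134900": datetime(2018, 12, 20, 21, 46),
}

# Pick the key whose datetime value is the largest.
latest_id = max(ids_dict, key=ids_dict.get)
print(latest_id)  # 134900
```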

Answer 1 (score: 2)

Assuming the HTML follows the same pattern:

Given:

html = '''                   <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/18/2018 21:45</td>
                        <td>12/18/2018 21:46</td>
                        <td>10</td>
                        <td>10</td>
                        <td>100.0</td>
                        <td><span class="label success">Success</span></td>
                        <td>SMS</td>
                        <td>
                            <a data-id="134717" class="btn" title="Go">View</a>
                        </td>
                    </tr>


                    <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/20/2018 21:45</td>
                        <td>12/20/2018 21:46</td>
                        <td>99</td>
                        <td>99</td>
                        <td>999.0</td>
                        <td><span class="label success999">Success</span></td>
                        <td>SMS99</td>
                        <td>
                            <a data-id="9913471799" class="btn" title="Go">View</a>
                        </td>
                    </tr>

                                        <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/22/2018 21:45</td>
                        <td>12/22/2018 21:46</td>
                        <td>99</td>
                        <td>99</td>
                        <td>999.0</td>
                        <td><span class="label success999">Success</span></td>
                        <td>SMS99</td>
                        <td>
                            <a data-id="found the latest date" class="btn" title="Go">View</a>
                        </td>
                    </tr>

                                        <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/21/2018 21:45</td>
                        <td>12/21/2018 21:46</td>
                        <td>99</td>
                        <td>99</td>
                        <td>999.0</td>
                        <td><span class="label success999">Success</span></td>
                        <td>SMS99</td>
                        <td>
                            <a data-id="9913471799" class="btn" title="Go">View</a>
                        </td>
                    </tr>'''

Find the latest date:

import datetime
import re

import bs4

dates_list = []

soup = bs4.BeautifulSoup(html, 'html.parser')

for i in soup.select("td.truncate"):
    # first MM/DD/YYYY string in the text of the enclosing row
    match = re.search(r'\d{2}/\d{2}/\d{4}', i.parent.text)
    if match is None:
        continue  # this row has no date
    # keep real date objects so the comparison is chronological, also across years
    dates_list.append(datetime.datetime.strptime(match.group(), '%m/%d/%Y').date())

# format the newest date back into the page's text form for the lookup below
most_recent = max(dates_list).strftime('%m/%d/%Y')

rows = soup.find_all('tr')
for row in rows:
    if most_recent in row.text:
        id1 = row.find("a").get('data-id')
        print(id1)
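
A possible variation that avoids the second loop over the rows is to keep each date together with its data-id and take the maximum once. This is only a sketch under assumptions: the pairs below stand in for values pulled from each row (for example with `row.find_all("td")[2].get_text()` and `row.find("a")["data-id"]`), and the ids and the 2019 entry are invented to show that parsed dates compare correctly across a year boundary.

```python
from datetime import datetime

# Hypothetical (date text, data-id) pairs, one per ABCD row; the ids are made up.
rows = [
    ("12/18/2018 21:45", "134717"),
    ("01/03/2019 08:10", "135002"),
    ("12/22/2018 21:45", "134950"),
]


def row_date(row):
    """Parse the date text of a (date text, data-id) pair."""
    return datetime.strptime(row[0], "%m/%d/%Y %H:%M")


# max() with a key compares the parsed datetimes, so the 2019 row wins.
latest_text, latest_id = max(rows, key=row_date)
print(latest_id)  # 135002
```

Comparing the raw strings instead would rank "12/22/2018 21:45" above "01/03/2019 08:10", because string comparison starts with the month.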

Answer 2 (score: 0)

If the data-id numbers keep increasing, you can use max() to select the a tag with the highest data-id value.

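# compare the ids as integers; comparing the raw strings would be lexicographic
recentDataID = max((x.get('data-id') for x in soup.select("a[data-id]")), key=int)
print(recentDataID)

# if you want to select the parent or `tr`
# (the attribute value is quoted because a bare number is not a valid CSS identifier)
mostRecentRow = soup.select_one('a[data-id="%s"]' % recentDataID).parent.parent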
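
One thing to watch with this approach: attribute values come back as strings, so `max()` only picks the numerically largest id when it is told to compare them as integers. A small sketch with made-up ids (none of them are from the question apart from 134717):

```python
# Hypothetical data-id values as they come out of the HTML attributes (strings).
data_ids = ["134717", "99850", "134690"]

# Plain max() compares strings character by character, so "9..." wins.
lexicographic_max = max(data_ids)

# key=int compares the numeric values while still returning the original string.
numeric_max = max(data_ids, key=int)

print(lexicographic_max)  # 99850
print(numeric_max)        # 134717
```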