Is there a more Pythonic way to parse my table with BeautifulSoup?

Asked: 2015-02-19 10:10:11

Tags: python parsing beautifulsoup strip

I am new to Python. I have an HTML page containing a table like the one below, and I would like to parse and process its data in a tidier way.

<table border="1">
    <tr><td><b>Test Results</b></td><td><b>Log File</b></td><td><b>Passes</b></td><td><b>Fails</b></td></tr>
    <tr><td><b>Test suite A</b></td><td><a href="A_logs.html">Logs</a></td><td><b>10</b></td><td><b>0</b></td></tr>
    <tr><td><b>Test suite B</b></td><td><a href="B_logs.html">Logs</a></td><td><b>20</b></td><td><b>0</b></td></tr>
    <tr><td><b>Test suite C</b></td><td><a href="C_logs.html">Logs</a></td><td><b>15</b></td><td><b>0</b></td></tr>
</table>

Using BeautifulSoup, I have parsed the table as follows.

results_table = tables[0] # This will get the first table on the page.
table_rows = results_table.findChildren(['th','tr'])

for i in table_rows:
    text = str(i)
    print( "All rows:: {0}\n".format(text))
    if "Test suite A" in text:
        print( "Test Suite: {0}".format(text))
        # strip out html characters
        list = str(BeautifulSoup(text).findAll( text = True )) 
        # strip out any further stray characters such as [,] 
        list = re.sub("[\'\[\]]", "", list) 
        list = list.split(',') # split my list entries by comma
        print("Test: {0}".format(str(list[0])))
        print("Logs: {0}".format(str(list[1])))
        print("Pass: {0}".format(str(list[3])))
        print("Fail: {0}".format(str(list[4])))

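This code does everything I want. I am just wondering whether there is a more Pythonic way to do it. Ignore the print statements, since I intend to move this into its own method that takes the results table and returns pass, fail, logs, and test.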

So...

def parseHtml(results_table):
    # split out all rows in my table into a list
    table_rows = results_table.findChildren(['th','tr'])
    for i in table_rows:
        text = str(i)
        if "Test suite A" in text:
            # strip out html characters
            list = str(BeautifulSoup(text).findAll( text = True )) 
            # strip out any further stray characters such as [,] 
            list = re.sub("[\'\[\]]", "", list) 
            # split my list entries by comma
            list = list.split(',') 
    return (list[0],list[1],list[3],list[4])

2 Answers:

Answer 0 (score: 1)

html="""<table border="1">
    <tr><td><b>Test Results</b></td><td><b>Log File</b></td><td><b>Passes</b></td><td><b>Fails</b></td></tr>
    <tr><td><b>Test suite A</b></td><td><a href="A_logs.html">Logs</a></td><td><b>10</b></td><td><b>0</b></td></tr>
    <tr><td><b>Test suite B</b></td><td><a href="B_logs.html">Logs</a></td><td><b>20</b></td><td><b>0</b></td></tr>
    <tr><td><b>Test suite C</b></td><td><a href="C_logs.html">Logs</a></td><td><b>15</b></td><td><b>0</b></td></tr>
</table>"""


from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("b")  # every <b> tag: the four header cells, then three per data row

# ignore Test Result etc.. and get Test suite A ... from each row
data = (data[4:][i:i+3] for i in range(0, len(data[4:]),3))
# get all log file names 
logs = iter(x["href"] for x in soup.find_all("a",href=True))

# unpack each subelement and print the tag text
for a, b, c in data:
    print("Test: {}, Log: {}, Pass: {}, Fail: {}".format(a.text ,next(logs),b.text, c.text))


Test: Test suite A, Log: A_logs.html, Pass: 10, Fail: 0
Test: Test suite B, Log: B_logs.html, Pass: 20, Fail: 0
Test: Test suite C, Log: C_logs.html, Pass: 15, Fail: 0

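Don't use list as a variable name, because it shadows Python's built-in list. And if all you want is to iterate over or index into the elements returned by your find_all call, don't use re.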
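To make that advice concrete, here is a minimal sketch of the asker's function written without re and without a variable named list. The name parse_suite and its suite_name parameter are made up for illustration, and the sketch assumes every data row has exactly four cells, as in the sample table.

```python
from bs4 import BeautifulSoup

HTML = """<table border="1">
    <tr><td><b>Test Results</b></td><td><b>Log File</b></td><td><b>Passes</b></td><td><b>Fails</b></td></tr>
    <tr><td><b>Test suite A</b></td><td><a href="A_logs.html">Logs</a></td><td><b>10</b></td><td><b>0</b></td></tr>
    <tr><td><b>Test suite B</b></td><td><a href="B_logs.html">Logs</a></td><td><b>20</b></td><td><b>0</b></td></tr>
    <tr><td><b>Test suite C</b></td><td><a href="C_logs.html">Logs</a></td><td><b>15</b></td><td><b>0</b></td></tr>
</table>"""


def parse_suite(results_table, suite_name):
    """Return (test, log, passes, fails) for the matching row, or None.

    Hypothetical helper: the name and signature are illustrative only.
    """
    for row in results_table.find_all("tr"):
        cells = row.find_all("td")
        # Skip any row that does not have the four expected cells.
        if len(cells) != 4:
            continue
        test = cells[0].get_text(strip=True)
        if test == suite_name:
            # The log cell holds a link, so read its href rather than its text.
            link = cells[1].find("a", href=True)
            log = link["href"] if link else None
            passes = cells[2].get_text(strip=True)
            fails = cells[3].get_text(strip=True)
            return (test, log, passes, fails)
    return None


soup = BeautifulSoup(HTML, "html.parser")
result = parse_suite(soup.find("table"), "Test suite A")
print(result)
```

Working on Tag objects directly avoids converting each row to a string and stripping brackets and quotes afterwards, which is what the re.sub call in the question was doing.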

Answer 1 (score: -1)

In this case, I tend to iterate over 'tr' and then over 'td'.

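from bs4 import BeautifulSoup

# assumes my_table holds the table's HTML as a string
bs_table = BeautifulSoup(my_table, "html.parser")
ls_rows = []
for ls_tr in bs_table.find_all('tr'):
    # one list of cell texts per table row
    ls_rows.append([td_bloc.text for td_bloc in ls_tr.find_all('td')])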
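Once the rows are plain lists of strings, the header row can be paired with each data row. The following is only a sketch: the literal ls_rows below is what the loop above should yield for the sample table in the question, and the names header, data_rows, and records are illustrative.

```python
# Shape of ls_rows expected from the tr/td loop for the sample table.
ls_rows = [
    ["Test Results", "Log File", "Passes", "Fails"],
    ["Test suite A", "Logs", "10", "0"],
    ["Test suite B", "Logs", "20", "0"],
    ["Test suite C", "Logs", "15", "0"],
]

# The first row is the header; the remaining rows are data.
header, *data_rows = ls_rows

# Pair each header name with the cell in the same column.
records = [dict(zip(header, row)) for row in data_rows]

for record in records:
    print(record["Test Results"], record["Passes"], record["Fails"])
```

Note that td.text on the log cell gives only the link text ("Logs"). To keep the file name, the href attribute of the a tag would have to be read separately.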