Here is the HTML content:
<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead">
<th colspan="3">Expression</th>
</tr>
<tr class="colhead">
<th>Task</th>
<th>Action</th>
<th>List</th>
</tr>
<tr class="rowLight">
<td width="40%">
Task1
</td>
<td width="20%">
Assigned to
</td>
<td width="40%">
Harry
</td>
</tr>
<tr class="rowDark">
<td width="40%">
Task2
</td>
<td width="20%">
Rejected by
</td>
<td width="40%">
Lopa
</td>
</tr>
<tr class="rowLight">
<td width="40%">
Task5
</td>
<td width="20%">
Accepted By
</td>
<td width="40%">
Mathew
</td>
</tr>
</table>
Now I need to get the values as shown below. (The table below is just an Excel sheet, which I will build once I have the values.)
Task Action List
Task1 Assigned to Harry
Task2 Rejected by Lopa
Task5 Accepted By Mathew
The only naive solution I know is the following:
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_URL)
alltables = soup.findAll( "table", {"border":"2", "width":"100%"} )
t = [x for x in soup.findAll('td')]
[x.renderContents().strip('\n') for x in t]
But that structure does not exist in my HTML content above, so how should I handle it? Please guide me!
Answer 0 (score: 2)
Use .stripped_strings to get the "interesting" text from the table rows:
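# assumes `soup` is the parsed document from the question
table = soup.find('table', class_='data')
# keep only the data rows; the two "colhead" header rows are skipped
rows = table.find_all('tr', class_=('rowLight', 'rowDark'))
for row in rows:
    print(list(row.stripped_strings))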
Output:
[u'Task1', u'Assigned to', u'Harry']
[u'Task2', u'Rejected by', u'Lopa']
[u'Task5', u'Accepted By', u'Mathew']
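For a self-contained check, the loop above can be run against an abbreviated copy of the question's table. This is a minimal sketch, assuming Python 3 and the built-in "html.parser" backend (neither is stated in the original), so the printed lists carry no u prefix.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's table; the newlines inside the
# cells are kept on purpose to show that stripped_strings removes them.
html = """
<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead"><th colspan="3">Expression</th></tr>
<tr class="colhead"><th>Task</th><th>Action</th><th>List</th></tr>
<tr class="rowLight"><td width="40%">
Task1
</td><td width="20%">
Assigned to
</td><td width="40%">
Harry
</td></tr>
<tr class="rowDark"><td width="40%">
Task2
</td><td width="20%">
Rejected by
</td><td width="40%">
Lopa
</td></tr>
<tr class="rowLight"><td width="40%">
Task5
</td><td width="20%">
Accepted By
</td><td width="40%">
Mathew
</td></tr>
</table>
"""

# "html.parser" is an assumption; any parser supported by bs4 should do.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="data")

# Match rows whose class is either rowLight or rowDark (skips colhead rows).
rows = table.find_all("tr", class_=("rowLight", "rowDark"))

# One list of cell texts per row, with surrounding whitespace stripped.
data = [list(row.stripped_strings) for row in rows]
for record in data:
    print(record)
```

Filtering on the row classes is what keeps the two colhead rows out of the result, so no special handling of the header is needed.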
To put everything into a list of lists instead (excluding the last row, as requested):
data = [list(r.stripped_strings) for r in rows[:-1]]
This gives:
data = [[u'Task1', u'Assigned to', u'Harry'], [u'Task2', u'Rejected by', u'Lopa']]
The result of .find_all(), a ResultSet, acts just like a Python list, so you can slice it at will to ignore certain rows, for example.
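Since the question mentions building an Excel sheet from these values, here is one possible next step: writing the list of lists as CSV text, which Excel can open. This is only a sketch under assumptions. The csv module is a swapped-in choice that the original answer does not mention, rows_to_csv is a hypothetical helper name, and the data is hard-coded (Python 3 strings) so the example runs without BeautifulSoup.

```python
import csv
import io

# Column names taken from the question's desired output.
header = ["Task", "Action", "List"]

# Same shape as the list of lists produced by stripped_strings.
data = [
    ["Task1", "Assigned to", "Harry"],
    ["Task2", "Rejected by", "Lopa"],
    ["Task5", "Accepted By", "Mathew"],
]


def rows_to_csv(header, rows):
    """Return CSV text: one header line followed by one line per row.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Plain lists slice the same way a ResultSet does: drop the last row.
csv_text = rows_to_csv(header, data[:-1])
print(csv_text)
```

To keep every row, pass data instead of data[:-1]; the returned text can then be written to a file of your choosing.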