I am using BeautifulSoup to parse an HTML document with the following structure:
<table>
<tr>
<th>Thread</th>
<td> (555EEE555)<br/>
<table>
<tr>
<th>Participants</th>
<td>John Doe<br/>Jane Doe<br/>
</td>
</tr>
</table><br/><br/>
<table>
<tr>
<th>Author</th>
<td>John Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-16 19:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Test message with some body text<br/>
</td>
</tr>
</table><br/>
<table>
<tr>
<th>Author</th>
<td>Jane Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-17 08:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Second test message with some body text<br/>
</td>
</tr>
</table><br/>
</td>
</tr>
</table>
This message structure repeats throughout the document. I need to parse out individual messages by grouping the Author, Sent, and Body tables. Here is my code so far:
with open(path) as g:
    soup = BeautifulSoup(g, 'html.parser')

table_parent = soup.find('td')
for idx, i in enumerate(table_parent.find_all('table', recursive=False)):
    for x in i.find_all('table'):
        print 'key: %s | data: %s' % (x.th.get_text(), x.td.get_text())
This prints the following:
key: Current Participants | data: John DoeJane Doe
key: Author | data: John Doe
key: Sent | data: 2017-10-16 19:03:23 UTC
key: Body | data: Test message with some body text
How can I write code that walks the entire document and groups each Author, Sent, and Body set so that every individual message is parsed out separately?
Answer 0 (score: 1)
Assuming you always have a single main table as the parent, you should be able to do this:
from bs4 import BeautifulSoup as soup
html = """<table>
<tr>
<th>Thread</th>
<td> (555EEE555)<br/>
<table>
<tr>
<th>Participants</th>
<td>John Doe<br/>Jane Doe<br/>
</td>
</tr>
</table><br/><br/>
<table>
<tr>
<th>Author</th>
<td>John Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-16 19:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Test message with some body text<br/>
</td>
</tr>
</table><br/>
<table>
<tr>
<th>Author</th>
<td>Jane Doe<br/></td>
</tr>
</table>
<table>
<tr>
<th>Sent</th>
<td>2017-10-17 08:03:23 UTC<br/>
</td>
</tr>
</table>
<table>
<tr>
<th>Body</th>
<td>Second test message with some body text<br/>
</td>
</tr>
</table><br/>
</td>
</tr>
</table>"""
def _get_obj():
    r = {
        'Author': '',
        'Sent': '',
        'Body': ''
    }
    return r
page = soup(html, 'html.parser')
main_table = page.find('table')
result = []
r = _get_obj()
for t in main_table.find_all('table'):
    if t.find('th', text='Author'):
        r['Author'] = t.find('td').get_text()
    if t.find('th', text='Sent'):
        r['Sent'] = t.find('td').get_text()
    if t.find('th', text='Body'):
        r['Body'] = t.find('td').get_text()
        result.append(r)
        r = _get_obj()
print(result)
Output:
[
{'Author': 'John Doe', 'Sent': '2017-10-16 19:03:23 UTC\n', 'Body': 'Test message with some body text\n'},
{'Author': 'Jane Doe', 'Sent': '2017-10-17 08:03:23 UTC\n', 'Body': 'Second test message with some body text\n'}
]
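If the document contains several top-level thread tables, the same grouping can be wrapped in an outer loop over `find_all('table', recursive=False)`. The snippet below is a minimal sketch of that idea; the two-thread HTML string is a hypothetical stand-in, not the asker's real markup:

```python
from bs4 import BeautifulSoup

# Hypothetical document with two top-level thread tables,
# each containing Author/Sent/Body sub-tables.
html = """
<table><tr><th>Thread</th><td>(1)
  <table><tr><th>Author</th><td>John Doe</td></tr></table>
  <table><tr><th>Sent</th><td>2017-10-16 19:03:23 UTC</td></tr></table>
  <table><tr><th>Body</th><td>First body</td></tr></table>
</td></tr></table>
<table><tr><th>Thread</th><td>(2)
  <table><tr><th>Author</th><td>Jane Doe</td></tr></table>
  <table><tr><th>Sent</th><td>2017-10-17 08:03:23 UTC</td></tr></table>
  <table><tr><th>Body</th><td>Second body</td></tr></table>
</td></tr></table>
"""

page = BeautifulSoup(html, 'html.parser')
threads = []
# recursive=False restricts the search to top-level thread tables
for main_table in page.find_all('table', recursive=False):
    messages = []
    msg = {}
    for t in main_table.find_all('table'):
        th = t.find('th')
        if th and th.get_text() in ('Author', 'Sent', 'Body'):
            msg[th.get_text()] = t.find('td').get_text().strip()
            if th.get_text() == 'Body':
                # Body is the last field of a message, so flush the record
                messages.append(msg)
                msg = {}
    threads.append(messages)
print(threads)
```

As in the answer above, this relies on `Body` always being the last of the three sub-tables in each message.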
Answer 1 (score: 0)
Here is my solution, continuing from your code:
table_parent = soup.find('td')
tables = table_parent.find_all('table', recursive=False)
tables_str = " ".join([str(t) for t in tables[1:]])
soup_tables = BeautifulSoup(tables_str, 'html.parser')
trs = soup_tables.find_all("tr")
for i in xrange(0, len(trs), 3):
    print(trs[i].contents[1].text, trs[i].contents[3].text)
    print(trs[i+1].contents[1].text, trs[i+1].contents[3].text)
    print(trs[i+2].contents[1].text, trs[i+2].contents[3].text)
    print("-"*8)
This prints:
(u'Author', u'John Doe')
(u'Sent', u'2017-10-16 19:03:23 UTC\n')
(u'Body', u'Test message with some body text\n')
--------
(u'Author', u'Jane Doe')
(u'Sent', u'2017-10-17 08:03:23 UTC\n')
(u'Body', u'Second test message with some body text\n')
--------
If you need any explanation, just ask.
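In Python 3 the same chunk-by-three idea can be written by zipping one iterator with itself, which avoids index arithmetic. This is a sketch that assumes, like the answer above, that every message contributes exactly three rows in Author/Sent/Body order; the flattened HTML string is a hypothetical stand-in:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Author/Sent/Body tables of two messages
html = """
<table><tr><th>Author</th><td>John Doe</td></tr></table>
<table><tr><th>Sent</th><td>2017-10-16 19:03:23 UTC</td></tr></table>
<table><tr><th>Body</th><td>First body</td></tr></table>
<table><tr><th>Author</th><td>Jane Doe</td></tr></table>
<table><tr><th>Sent</th><td>2017-10-17 08:03:23 UTC</td></tr></table>
<table><tr><th>Body</th><td>Second body</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')
trs = soup.find_all('tr')
messages = []
# Zipping the same iterator three times steps through the rows
# in non-overlapping chunks of three.
it = iter(trs)
for author_tr, sent_tr, body_tr in zip(it, it, it):
    messages.append({
        tr.th.get_text(): tr.td.get_text().strip()
        for tr in (author_tr, sent_tr, body_tr)
    })
print(messages)
```

Using the th text as the dictionary key also makes the result robust to the three fields being reordered within a chunk.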
Answer 2 (score: 0)
I would rather avoid the tables and focus only on the <th> tags containing the Author, Sent, and Body text. You can then use find_next() to get the next td and grab its text, and use the zip() function to aggregate the data. If the markup is in the variable html_doc, the code below should work.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
authors = [x.find_next('td').text for x in soup.find_all('th', text='Author')]
sent = [x.find_next('td').text.strip() for x in soup.find_all('th', text='Sent')]
body = [x.find_next('td').text.strip() for x in soup.find_all('th', text='Body')]
for item in zip(authors, sent, body):
    print(item)
Output:
('John Doe', '2017-10-16 19:03:23 UTC', 'Test message with some body text')
('Jane Doe', '2017-10-17 08:03:23 UTC', 'Second test message with some body text')
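If you would rather end up with one dictionary per message instead of tuples, the zipped fields can be combined with a fixed key order. A small follow-on sketch of this answer's approach; the one-message html_doc string here is a minimal hypothetical stand-in for the real document:

```python
from bs4 import BeautifulSoup

# Minimal hypothetical stand-in for the html_doc variable the answer assumes
html_doc = """
<table><tr><th>Author</th><td>John Doe</td></tr></table>
<table><tr><th>Sent</th><td>2017-10-16 19:03:23 UTC</td></tr></table>
<table><tr><th>Body</th><td>Test message with some body text</td></tr></table>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
authors = [x.find_next('td').text.strip() for x in soup.find_all('th', text='Author')]
sent = [x.find_next('td').text.strip() for x in soup.find_all('th', text='Sent')]
body = [x.find_next('td').text.strip() for x in soup.find_all('th', text='Body')]

# Pair each (author, sent, body) tuple with the field names
messages = [dict(zip(('Author', 'Sent', 'Body'), fields))
            for fields in zip(authors, sent, body)]
print(messages)
```

Note that zip() silently truncates to the shortest list, so a message missing one of its three tables would shift every later message's fields; the grouping answers above fail more visibly in that case.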