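I'm trying to extract some information from a table on a web page, but the table is unstructured: the headers sit in the rows, with each value in the cell next to its header, like this (sorry, I can't share the page itself):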
<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>
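Currently, I can extract the table with the following script:
import pandas as pd
import requests

url = "https://example.com"  # placeholder URL; requests needs the scheme
r = requests.get(url)
# read_html returns a list of DataFrames, one per <table> it finds
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
df.to_csv('myfile.csv', encoding='utf-8-sig')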
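The table basically looks like this:
However, I'm not sure how to achieve this in Python, and I can't seem to get the data into the shape I need. The result I want is as follows:
Any help would be greatly appreciated. Thank you very much in advance.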
Answer 0 (score: 2)
You can use BeautifulSoup to parse the HTML. For example:
import pandas as pd
from bs4 import BeautifulSoup
txt = '''<table class="table-detail">
<tbody>
<tr>
<td colspan="4" class="noborder">General Information
</td>
</tr>
<tr>
<th>Full name</th>
<td>
James Smith
</td>
<th>Year of birth</th>
<td>1992</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
<tr>
<th>Place of birth</th>
<td>TTexas, USA</td>
<td> </td>
<td> </td>
</tr>
<tr>
<th>Address</th>
<td>Texas, USA</td>
<td> </td>
<td></td>
</tr>'''
soup = BeautifulSoup(txt, 'html.parser')

row = {}
# 'th:has(+td)' matches each <th> that is immediately followed by a <td>
for h in soup.select('th:has(+td)'):
    row[h.text] = h.find_next('td').get_text(strip=True)

df = pd.DataFrame([row])
print(df)
Prints:
     Full name  Year of birth  Gender  Place of birth     Address
0  James Smith           1992    Male     TTexas, USA  Texas, USA
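If the real page contains other tables or stray <th> cells, it may help to restrict the search to the detail table and pair cells row by row. The following is only a sketch under that assumption: the function name detail_table_to_dict and the shortened sample markup are made up for illustration, and the table-detail class is taken from the question's HTML.

```python
from bs4 import BeautifulSoup


def detail_table_to_dict(html, table_class="table-detail"):
    """Collect <th> label / <td> value pairs from the first matching table.

    Hypothetical helper: the name and the default class are assumptions
    based on the markup shown in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return {}

    row = {}
    for tr in table.find_all("tr"):
        # Direct children only, so nested tables would not be mixed in
        cells = tr.find_all(["th", "td"], recursive=False)
        # Walk adjacent cell pairs and keep those shaped like <th><td>
        for label, value in zip(cells, cells[1:]):
            if label.name == "th" and value.name == "td":
                row[label.get_text(strip=True)] = value.get_text(strip=True)
    return row


# Shortened version of the question's table, used only as sample input
sample = '''<table class="table-detail"><tbody>
<tr><td colspan="4" class="noborder">General Information</td></tr>
<tr><th>Full name</th><td>
James Smith
</td><th>Year of birth</th><td>1992</td></tr>
<tr><th>Gender</th><td>Male</td></tr>
<tr><th>Address</th><td>Texas, USA</td><td> </td><td></td></tr>
</tbody></table>'''

record = detail_table_to_dict(sample)
print(record)
```

The returned dict can be passed to pd.DataFrame([record]) as in the answer above, and the html argument could be r.text from the question's requests call.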