I read the following DataFrame using Python pandas:
<style type="text/css">
table.tableizer-table {
font-size: 12px;
border: 1px solid #CCC;
font-family: Arial, Helvetica, sans-serif;
}
.tableizer-table td {
padding: 4px;
margin: 3px;
border: 1px solid #CCC;
}
.tableizer-table th {
background-color: #104E8B;
color: #FFF;
font-weight: bold;
}
</style>
<table class="tableizer-table">
<thead><tr class="tableizer-firstrow"><th>Time</th><th>Angle</th><th>Angle</th><th>Angle</th><th>Angle</th><th>FUEL_1</th><th>FUEL_2</th><th>Speed</th></tr></thead><tbody>
<tr><td>3:06:38</td><td>5.3</td><td>5.3</td><td>5.3</td><td>5.3</td><td>1150</td><td> </td><td>1328</td></tr>
<tr><td>3:06:39</td><td>5.3</td><td>5.3</td><td>5.3</td><td>5.3</td><td> </td><td> </td><td>1328</td></tr>
<tr><td>3:06:40</td><td>5.3</td><td>5.3</td><td>5.3</td><td>5.3</td><td> </td><td>1150</td><td>1344</td></tr>
<tr><td>3:06:41</td><td>5.3</td><td>5.6</td><td>5.6</td><td>5.6</td><td> </td><td> </td><td>1392</td></tr>
<tr><td>3:06:42</td><td>5.6</td><td>5.6</td><td>5.6</td><td>5.6</td><td>1160</td><td> </td><td>1456</td></tr>
<tr><td>3:06:43</td><td>5.6</td><td>5.6</td><td>6</td><td>6</td><td> </td><td> </td><td>1520</td></tr>
<tr><td>3:06:44</td><td>6</td><td>6</td><td>6</td><td>6</td><td> </td><td>1160</td><td>1600</td></tr>
<tr><td>3:06:45</td><td>6</td><td>6</td><td>6</td><td>6.3</td><td> </td><td> </td><td>1696</td></tr>
</tbody></table>
I want to create the following DataFrame:
<table class="tableizer-table">
<thead><tr class="tableizer-firstrow"><th>Time</th><th>Angle</th><th>FUEL_1</th><th>FUEL_2</th><th>Speed</th></tr></thead><tbody>
<tr><td>3:06:38</td><td>5.3</td><td>1150</td><td> </td><td>1328</td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td>3:06:39</td><td>5.3</td><td> </td><td> </td><td>1328</td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td>3:06:40</td><td>5.3</td><td> </td><td>1150</td><td>1344</td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.3</td><td> </td><td> </td><td> </td></tr>
<tr><td>3:06:41</td><td>5.3</td><td> </td><td> </td><td>1392</td></tr>
<tr><td> </td><td>5.6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.6</td><td> </td><td> </td><td> </td></tr>
<tr><td>3:06:42</td><td>5.6</td><td>1160</td><td> </td><td>1456</td></tr>
<tr><td> </td><td>5.6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>5.6</td><td> </td><td> </td><td> </td></tr>
<tr><td>3:06:43</td><td>5.6</td><td> </td><td> </td><td>1520</td></tr>
<tr><td> </td><td>5.6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>6</td><td> </td><td> </td><td> </td></tr>
<tr><td>3:06:44</td><td>6</td><td> </td><td>1160</td><td>1600</td></tr>
<tr><td> </td><td>6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>6</td><td> </td><td> </td><td> </td></tr>
<tr><td>3:06:45</td><td>6</td><td> </td><td> </td><td>1696</td></tr>
<tr><td> </td><td>6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>6</td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td>6.3</td><td> </td><td> </td><td></td></tr>
</tbody></table>
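My idea is to insert several empty columns alongside 'Time', 'FUEL_1', 'FUEL_2', and 'Speed', then stack the columns one by one and merge them. Do you have a simpler approach?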
Answer 0: (score: 0)
So I'm fairly sure this could be done easily with pandas.read_html, but I'm not as familiar with it as I am with BeautifulSoup.
html = """<table class="tableizer-table">
<thead><tr class="tableizer-firstrow"><th>Time</th><th>Angle</th><th>Angle</th><th>Angle</th><th>Angle</th><th>FUEL_1</th><th>FUEL_2</th><th>Speed</th></tr></thead><tbody>
<tr><td>3:06:38</td><td>5.3</td><td>5.3</td><td>5.3</td><td>5.3</td><td>1150</td><td> </td><td>1328</td></tr>
<tr><td>3:06:39</td><td>5.3</td><td>5.3</td><td>5.3</td><td>5.3</td><td> </td><td> </td><td>1328</td></tr>
<tr><td>3:06:40</td><td>5.3</td><td>5.3</td><td>5.3</td><td>5.3</td><td> </td><td>1150</td><td>1344</td></tr>
<tr><td>3:06:41</td><td>5.3</td><td>5.6</td><td>5.6</td><td>5.6</td><td> </td><td> </td><td>1392</td></tr>
<tr><td>3:06:42</td><td>5.6</td><td>5.6</td><td>5.6</td><td>5.6</td><td>1160</td><td> </td><td>1456</td></tr>
<tr><td>3:06:43</td><td>5.6</td><td>5.6</td><td>6</td><td>6</td><td> </td><td> </td><td>1520</td></tr>
<tr><td>3:06:44</td><td>6</td><td>6</td><td>6</td><td>6</td><td> </td><td>1160</td><td>1600</td></tr>
<tr><td>3:06:45</td><td>6</td><td>6</td><td>6</td><td>6.3</td><td> </td><td> </td><td>1696</td></tr>
</tbody></table>"""
import pandas as pd
from bs4 import BeautifulSoup

def read_table(html):
    """Build a DataFrame from the <th> header row and the <td> data rows."""
    header, matrix = [], []
    bs = BeautifulSoup(html, "html.parser")
    for row in bs.find_all("tr"):
        if row.find("th"):
            # Header row: collect the column names.
            header = [cell.get_text().strip() for cell in row.find_all("th")]
        else:
            # Data row: collect the cell text; blank cells become empty strings.
            matrix.append([cell.get_text().strip() for cell in row.find_all("td")])
    return pd.DataFrame(matrix, columns=header)
Passing the HTML you provided to this function returns a pandas DataFrame, from which you can then select the columns you need.
df = read_table(html)
df[["Time","FUEL_1","FUEL_2","Speed"]]
      Time  FUEL_1  FUEL_2  Speed
0  3:06:38    1150           1328
1  3:06:39                   1328
2  3:06:40            1150   1344
3  3:06:41                   1392
4  3:06:42    1160           1456
5  3:06:43                   1520
6  3:06:44            1160   1600
7  3:06:45                   1696
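The selection above drops the Angle readings, so it does not yet produce the long layout the question asks for. Here is a minimal sketch of one way to get there, assuming every cell is a string as returned by read_table. The name to_long is a hypothetical helper, not part of pandas, and the sample rows are the first four rows of the question's table.

```python
import pandas as pd

# First four rows from the question; empty strings stand in for blank cells.
header = ["Time", "Angle", "Angle", "Angle", "Angle", "FUEL_1", "FUEL_2", "Speed"]
rows = [
    ["3:06:38", "5.3", "5.3", "5.3", "5.3", "1150", "", "1328"],
    ["3:06:39", "5.3", "5.3", "5.3", "5.3", "", "", "1328"],
    ["3:06:40", "5.3", "5.3", "5.3", "5.3", "", "1150", "1344"],
    ["3:06:41", "5.3", "5.6", "5.6", "5.6", "", "", "1392"],
]
df = pd.DataFrame(rows, columns=header)


def to_long(df, repeated="Angle"):
    """Hypothetical helper: emit one output row per value of the repeated column."""
    others = [c for c in df.columns if c != repeated]
    positions = [i for i, c in enumerate(df.columns) if c == repeated]
    n = len(positions)
    # Row-major flattening keeps each original row's readings together, in order.
    values = df.iloc[:, positions].to_numpy().ravel()
    # Repeat every original row n times, then blank all but the first copy.
    long_df = df[others].loc[df.index.repeat(n)].reset_index(drop=True)
    first_of_group = long_df.index % n == 0
    long_df.loc[~first_of_group, others] = ""
    # Put the flattened readings back right after the Time column.
    long_df.insert(1, repeated, values)
    return long_df


long_df = to_long(df)
print(long_df)
```

This avoids adding empty columns by hand: the repeated columns are flattened in one step and the remaining columns are repeated and blanked to match. If numeric columns are needed afterwards, the blanks would have to be replaced with NaN before converting.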