This is the HTML from which I need to extract all the columns that come after the column "action". Data.html:
<table border="1"><tr><th>Central Repository</th><td><table border="1"><tr><th>Passadena-USA</th><td><table border="1"><tr><th>Fairfax Av.</th><td><table border="1"><tr><th>CMS</th><td><table border="1"><tr><th>action</th><th>address</th><th>machinie_id</th><th>portal</th><th>supplier</th><th>created_by</th><th>date</th><th>portal deficit</th><th>Load Value 1</th><th>Load Value 2</th><th>Load Value 3</th><th>Load Value 4</th><th>Load Value 5</th><th>Sub Load 1</th><th>Sub Load 2</th><th>Sub Load 3</th><th>Sub Load 4</th><th>Sub Load 5</th><th>Coordinates</th><th>Area Code</th><th>pending case id</th><th>project details</th><th>identification number APAC</th><th>site_id</th><th>state</th><th>status</th><th>timestamp</th></tr><tr><td>FP</td><td>1195 Fairfax Avenue </td><td>ZEBA 5841</td><td>NHE-9850</td><td>CMS</td><td>Administrator</td><td>2017/6/19</td><td>687965</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>Relay 4-12 Avery J</td><td>Tonal One B</td><td>2602700</td><td>Tertiary Node</td><td>0</td><td>Volume Sub < 1</td><td>passadena</td><td>PA</td><td>2017/06/19 17:35:56</td></tr></table></td></tr></table></td></tr></table></td></tr></table></td></tr></table>
Here is my Python code:
import pandas as pd
df = pd.read_html('Data.html')  # the file path must be passed as a string
print(df[3])
# shouldn't the index 3 return all the columns that come after "CMS"
Answer 0 (score: 1)
> shouldn't the index 3 return all the columns that come after "CMS"
It should be mentioned that the pd.read_html function returns
dfs : list of DataFrames
and df[3] contains only one of those DataFrames.
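To find out which entry of the returned list holds the table you are after, it can help to look at the shape and first few column labels of every frame. The following is only a sketch: `summarize` is a hypothetical helper, and the small hand-built list `dfs` stands in for whatever `pd.read_html('Data.html')` actually returns on your machine (the number and order of frames can depend on the parser and pandas version).

```python
import pandas as pd


def summarize(dfs):
    """Return (position, shape, first few column labels) for each DataFrame in a list."""
    return [(i, frame.shape, list(frame.columns[:3])) for i, frame in enumerate(dfs)]


# Stand-in for the list returned by pd.read_html('Data.html') (assumed data, for illustration only)
dfs = [
    pd.DataFrame([["Central Repository", "..."]]),
    pd.DataFrame({"action": ["FP"], "address": ["1195 Fairfax Avenue"]}),
]

for position, shape, first_columns in summarize(dfs):
    print(position, shape, first_columns)
```

With the real file you would call `summarize(pd.read_html('Data.html'))` and pick the index whose columns match the innermost table.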
To use the table header cells (<th>action</th><th>address</th><th>machinie_id</th>...) as column names, set the header option to 1 (the row number).
header : int or list-like or None, optional
    The row (or list of rows for a :class:`~pandas.MultiIndex`) to use to make the columns headers.
Test:
In [21]: df = pd.read_html('data.html', header=1)
In [22]: df[3].columns
Out[22]:
Index(['action', 'address', 'machinie_id', 'portal', 'supplier', 'created_by',
'date', 'portal deficit', 'Load Value 1', 'Load Value 2',
'Load Value 3', 'Load Value 4', 'Load Value 5', 'Sub Load 1',
'Sub Load 2', 'Sub Load 3', 'Sub Load 4', 'Sub Load 5', 'Coordinates',
'Area Code', 'pending case id', 'project details',
'identification number APAC', 'site_id', 'state', 'status', 'timestamp',
'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30',
'Unnamed: 31', 'Unnamed: 32', 'Unnamed: 33', 'Unnamed: 34',
'Unnamed: 35', 'Unnamed: 36', 'Unnamed: 37', 'Unnamed: 38',
'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41', 'Unnamed: 42',
'Unnamed: 43', 'Unnamed: 44', 'Unnamed: 45', 'Unnamed: 46',
'Unnamed: 47', 'Unnamed: 48', 'Unnamed: 49', 'Unnamed: 50',
'Unnamed: 51', 'Unnamed: 52', 'Unnamed: 53', 'Unnamed: 54',
'Unnamed: 55'],
dtype='object')
In [23]:
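The output above still contains placeholder columns named 'Unnamed: N', and the original question asks only for the columns that come after "action". One possible way to get there is sketched below; `columns_after` is a hypothetical helper, and the tiny `table` frame is an assumed stand-in for `df[3]` rather than the real parsed data.

```python
import pandas as pd


def columns_after(frame, label):
    """Drop placeholder 'Unnamed: N' columns, then keep the columns to the right of `label`."""
    is_placeholder = frame.columns.astype(str).str.startswith("Unnamed:")
    named = frame.loc[:, ~is_placeholder]
    # get_loc returns an integer position when the label is unique
    start = named.columns.get_loc(label) + 1
    return named.iloc[:, start:]


# Assumed miniature version of df[3], for illustration only
table = pd.DataFrame({
    "action": ["FP"],
    "address": ["1195 Fairfax Avenue"],
    "machinie_id": ["ZEBA 5841"],
    "Unnamed: 27": [None],
})

result = columns_after(table, "action")
print(list(result.columns))  # ['address', 'machinie_id']
```

On the real data this would be called as `columns_after(df[3], 'action')`, assuming the 'action' label appears exactly once among the columns.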