I am reading a TSV table from an old-school database into pandas.
The data looks like this:
Iron Oxide (FeO) Fe1O1(cr,l)
T(K)    Cp      S       -[G-H(Tr)]/T  H-H(Tr)  delta-f H  delta-f G  log Kf
0
100
200
298.15  49.915  60.752  60.752        0.       -272.044   -251.429   44.049
300     49.999  61.061  60.753        0.092    -272.025   -251.301   43.755
400     51.840  75.704  62.737        5.187    -271.044   -244.543   31.934
.
.
.
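I skip the first line. The second line is an 8-column header (tab-separated). Each of the next three lines has a single number followed by 10 tabs, and every line after that has 8 fields. So those three lines are the problem.
If I try to read it like this: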
import pandas as pd
FeO = pd.read_csv('JANAF-FeO.txt', skiprows=(0,), delimiter='\t', header=0)
Then I get:
So I can tell pandas to skip those three lines manually:
import pandas as pd
FeO = pd.read_csv('JANAF-FeO.txt', skiprows=(0,2,3,4), delimiter='\t', header=0)
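That works, and I get:
If I were only reading one file, that would be fine: I would skip those lines and be done. But there are many files, and some of them have a variable number of lines with more than 8 columns. So is there a way to make pandas automatically ignore the rows that do not match the header format?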
Answer 0 (score: 3)
If you need a more general solution, try:
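import pandas as pd

# 15 must be at least the maximum number of tab-separated fields per line (enough for my test data)
df1 = pd.read_csv('JANAF-FeO.txt', delimiter='\t', names=list(range(15)))
# drop the columns that contain only NaN
df1 = df1.dropna(axis=1, how='all')
# use the second line of the file as the column names, then drop the title and header rows
df1.columns = df1.iloc[1, :]
df1 = df1[2:]
# keep only the rows that do not have exactly 7 NaN values, i.e. drop the single-value rows
mask = df1.isnull().sum(axis=1) != 7
df1 = df1[mask]
print(df1)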
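The steps above can be wrapped into a small function and tried on in-memory data. This is only a sketch under stated assumptions: `read_janaf`, `max_fields` and the `SAMPLE` text are hypothetical names and data made up for illustration, the sample assumes the title line contains one tab, and the row filter keeps rows with more than one non-missing value instead of counting exactly 7 NaN values.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout described in the question:
# a title line, an 8-column header, three rows holding one number
# followed by 10 tabs, then ordinary 8-field rows.
TABS = "\t" * 10
SAMPLE = "\n".join([
    "Iron Oxide (FeO)\tFe1O1(cr,l)",
    "T(K)\tCp\tS\t-[G-H(Tr)]/T\tH-H(Tr)\tdelta-f H\tdelta-f G\tlog Kf",
    "0" + TABS,
    "100" + TABS,
    "200" + TABS,
    "298.15\t49.915\t60.752\t60.752\t0.\t-272.044\t-251.429\t44.049",
    "300\t49.999\t61.061\t60.753\t0.092\t-272.025\t-251.301\t43.755",
    "400\t51.840\t75.704\t62.737\t5.187\t-271.044\t-244.543\t31.934",
]) + "\n"


def read_janaf(source, max_fields=15):
    """Read a table whose lines have varying field counts (hypothetical helper)."""
    # Read everything as text into a deliberately wide frame; short lines are padded with NaN.
    raw = pd.read_csv(source, sep="\t", header=None,
                      names=list(range(max_fields)), dtype=str)
    # Columns that never hold a value (padding and trailing tabs) are dropped.
    raw = raw.dropna(axis=1, how="all")
    # Row 0 is the title line, row 1 is the real header.
    body = raw.iloc[2:].copy()
    body.columns = list(raw.iloc[1])
    # Keep rows that carry more than one value, which drops the single-number rows.
    body = body[body.notna().sum(axis=1) > 1]
    return body.astype(float).reset_index(drop=True)


df = read_janaf(io.StringIO(SAMPLE))
print(df)
```

Since `read_csv` accepts a path as well as a file-like object, the same function could be called in a loop over many files, provided `max_fields` is at least as large as the widest line in any of them.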
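Answer 1 (score: 2)
It sounds like your problem is the extra tabs hanging off those odd single-value rows.
Fortunately, the sep parameter accepts a regular expression. I recreated your dataset as best I could and got a decent df from the following read_csv: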
ipdb> test = pd.read_csv('test.txt', skiprows=1, header=0, sep=r'\t+', engine='python')
ipdb> test
     T(K)      Cp       S  -[G-H(Tr)]/T  H-H(Tr)  delta-f H  delta-f G  log Kf
0    0.00     NaN     NaN           NaN      NaN        NaN        NaN     NaN
1  100.00     NaN     NaN           NaN      NaN        NaN        NaN     NaN
2  200.00     NaN     NaN           NaN      NaN        NaN        NaN     NaN
3  298.15  49.915  60.752        60.752    0.100   -272.044   -251.429  44.049
4  300.00  49.999  61.061        60.753    0.092   -272.025   -251.301  43.755
5  400.00  51.840  75.704        62.737    5.187   -271.044   -244.543  31.934
Hope this helps!
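For anyone who wants to try the regex-separator idea without a file on disk, here is a minimal sketch. The `SAMPLE` text is made-up data that mimics the layout in the question, and the pattern `\t+` (one or more tabs) together with `engine="python"` is my assumption about what a current pandas needs for a regular-expression separator. The last step, dropping the mostly empty rows with `dropna(thresh=2)`, is an optional addition and not part of the answer above.

```python
import io
import pandas as pd

# Hypothetical sample: title line, 8-column header, three single-value
# rows padded with 10 tabs, then regular 8-field rows.
TABS = "\t" * 10
SAMPLE = "\n".join([
    "Iron Oxide (FeO)\tFe1O1(cr,l)",
    "T(K)\tCp\tS\t-[G-H(Tr)]/T\tH-H(Tr)\tdelta-f H\tdelta-f G\tlog Kf",
    "0" + TABS,
    "100" + TABS,
    "200" + TABS,
    "298.15\t49.915\t60.752\t60.752\t0.\t-272.044\t-251.429\t44.049",
    "300\t49.999\t61.061\t60.753\t0.092\t-272.025\t-251.301\t43.755",
    "400\t51.840\t75.704\t62.737\t5.187\t-271.044\t-244.543\t31.934",
]) + "\n"

# Treat any run of tabs as a single separator, so the trailing tabs on the
# single-value rows no longer produce extra fields.
df = pd.read_csv(io.StringIO(SAMPLE), skiprows=1, header=0,
                 sep=r"\t+", engine="python")

# Optional: drop rows that hold fewer than two actual values.
clean = df.dropna(thresh=2).reset_index(drop=True)
print(clean)
```

One caveat of this design: collapsing runs of tabs also collapses genuinely empty fields in the middle of a row, so it only suits files where missing values occur as trailing tabs.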