Pandas read_csv: ignoring lines that do not conform

Time: 2016-04-07 05:12:17

Tags: python pandas

I am reading a TSV table from an old-school database into pandas.

The data looks like this:

Iron Oxide (FeO)    Fe1O1(cr,l)
T(K)    Cp      S      -[G-H(Tr)]/T   H-H(Tr)   delta-f H   delta-f G   log Kf
0                                                       
100                                                     
200                                                     
298.15  49.915  60.752  60.752         0.       -272.044    -251.429    44.049
300     49.999  61.061  60.753         0.092    -272.025    -251.301    43.755
400     51.840  75.704  62.737         5.187    -271.044    -244.543    31.934
.
.
.

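I skip the first line. The second line is an 8-column header (tab-separated). Each of the next three lines has a single number followed by 10 tabs, and every line after that has 8 fields. So those three lines are the problem.

If I try to read it like this: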
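
A quick way to confirm this layout is to count the tab-separated fields on each line. The snippet below is only a sketch: `sample` is a hypothetical inline excerpt that imitates the file (it is not read from `JANAF-FeO.txt`), and it assumes the header is the second line.

```python
# Hypothetical excerpt imitating the file layout; real files may differ.
sample = (
    "Iron Oxide (FeO)\tFe1O1(cr,l)\n"
    "T(K)\tCp\tS\t-[G-H(Tr)]/T\tH-H(Tr)\tdelta-f H\tdelta-f G\tlog Kf\n"
    + "0" + "\t" * 10 + "\n"      # a single number followed by 10 tabs
    + "100" + "\t" * 10 + "\n"
    + "298.15\t49.915\t60.752\t60.752\t0.\t-272.044\t-251.429\t44.049\n"
    + "300\t49.999\t61.061\t60.753\t0.092\t-272.025\t-251.301\t43.755\n"
)

# Number of tab-separated fields on every line.
field_counts = [len(line.split("\t")) for line in sample.splitlines()]

# Indices of the lines whose field count differs from the header (line 1).
odd_lines = [i for i, n in enumerate(field_counts) if n != field_counts[1]]

print(field_counts)
print(odd_lines)
```

On this sample the title line and the two single-value lines are the ones flagged as not matching the header.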

import pandas as pd
FeO = pd.read_csv('JANAF-FeO.txt', skiprows=(0,), delimiter='\t', header=0)

then I get this:

[image: screenshot of the resulting DataFrame]

So I can tell pandas to skip those three lines manually:

import pandas as pd
FeO = pd.read_csv('JANAF-FeO.txt', skiprows=(0,2,3,4), delimiter='\t', header=0)

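That works, and I get:

[image: screenshot of the resulting DataFrame]

If I were reading only one file, that would be fine: I would skip those lines and be done. But there are many files, and some of them contain a variable number of such lines with more than 8 columns. So is there a way to make pandas automatically ignore the lines that do not match the header format?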

2 Answers:

Answer 0: (score: 3)

If you need a more general solution, try:

import pandas as pd

# 15 is an upper bound on the number of tab-separated fields per line (enough for my test data)
df1 = pd.read_csv('JANAF-FeO.txt', delimiter='\t', names=list(range(15)))

# drop the columns that are entirely NaN
df1 = df1.dropna(axis=1, how='all')
# use the second line of the file as the header, then discard the title and header rows
df1.columns = df1.iloc[1, :]
df1 = df1[2:]

# keep only the rows that do not contain exactly 7 NaN values
mask = df1.isnull().sum(axis=1) != 7
df1 = df1[mask]

print(df1)
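
Since the question involves many files, the same idea can be wrapped in a function. The following is a sketch under a few assumptions: `read_janaf` and `SAMPLE` are hypothetical names, the header is assumed to sit on the second line, and instead of testing for a fixed count of 7 NaN values it drops any row that has fewer than two non-missing values, which may or may not suit every file.

```python
import io

import pandas as pd

# Hypothetical sample imitating the layout described in the question.
SAMPLE = (
    "Iron Oxide (FeO)\tFe1O1(cr,l)\n"
    "T(K)\tCp\tS\t-[G-H(Tr)]/T\tH-H(Tr)\tdelta-f H\tdelta-f G\tlog Kf\n"
    + "0" + "\t" * 10 + "\n"
    + "100" + "\t" * 10 + "\n"
    + "298.15\t49.915\t60.752\t60.752\t0.\t-272.044\t-251.429\t44.049\n"
    + "300\t49.999\t61.061\t60.753\t0.092\t-272.025\t-251.301\t43.755\n"
)


def read_janaf(source, max_fields=15):
    """Read one table and drop the rows that carry only a single value.

    max_fields is an assumed upper bound on the fields per line.
    """
    # Read everything as text so the title and header rows do not matter yet.
    raw = pd.read_csv(source, sep="\t", header=None,
                      names=list(range(max_fields)), dtype=str)
    raw = raw.dropna(axis=1, how="all")   # remove the all-empty columns
    body = raw.iloc[2:].copy()            # skip the title and header rows
    body.columns = list(raw.iloc[1])      # the second line holds the header
    body = body.dropna(thresh=2)          # keep rows with at least 2 values
    return body.reset_index(drop=True).astype(float)


df = read_janaf(io.StringIO(SAMPLE))
print(df)
```

A file path can be passed in place of the `io.StringIO` object, so the function can be called in a loop over the files.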

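Answer 1: (score: 2)

It sounds like your problem is the extra tabs hanging off those odd single-value rows.

Fortunately, the sep parameter accepts a regular expression, so a run of consecutive tabs can be treated as a single delimiter. I recreated your dataset as best I could and got a decent df from the following read_csv: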

ipdb> test = pd.read_csv('test.txt', skiprows=1, header=0, sep=r'\t+', engine='python')
ipdb> test
     T(K)      Cp       S  -[G-H(Tr)]/T  H-H(Tr)  delta-f H  delta-f G  log Kf
0    0.00     NaN     NaN           NaN      NaN        NaN        NaN     NaN
1  100.00     NaN     NaN           NaN      NaN        NaN        NaN     NaN
2  200.00     NaN     NaN           NaN      NaN        NaN        NaN     NaN
3  298.15  49.915  60.752        60.752    0.100   -272.044   -251.429  44.049
4  300.00  49.999  61.061        60.753    0.092   -272.025   -251.301  43.755
5  400.00  51.840  75.704        62.737    5.187   -271.044   -244.543  31.934
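
The single-value rows are still present in that result, filled with NaN. If they should be ignored altogether, as the question asks, one possible follow-up is to drop every row with fewer than two non-missing values. This is a sketch on a hypothetical in-memory sample (`SAMPLE` is made up to imitate the file), assuming `\t+` (one or more tabs) as the separator pattern and the Python parsing engine:

```python
import io

import pandas as pd

# Hypothetical sample imitating the layout described in the question.
SAMPLE = (
    "Iron Oxide (FeO)\tFe1O1(cr,l)\n"
    "T(K)\tCp\tS\t-[G-H(Tr)]/T\tH-H(Tr)\tdelta-f H\tdelta-f G\tlog Kf\n"
    + "0" + "\t" * 10 + "\n"
    + "100" + "\t" * 10 + "\n"
    + "298.15\t49.915\t60.752\t60.752\t0.\t-272.044\t-251.429\t44.049\n"
    + "300\t49.999\t61.061\t60.753\t0.092\t-272.025\t-251.301\t43.755\n"
)

# Skip the title line, take the next line as the header, and treat any
# run of tabs as one delimiter (regex separators need the Python engine).
test = pd.read_csv(io.StringIO(SAMPLE), skiprows=1, header=0,
                   sep=r"\t+", engine="python")

# Drop the rows that hold only the temperature value.
test = test.dropna(thresh=2).reset_index(drop=True)
print(test)
```

Note that treating a run of tabs as one delimiter would also collapse genuinely empty cells in the middle of a row, so this only fits files where missing values appear as trailing tabs.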

Hope this helps!