Reading a CSV with Pandas and handling comments

Time: 2014-08-28 23:56:58

Tags: python csv pandas

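Here is an example of the data files I am trying to read with Pandas. The files have varying numbers of comment lines, but in every file the data starts with BEGIN and ends with END, which may be followed by a newline.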

!Example data file
!With commands delimited by exclamation points
!Not always the same number of comment lines
BEGIN
300,-1.0342501,-0.07359
5298,-0.9889674,0.06514
1029,-0.981307,0.130398
1529,-0.971765,0.1945281
END

This is the Pandas call I use to read these files.

b = pd.read_csv(data_file, names=['Frequency','Real','Imaginary'], comment='!')

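I have two problems. First, it reads every line: the comment lines are filled with NaN, and the END and BEGIN markers are read in as data. That in turn shifts the row index, which is my second problem.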

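What is the right way to have Pandas read this into a DataFrame while dropping the comment lines and the BEGIN and END markers? Is there an elegant line of code that solves both of my problems?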

2 answers:

Answer 0: (score: 2)

How about importing the whole file and then dropping every row whose second field is empty?

import pandas as pd

# Read every line; the BEGIN and END rows get NaN in 'Real' and 'Imaginary'
b = pd.read_csv('sample2.csv', names=['Frequency','Real','Imaginary'], comment='!')
# Keep only the rows whose 'Real' value is present
b2 = b[b['Real'].notna()]

Result b:

  Frequency      Real  Imaginary
0       NaN       NaN        NaN
1       NaN       NaN        NaN
2       NaN       NaN        NaN
3     BEGIN       NaN        NaN
4       300 -1.034250  -0.073590
5      5298 -0.988967   0.065140
6      1029 -0.981307   0.130398
7      1529 -0.971765   0.194528
8       END       NaN        NaN

Result b2:

  Frequency      Real  Imaginary
4       300 -1.034250  -0.073590
5      5298 -0.988967   0.065140
6      1029 -0.981307   0.130398
7      1529 -0.971765   0.194528

To reset the index:

b3 = b2.reset_index(drop=True)

Output of b3:

  Frequency      Real  Imaginary
0       300 -1.034250  -0.073590
1      5298 -0.988967   0.065140
2      1029 -0.981307   0.130398
3      1529 -0.971765   0.194528
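
The steps in this answer can be put together as a self-contained sketch. It reads an abbreviated copy of the sample from an in-memory string rather than from sample2.csv (an assumption, so that it runs without a file on disk). It also adds one step that is not part of the answer: since the Frequency column shares its rows with the BEGIN and END strings, it is typically read as text, so pd.to_numeric is used to turn it back into numbers after filtering.

```python
import io
import pandas as pd

# Abbreviated sample in the question's format; StringIO stands in for the file.
sample = """!Example data file
!Another comment line
BEGIN
300,-1.0342501,-0.07359
5298,-0.9889674,0.06514
1029,-0.981307,0.130398
1529,-0.971765,0.1945281
END
"""

cols = ['Frequency', 'Real', 'Imaginary']
b = pd.read_csv(io.StringIO(sample), names=cols, comment='!')

# Drop every row without a 'Real' value (BEGIN, END, and any all-NaN rows),
# then renumber the index from 0.
b3 = b[b['Real'].notna()].reset_index(drop=True)

# Added step (not in the answer): make 'Frequency' numeric again.
b3['Frequency'] = pd.to_numeric(b3['Frequency'])
print(b3)
```

Filtering on a data column rather than on the marker text means the same code should keep working whether or not the comment lines show up as NaN rows.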

Answer 1: (score: 1)

Here is a variation of your code:

In [125]: df = pd.read_csv('data_file.csv', comment='!', header=0, names=['Frequency','Real','Imaginary'], na_values=['END'])

In [126]: df
Out[126]: 
   Frequency      Real  Imaginary
0        300 -1.034250  -0.073590
1       5298 -0.988967   0.065140
2       1029 -0.981307   0.130398
3       1529 -0.971765   0.194528
4        NaN       NaN        NaN

'END' is converted to NaN in the last row, so we drop the last row:

In [127]: df = df.iloc[:-1]    # or `df = df.dropna()`

In [128]: df
Out[128]: 
   Frequency      Real  Imaginary
0        300 -1.034250  -0.073590
1       5298 -0.988967   0.065140
2       1029 -0.981307   0.130398
3       1529 -0.971765   0.194528
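
Both answers let read_csv parse the marker lines and clean up afterwards. A different route, not taken from either answer, is to cut out the lines between BEGIN and END in plain Python and hand only those to read_csv. The sketch below assumes each file contains one BEGIN and one END, each on a line of its own; the helper name read_between_markers and the abbreviated sample text are hypothetical.

```python
import io
import pandas as pd

def read_between_markers(text, begin="BEGIN", end="END"):
    """Parse only the CSV rows found between the begin and end marker lines."""
    lines = [line.strip() for line in text.splitlines()]
    start = lines.index(begin) + 1   # first data line, just after BEGIN
    stop = lines.index(end, start)   # position of the END line
    body = "\n".join(lines[start:stop])
    return pd.read_csv(io.StringIO(body), header=None,
                       names=["Frequency", "Real", "Imaginary"])

# Abbreviated sample in the question's format.
sample = """!Example data file
!Another comment line
BEGIN
300,-1.0342501,-0.07359
5298,-0.9889674,0.06514
1029,-0.981307,0.130398
1529,-0.971765,0.1945281
END
"""

df = read_between_markers(sample)
print(df)
```

For a file on disk, the text could be obtained with open(data_file).read() before calling the helper. Since read_csv never sees the comments or markers, the columns should come out numeric and the index should start at 0 with no further cleanup.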