Remove a different number of trailing NaNs from each column of a CSV

Time: 2017-04-11 03:32:20

Tags: python pandas numpy scipy

My CSV file always has extra blank cells at the end of each column, like this:

847,73.3,809,74.9,655,80.6,694,45.5,647,47.8
848,24.3,810,23.1,656,18.2,695,48.6,648,47.3
566,26.1,541,7.8,438,19.1,463,45.5,433,18.2
567,0.5,542,0.1,439,0.2,464,53.1,434,0.2
426,0.0,407,0.0,330,0.0,348,98.6,326,0.0
...
339,37.9,324,74.9,,,349,1.4,,
340,62.0,325,25.1,,,,,,
341,0.1,326,0.0,,,,,,

After reading the file with pandas, the blanks become NaN:

pd.read_csv(ref_file)

Result:

0                      694.0        45.5                     647.0        47.8  
1                      695.0        48.6                     648.0        47.3  
2                      696.0         5.6                     649.0         4.8  
3                      697.0         0.3                     650.0         0.2  
4                      698.0         0.0                     432.0        81.6  
5                      463.0        45.5                     433.0        18.2  
6                      464.0        53.1                     434.0         0.2  
7                      465.0         1.4                     324.0        81.6  
8                      466.0         0.0                     325.0        18.4  
9                      348.0        98.6                     326.0         0.0  
10                     349.0         1.4                       NaN         NaN  
11                       NaN         NaN                       NaN         NaN  
12                       NaN         NaN                       NaN         NaN 

I tried

df.last_valid_index()

but it only checks the first column. Every column has a different number of NaNs. How can I remove the NaNs in this case?

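Edit: I tried .dropna(). It cuts every row based on the column with the most NaNs, which is not what I want. I want each column cut off where its own NaNs begin, so the columns should end up with different numbers of rows.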

1 Answer:

Answer 0: (score: 2)

If you want each column as a list, with those lists collected in a Series:

df.T.stack().groupby(level=0).apply(list)

0    [847.0, 848.0, 566.0, 567.0, 426.0, 339.0, 340...
1        [73.3, 24.3, 26.1, 0.5, 0.0, 37.9, 62.0, 0.1]
2    [809.0, 810.0, 541.0, 542.0, 407.0, 324.0, 325...
3         [74.9, 23.1, 7.8, 0.1, 0.0, 74.9, 25.1, 0.0]
4                  [655.0, 656.0, 438.0, 439.0, 330.0]
5                         [80.6, 18.2, 19.1, 0.2, 0.0]
6           [694.0, 695.0, 463.0, 464.0, 348.0, 349.0]
7                  [45.5, 48.6, 45.5, 53.1, 98.6, 1.4]
8                  [647.0, 648.0, 433.0, 434.0, 326.0]
9                         [47.8, 47.3, 18.2, 0.2, 0.0]
dtype: object
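
For anyone who wants to try the column-wise version end to end, here is a minimal sketch on a made-up four-row miniature of the file (the `csv_text` data and the `columns_as_lists` name are assumptions for illustration, not from the question). It assumes the file has no header row, so it passes `header=None`, and it calls `dropna()` explicitly per column instead of relying on `stack()` to discard NaNs, since that behaviour may depend on the pandas version.

```python
import io

import pandas as pd

# Hypothetical miniature of the question's file: no header row,
# and the last two columns end two rows earlier than the first two.
csv_text = """847,73.3,655,80.6
848,24.3,656,18.2
566,26.1,,
567,0.5,,
"""

# header=None keeps the first data row as data and labels the columns 0..3.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Build one list per column, dropping that column's NaNs only.
columns_as_lists = pd.Series(
    {col: df[col].dropna().tolist() for col in df.columns}
)

print(columns_as_lists)
```

Each entry of `columns_as_lists` keeps its own length, which is what the edit to the question asks for. Note that `dropna()` removes every NaN in a column, not only the trailing ones.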

Otherwise, if you want each row as a list:

df.stack().groupby(level=0).apply(list)

0    [847.0, 73.3, 809.0, 74.9, 655.0, 80.6, 694.0,...
1    [848.0, 24.3, 810.0, 23.1, 656.0, 18.2, 695.0,...
2    [566.0, 26.1, 541.0, 7.8, 438.0, 19.1, 463.0, ...
3    [567.0, 0.5, 542.0, 0.1, 439.0, 0.2, 464.0, 53...
4    [426.0, 0.0, 407.0, 0.0, 330.0, 0.0, 348.0, 98...
5               [339.0, 37.9, 324.0, 74.9, 349.0, 1.4]
6                           [340.0, 62.0, 325.0, 25.1]
7                             [341.0, 0.1, 326.0, 0.0]
dtype: object
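
If only the trailing NaNs should go and any NaN in the middle of a column must stay, the `last_valid_index()` idea from the question can be applied to one column at a time. This is a rough sketch on invented data (the frame and the `trimmed` name are assumptions for illustration), not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" has an interior NaN plus one trailing NaN,
# column "b" has two trailing NaNs.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
})

# Series.last_valid_index() gives the label of the last non-NaN value.
# Label slicing with .loc includes the end label, so each column is kept
# up to and including its own last valid row.
trimmed = {
    col: df[col].loc[: df[col].last_valid_index()]
    for col in df.columns
}

for col, series in trimmed.items():
    print(col, series.tolist())
```

The result is a dict of Series with different lengths. A column that is entirely NaN would come back unchanged, because `last_valid_index()` returns `None` for it and the slice then covers the whole column.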