Combine columns together if the other values are empty

Asked: 2014-04-24 14:55:25

Tags: python pandas

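I regularly get data files with a large number of similar columns where, for each row, only one of them actually holds any data, though that is not always the case. Ideally I'd like a function that takes a list of columns to check and, for any row where exactly one of those columns has a value, copies that value into a combined column and sets the source columns to NaN, so that I can easily drop the excess columns at the end. If multiple columns have data, the row should not be merged or changed.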

So, for example, I have this DataFrame:

df = pd.DataFrame({
               "id": pd.Series([1,2,3,4,5,6,7]),
               "a1": pd.Series(['a',np.NaN,np.NaN,'c','d',np.NaN, np.NaN]), 
               "a2": ([np.NaN,'b','c',np.NaN,'d','e', np.NaN]), 
               "a3": ([np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN, 'f'])
               })

Code-wise, this is what I have right now:

import pandas as pd
import numpy as np    
def test(row, index, combined):
    values = 0
    foundix = 0
    #check which if any column has data
    for ix in index:
        if not (pd.isnull(row[ix])):
            values = values + 1
            foundix = ix
    #check that it found only 1 value, if so clean up
    if (values == 1):
        row[combined] = row[foundix]
        for ix in index:
            row[ix] = np.NaN
    return row

df["a"] = np.NaN
df.apply(lambda x: test(x, ["a1", "a2", "a3"], "a"), 1)
print df

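The problems with my code are:

  1. I feel like this is the wrong way to go about solving the problem.
  2. I don't fully understand how to get my apply function to actually apply to the row and change it.
  3. My ideal output would be the following (mainly to help clean the data and handle the odd cases):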

       a1   a2   a3   id  a
    0  NaN  NaN  NaN   1  a
    1  NaN  NaN  NaN   2  b
    2  NaN  NaN  NaN   3  c
    3  NaN  NaN  NaN   4  c
    4    d    d  NaN   5  NaN
    5  NaN  NaN  NaN   6  e
    6  NaN  NaN  NaN   7  f
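
On point 2, one thing worth noting is that `DataFrame.apply` returns a new object and does not modify `df` in place, so the result has to be assigned back. A minimal sketch of that, assuming a hypothetical helper `merge_single` (not part of the code above) that follows the same logic as `test` but works on a copy of the row:

```python
import numpy as np
import pandas as pd

def merge_single(row, cols, combined):
    # hypothetical helper: same idea as `test`, operating on a copy of the row
    row = row.copy()
    present = row[cols].dropna()
    # only merge when exactly one of the candidate columns has a value
    if len(present) == 1:
        row[combined] = present.iloc[0]
        row[cols] = np.nan
    return row

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "a1": ["a", np.nan, np.nan, "c", "d", np.nan, np.nan],
    "a2": [np.nan, "b", "c", np.nan, "d", "e", np.nan],
    "a3": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, "f"],
})

df["a"] = np.nan
# apply returns a new DataFrame, so assign it back to df
df = df.apply(merge_single, axis=1, args=(["a1", "a2", "a3"], "a"))
print(df)
```

Row 4, which has values in both `a1` and `a2`, is left unchanged by this sketch.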
    

1 Answer:

Answer 0 (score: 0)

My approach seems to be slightly faster:

In [415]:

df = pd.DataFrame({
               "id": pd.Series([1,2,3,4,5,6,7]),
               "a1": pd.Series(['a',np.NaN,np.NaN,'c','d',np.NaN, np.NaN]), 
               "a2": ([np.NaN,'b','c',np.NaN,'d','e', np.NaN]), 
               "a3": ([np.NaN,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN, 'f'])
               })
df
Out[415]:
    a1   a2   a3  id
0    a  NaN  NaN   1
1  NaN    b  NaN   2
2  NaN    c  NaN   3
3    c  NaN  NaN   4
4    d    d  NaN   5
5  NaN    e  NaN   6
6  NaN  NaN    f   7

[7 rows x 4 columns]
In [416]:

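def gen_col(x):
    # keep the value only when exactly one column in the row is non-null
    vals = x.dropna()
    if len(vals) != 1:
        # covers both "several values" and "no value at all"
        return np.nan
    return vals.iloc[0]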

import pandas as pd
import numpy as np    
def test(row, index, combined):
    values = 0
    foundix = 0
    #check which if any column has data
    for ix in index:
        if not (pd.isnull(row[ix])):
            values = values + 1
            foundix = ix
    #check that it found only 1 value, if so clean up
    if (values == 1):
        row[combined] = row[foundix]
        for ix in index:
            row[ix] = np.NaN
    return row
%timeit df.apply(lambda x: test(x, ["a1", "a2", "a3"], "a"), 1)
%timeit df['a'] = df[['a1','a2','a3']].apply(lambda row: gen_col(row), axis=1)
df
100 loops, best of 3: 7.08 ms per loop
100 loops, best of 3: 3.24 ms per loop
Out[416]:
    a1   a2   a3  id    a
0    a  NaN  NaN   1    a
1  NaN    b  NaN   2    b
2  NaN    c  NaN   3    c
3    c  NaN  NaN   4    c
4    d    d  NaN   5  NaN
5  NaN    e  NaN   6    e
6  NaN  NaN    f   7    f

[7 rows x 5 columns]

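The key thing I'm doing here is checking the number of values left after dropping all the NaN values, which seems to be faster than your code. Note that this only fills the new column `a`; it does not set the original columns to NaN.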
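
The same counting idea can probably be expressed without a row-wise `apply`. The following is a rough sketch, not a benchmarked solution: it assumes the candidate columns are listed in a variable `cols` (a name introduced here for illustration), counts the non-null values per row with `notna().sum(axis=1)`, and also blanks the source columns for the rows that were merged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "a1": ["a", np.nan, np.nan, "c", "d", np.nan, np.nan],
    "a2": [np.nan, "b", "c", np.nan, "d", "e", np.nan],
    "a3": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, "f"],
})

cols = ["a1", "a2", "a3"]  # illustrative name for the columns to check

# rows where exactly one of the candidate columns holds a value
single = df[cols].notna().sum(axis=1) == 1

# first non-null value per row: back-fill across columns, then take the first column
first_value = df[cols].bfill(axis=1).iloc[:, 0]

# keep the value only for the "single" rows; other rows get NaN
df["a"] = first_value.where(single)

# blank out the source columns for the rows that were merged
df.loc[single, cols] = np.nan
print(df)
```

Rows with more than one value (row 4 here) keep their original columns and get NaN in `a`, so they are easy to find and handle separately.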