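I regularly receive data files that have many similar columns, where in each row typically only one of those columns actually holds data, though there are occasional exceptions. Ideally, I would like a function that takes a list of columns to check: for any row with exactly one value among those columns, it copies that value into a single combined column and sets the source columns to NaN, so that I can easily drop the redundant columns at the end. If several of the columns hold data, the row should not be merged or changed.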
For example, I have this DataFrame:
df = pd.DataFrame({
    "id": pd.Series([1, 2, 3, 4, 5, 6, 7]),
    "a1": pd.Series(['a', np.nan, np.nan, 'c', 'd', np.nan, np.nan]),
    "a2": ([np.nan, 'b', 'c', np.nan, 'd', 'e', np.nan]),
    "a3": ([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'f'])
})
Code-wise, this is what I have so far:
import pandas as pd
import numpy as np

def test(row, index, combined):
    values = 0
    foundix = 0
    # check which, if any, of the columns has data
    for ix in index:
        if not pd.isnull(row[ix]):
            values = values + 1
            foundix = ix
    # if exactly one value was found, move it to the combined column and clean up
    if values == 1:
        row[combined] = row[foundix]
        for ix in index:
            row[ix] = np.nan
    return row

df["a"] = np.nan
# note: apply returns a new DataFrame; the result is not assigned back to df here
df.apply(lambda x: test(x, ["a1", "a2", "a3"], "a"), 1)
print(df)
The problem with my code is
My ideal output (mainly to help clean up the data and handle the odd cases) is:
    a1   a2   a3  id    a
0  NaN  NaN  NaN   1    a
1  NaN  NaN  NaN   2    b
2  NaN  NaN  NaN   3    c
3  NaN  NaN  NaN   4    c
4    d    d  NaN   5  NaN
5  NaN  NaN  NaN   6    e
6  NaN  NaN  NaN   7    f
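To illustrate the final cleanup step mentioned above (dropping the columns that end up empty), here is a minimal sketch. It rebuilds the desired output by hand, and the names `ideal` and `cleaned` are only illustrative. It assumes that "redundant" means a column containing nothing but NaN.

```python
import numpy as np
import pandas as pd

# The desired result from above, rebuilt by hand for illustration.
ideal = pd.DataFrame({
    "a1": [np.nan, np.nan, np.nan, np.nan, "d", np.nan, np.nan],
    "a2": [np.nan, np.nan, np.nan, np.nan, "d", np.nan, np.nan],
    "a3": [np.nan] * 7,
    "id": [1, 2, 3, 4, 5, 6, 7],
    "a": ["a", "b", "c", "c", np.nan, "e", "f"],
})

# Drop every column that holds only NaN. Here a3 is removed, while a1 and a2
# survive because the unmerged row (id 5) still has data in them.
cleaned = ideal.dropna(axis=1, how="all")
print(cleaned.columns.tolist())  # ['a1', 'a2', 'id', 'a']
```

Because the row with two values is left untouched, `a1` and `a2` remain visible and flag that this row still needs manual attention.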
Answer 0 (score: 0)
My approach seems to be somewhat faster (about 3.2 ms versus 7.1 ms on this sample):
In [415]:
df = pd.DataFrame({
    "id": pd.Series([1, 2, 3, 4, 5, 6, 7]),
    "a1": pd.Series(['a', np.nan, np.nan, 'c', 'd', np.nan, np.nan]),
    "a2": ([np.nan, 'b', 'c', np.nan, 'd', 'e', np.nan]),
    "a3": ([np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'f'])
})
df
Out[415]:
    a1   a2   a3  id
0    a  NaN  NaN   1
1  NaN    b  NaN   2
2  NaN    c  NaN   3
3    c  NaN  NaN   4
4    d    d  NaN   5
5  NaN    e  NaN   6
6  NaN  NaN    f   7
[7 rows x 4 columns]
In [416]:
import pandas as pd
import numpy as np

def gen_col(x):
    # keep the value only when exactly one of the columns is non-null;
    # rows with several values (or none at all) yield NaN
    vals = x.dropna()
    if len(vals) != 1:
        return np.nan
    return vals.iloc[0]

# the function from the question, repeated here for the timing comparison
def test(row, index, combined):
    values = 0
    foundix = 0
    # check which, if any, of the columns has data
    for ix in index:
        if not pd.isnull(row[ix]):
            values = values + 1
            foundix = ix
    # if exactly one value was found, move it to the combined column and clean up
    if values == 1:
        row[combined] = row[foundix]
        for ix in index:
            row[ix] = np.nan
    return row

%timeit df.apply(lambda x: test(x, ["a1", "a2", "a3"], "a"), 1)
%timeit df['a'] = df[['a1', 'a2', 'a3']].apply(gen_col, axis=1)
df
100 loops, best of 3: 7.08 ms per loop
100 loops, best of 3: 3.24 ms per loop
Out[416]:
    a1   a2   a3  id    a
0    a  NaN  NaN   1    a
1  NaN    b  NaN   2    b
2  NaN    c  NaN   3    c
3    c  NaN  NaN   4    c
4    d    d  NaN   5  NaN
5  NaN    e  NaN   6    e
6  NaN  NaN    f   7    f
[7 rows x 5 columns]
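The key thing I am doing here is checking the number of values left after dropping all the NaN values, which seems to be faster than your code. Note that this only fills the combined column; it does not set the source columns to NaN.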
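The same idea, counting the non-null values per row, can also be written without a row-wise `apply`. The following is a minimal sketch, not a benchmarked replacement: the helper name `combine_single_value_columns` is hypothetical (it is not a pandas function), and it assumes the goal is the output requested in the question, including setting the source columns to NaN on the merged rows.

```python
import numpy as np
import pandas as pd


def combine_single_value_columns(df, cols, combined):
    """Collapse `cols` into `combined` on rows holding exactly one non-null value."""
    out = df.copy()
    # Count the non-null entries per row across the candidate columns.
    single = out[cols].notna().sum(axis=1) == 1
    # Back-filling along each row pulls the first non-null value into the first column.
    first_value = out[cols].bfill(axis=1).iloc[:, 0]
    # Keep that value only where the row had exactly one; other rows get NaN.
    out[combined] = first_value.where(single)
    # Blank the source columns only on the rows that were collapsed.
    out.loc[single, cols] = np.nan
    return out


df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "a1": ["a", np.nan, np.nan, "c", "d", np.nan, np.nan],
    "a2": [np.nan, "b", "c", np.nan, "d", "e", np.nan],
    "a3": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, "f"],
})

result = combine_single_value_columns(df, ["a1", "a2", "a3"], "a")
print(result)
```

The function works on a copy, so the original `df` is left unchanged. The row with two values (id 5) keeps `d` in both `a1` and `a2` and gets NaN in `a`.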