I have a DataFrame like this:
phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
NaN                   NaN                   8546987
8316589               8751369               NaN
4569874               NaN                   2645981
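I want phone_number_1_clean to be populated as much as possible. That means moving values from phone_number_2_clean or phone_number_3_clean into phone_number_1_clean, and the same applies further along: if phone_number_2_clean is populated, then phone_number_1_clean should be populated wherever possible.
The output should look like this: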
phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
8546987               NaN                   NaN
8316589               8751369               NaN
4569874               2645981               NaN
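I could probably do this with np.where statements, but that could get messy.
The approach should preferably be vectorized, since it will be applied to large DataFrames.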
Answer 0 (score: 1)
Use:
# for each row, drop the NaNs and build a new Series; these become the rows of the final df
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
# the result may have fewer columns than the original df, so reindex to restore them
df1 = df1.reindex(columns=range(len(df.columns)))
# assign the original column names
df1.columns = df.columns
print (df1)
   phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
0               8546987                   NaN                   NaN
1               8316589               8751369                   NaN
2               4569874               2645981                   NaN
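To try the row-wise approach above end to end, a minimal sketch follows. It rebuilds the sample frame from the question; storing the numbers as floats is an assumption (NaN forces a float dtype here, while the real data may hold strings or objects), so the printed values may show a trailing `.0`.

```python
import numpy as np
import pandas as pd

# Sample data from the question. Float storage is an assumption made for this sketch.
df = pd.DataFrame({
    "phone_number_1_clean": [np.nan, 8316589, 4569874],
    "phone_number_2_clean": [np.nan, 8751369, np.nan],
    "phone_number_3_clean": [8546987, np.nan, 2645981],
})

# Row by row: drop the NaNs, so the remaining values are re-keyed by position 0, 1, ...
df1 = df.apply(lambda row: pd.Series(row.dropna().values), axis=1)

# Restore any columns that ended up empty, then put the original names back
df1 = df1.reindex(columns=range(len(df.columns)))
df1.columns = df.columns

print(df1)
```

Because `apply` with `axis=1` calls the lambda once per row, this is the easiest variant to read but not a vectorized one.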
Or:
s = df.stack()
s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]
df1 = s.unstack().reindex(columns=range(len(df.columns)))
df1.columns = df.columns
print (df1)
   phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
0               8546987                   NaN                   NaN
1               8316589               8751369                   NaN
2               4569874               2645981                   NaN
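The stack-based version is compact, so here is the same idea spelled out step by step as a sketch. The sample frame uses floats as an assumption, and the explicit `dropna()` and the `pd.MultiIndex.from_arrays` call are defensive additions that are not in the snippet above: whether `stack()` drops NaN on its own depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Sample data from the question (float storage is an assumption for this sketch)
df = pd.DataFrame({
    "phone_number_1_clean": [np.nan, 8316589, 4569874],
    "phone_number_2_clean": [np.nan, 8751369, np.nan],
    "phone_number_3_clean": [8546987, np.nan, 2645981],
})

# Long format: one entry per (row label, column name) pair, NaNs removed
s = df.stack().dropna()

# Number the surviving values within each original row: 0, 1, ...
position = s.groupby(level=0).cumcount()

# Swap the column-name level for that position, so values are keyed by (row, position)
s.index = pd.MultiIndex.from_arrays(
    [s.index.get_level_values(0), position.to_numpy()]
)

# Back to wide format: positions become columns, unused slots become NaN
df1 = s.unstack().reindex(columns=range(len(df.columns)))
df1.columns = df.columns

print(df1)
```

One thing to watch: a row made up entirely of NaN leaves no entries in `s`, so it would be missing from `df1`. Adding `index=df.index` to the `reindex` call should bring such rows back.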
Or use a slightly modified justify function:
import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as missing (pass np.nan to treat NaN as missing)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notnull(a)  # changed to pandas notnull
    else:
        mask = a != invalid_val
    # sorting booleans puts False first, so valid slots end up on the right/bottom
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

df = pd.DataFrame(justify(df.values, invalid_val=np.nan),
                  index=df.index, columns=df.columns)
print (df)
   phone_number_1_clean  phone_number_2_clean  phone_number_3_clean
0               8546987                   NaN                   NaN
1               8316589               8751369                   NaN
2               4569874               2645981                   NaN
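If only the left-justify case with NaN as the missing marker is needed, the core of the mask trick can be sketched in a few lines. `left_justify_nan` is a hypothetical helper name, not part of NumPy or pandas, and it assumes the values can be held in a float array (unlike the general function above, which returns an object array).

```python
import numpy as np
import pandas as pd

def left_justify_nan(values):
    """Push the non-NaN entries of a 2D float array to the left of each row."""
    mask = ~np.isnan(values)
    # Sorting booleans puts False first; reversing each row puts True first
    justified_mask = np.sort(mask, axis=1)[:, ::-1]
    out = np.full(values.shape, np.nan)
    # Both masks hold the same number of True values per row, so the
    # row-major assignment keeps every value in its own row
    out[justified_mask] = values[mask]
    return out

# Sample data from the question (float storage is an assumption for this sketch)
df = pd.DataFrame({
    "phone_number_1_clean": [np.nan, 8316589, 4569874],
    "phone_number_2_clean": [np.nan, 8751369, np.nan],
    "phone_number_3_clean": [8546987, np.nan, 2645981],
})

result = pd.DataFrame(left_justify_nan(df.to_numpy(dtype=float)),
                      index=df.index, columns=df.columns)
print(result)
```

The order of the values within each row is preserved, because boolean indexing reads and writes both arrays left to right.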
Performance:
#3k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [442]: %%timeit
...: df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
...: # the result may have fewer columns than the original df, so reindex to restore them
...: df1 = df1.reindex(columns=range(len(df.columns)))
...: #assign original columns names
...: df1.columns = df.columns
...:
1.17 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [443]: %%timeit
...: s = df.stack()
...: s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]
...:
...: df1 = s.unstack().reindex(columns=range(len(df.columns)))
...: df1.columns = df.columns
...:
...:
5.88 ms ± 74.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [444]: %%timeit
...: pd.DataFrame(justify(df.values, invalid_val=np.nan),
...:              index=df.index, columns=df.columns)
...:
941 µs ± 131 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
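The timings above come from an IPython session. To repeat the comparison in a plain script, something like the following sketch with the standard-library `timeit` module could be used. `via_stack` and `via_numpy` are hypothetical wrapper names, the float sample data is an assumption, and the numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

def via_stack(df):
    # stack / cumcount / unstack variant (dropna() added defensively)
    s = df.stack().dropna()
    pos = s.groupby(level=0).cumcount().to_numpy()
    s.index = pd.MultiIndex.from_arrays([s.index.get_level_values(0), pos])
    out = s.unstack().reindex(columns=range(len(df.columns)))
    out.columns = df.columns
    return out

def via_numpy(df):
    # mask-sorting variant, restricted to float data and left justification
    values = df.to_numpy(dtype=float)
    mask = ~np.isnan(values)
    justified_mask = np.sort(mask, axis=1)[:, ::-1]
    out = np.full(values.shape, np.nan)
    out[justified_mask] = values[mask]
    return pd.DataFrame(out, index=df.index, columns=df.columns)

small = pd.DataFrame({
    "phone_number_1_clean": [np.nan, 8316589, 4569874],
    "phone_number_2_clean": [np.nan, 8751369, np.nan],
    "phone_number_3_clean": [8546987, np.nan, 2645981],
})
big = pd.concat([small] * 1000, ignore_index=True)  # 3k rows, as above

timings = {}
for name, func in [("stack", via_stack), ("numpy", via_numpy)]:
    # best of 3 repeats, 5 calls each, reported as seconds per call
    runs = timeit.repeat(lambda f=func: f(big), repeat=3, number=5)
    timings[name] = min(runs) / 5
    print(f"{name}: {timings[name] * 1e3:.3f} ms per call")
```

Taking the minimum over the repeats is a common convention for reducing noise from other processes. IPython's `%%timeit` reports mean and standard deviation instead, so the two outputs are not directly comparable.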