I have a pandas DataFrame df that looks like this:
     0     1     2     3    4     5     6
0  3.0  74.0   NaN   NaN  NaN   NaN   NaN
1  4.0   2.0   NaN   NaN  NaN   NaN   NaN
2  NaN   NaN  -9.0   NaN  NaN   NaN   NaN
3  NaN   NaN   NaN  -1.0  2.0 -16.0 -21.0
4  NaN   NaN   1.0   NaN  NaN   NaN   NaN
5  NaN   NaN  28.0   NaN  NaN   NaN   NaN
I want to remove all the NaN values from it and shift the data in each row to the left, to get the following:
      0     1     2     3
0   3.0  74.0   NaN   NaN
1   4.0   2.0   NaN   NaN
2  -9.0   NaN   NaN   NaN
3  -1.0   2.0 -16.0 -21.0
4   1.0   NaN   NaN   NaN
5  28.0   NaN   NaN   NaN
Basically, I am trying to left-align the data in every row after dropping the NaNs. I am not sure how to go about this.
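For anyone who wants to reproduce the frame, here is a quick sketch (the values are read off the table above, so the exact NaN positions are an assumption based on that display):

```python
import numpy as np
import pandas as pd

# reconstruction of the frame shown above (values read off the table)
df = pd.DataFrame([
    [3,      74,     np.nan, np.nan, np.nan, np.nan, np.nan],
    [4,      2,      np.nan, np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, -9,     np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, -1,     2,      -16,    -21],
    [np.nan, np.nan, 1,      np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, 28,     np.nan, np.nan, np.nan, np.nan],
])
print(df)
```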
Answer 0 (score: 1)
First shift all non-missing values to the left with the justify helper below, then use DataFrame.dropna to remove the columns that are entirely NaN:
import numpy as np
import pandas as pd

arr = justify(df.to_numpy(), invalid_val=np.nan)
df = pd.DataFrame(arr).dropna(axis=1, how='all')
print(df)
      0     1     2     3
0   3.0  74.0   NaN   NaN
1   4.0   2.0   NaN   NaN
2  -9.0   NaN   NaN   NaN
3  -1.0   2.0 -16.0 -21.0
4   1.0   NaN   NaN   NaN
5  28.0   NaN   NaN   NaN
# https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
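To see the mask-sort trick in isolation, here is a minimal self-contained sketch of the same idea on a small array (re-derived inline rather than calling the function above, so it runs on its own):

```python
import numpy as np

a = np.array([[np.nan, 7.0, np.nan, 3.0],
              [1.0, np.nan, np.nan, np.nan]])

mask = ~np.isnan(a)                                  # True where a value is present
justified = np.flip(np.sort(mask, axis=1), axis=1)   # True entries pushed to the left
out = np.full(a.shape, np.nan)
out[justified] = a[mask]                             # fill the left-justified slots row by row
print(out)
# [[ 7.  3. nan nan]
#  [ 1. nan nan nan]]
```

Sorting a boolean mask moves the True entries to one end of each row; flipping it picks which end, which is exactly how justify chooses between 'left' and 'right'.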
Answer 1 (score: 1)
This solution takes the data into numpy territory, runs some computations with numpy isnan and numpy compress, creates individual DataFrames, and then glues them into one DataFrame with pandas concat:
import numpy as np
import pandas as pd
from io import StringIO

data = """ 0     1     2     3    4     5     6
3     74    None  None  None  None  None
4     2     None  None  None  None  None
None  None  -9    None  None  None  None
None  None  None  -1    2     -16   -21
None  None  1     None  None  None  None
None  None  28    None  None  None  None """
df = pd.read_csv(StringIO(data), sep=r'\s{2,}', engine='python', na_values=["None"])
df
     0     1     2     3    4     5     6
0  3.0  74.0   NaN   NaN  NaN   NaN   NaN
1  4.0   2.0   NaN   NaN  NaN   NaN   NaN
2  NaN   NaN  -9.0   NaN  NaN   NaN   NaN
3  NaN   NaN   NaN  -1.0  2.0 -16.0 -21.0
4  NaN   NaN   1.0   NaN  NaN   NaN   NaN
5  NaN   NaN  28.0   NaN  NaN   NaN   NaN
# convert to a numpy array
M = df.to_numpy()
# get True or False depending on the null status of each entry
condition = ~np.isnan(M)
# for each row, keep only the entries that are not null
step1 = [np.compress(ent, arr) for ent, arr in zip(condition, M)]
# concatenate the per-row frames into a single DataFrame
step2 = pd.concat([pd.DataFrame(ent).T for ent in step1], ignore_index=True)
print(step2)
      0     1     2     3
0   3.0  74.0   NaN   NaN
1   4.0   2.0   NaN   NaN
2  -9.0   NaN   NaN   NaN
3  -1.0   2.0 -16.0 -21.0
4   1.0   NaN   NaN   NaN
5  28.0   NaN   NaN   NaN
# alternatively, from step1 we could find the longest array and use that
# length to resize all the other arrays:
reshape = max(len(arr) for arr in step1)
# this happens in place; resize pads the shorter arrays with zeros
[arr.resize(reshape, refcheck=False) for arr in step1]
# turn the zero padding back into NaN -- note this would also blank out
# any genuine zeros in the data
outcome = pd.DataFrame(step1).where(lambda x: x.ne(0), np.nan)
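A shorter pandas-only route (a sketch of my own, not from either answer) drops the NaNs row by row and lets pandas realign the resulting Series; it avoids the zero-padding caveat above, at the cost of a slower Python-level apply:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[3, 74, np.nan, np.nan],
                   [np.nan, np.nan, -9, np.nan]])

# for each row, keep the non-NaN values and re-index them from 0;
# pandas pads the shorter rows with NaN when assembling the result
shifted = df.apply(lambda row: pd.Series(row.dropna().to_numpy()), axis=1)
print(shifted)
```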