Remove NaNs from a pandas DataFrame and reshape it

Posted: 2020-05-03 05:27:27

Tags: pandas python-3.8

I have a pandas DataFrame df that looks like this:

         0    1    2    3    4    5    6
    0    3   74  NaN  NaN  NaN  NaN  NaN
    1    4    2  NaN  NaN  NaN  NaN  NaN
    2  NaN  NaN   -9  NaN  NaN  NaN  NaN
    3  NaN  NaN  NaN   -1    2  -16  -21
    4  NaN  NaN    1  NaN  NaN  NaN  NaN
    5  NaN  NaN   28  NaN  NaN  NaN  NaN

I want to drop all the NaNs from the above and realign the data in each row to get the following:

    0   1    2   3
0   3   74      
1   4   2       
2   -9          
3   -1  2   -16  -21
4   1           
5   28  

Basically, after removing the NaNs I want all the data in each row to be left-aligned. I am not sure how to go about this.

2 answers:

Answer 0 (score: 1)

First left-justify all the non-missing values with the `justify` helper defined below, then use `DataFrame.dropna` to remove the columns that are all NaN:

import numpy as np
import pandas as pd

arr = justify(df.to_numpy(), invalid_val=np.nan)
df = pd.DataFrame(arr).dropna(axis=1, how='all')
print(df)
      0     1     2     3
0   3.0  74.0   NaN   NaN
1   4.0   2.0   NaN   NaN
2  -9.0   NaN   NaN   NaN
3  -1.0   2.0 -16.0 -21.0
4   1.0   NaN   NaN   NaN
5  28.0   NaN   NaN   NaN

# https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array.

    Parameters
    ----------
    a : ndarray
        Input array to be justified.
    invalid_val : scalar
        Value that marks the entries to be pushed aside (e.g. np.nan).
    axis : int
        Axis along which justification is to be made.
    side : str
        Direction of justification: 'left' or 'right' for axis=1,
        'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
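The trick `justify` relies on is worth seeing in miniature (my own standalone sketch, not part of the answer): sorting a boolean presence mask pushes the `True` entries to one side, flipping gives a left-justified mask, and row-major boolean assignment preserves the order of the values within each row:

```python
import numpy as np

a = np.array([[np.nan, 1.0, 2.0],
              [3.0, np.nan, 4.0]])

mask = ~np.isnan(a)                                  # True where a value is present
left_mask = np.flip(np.sort(mask, axis=1), axis=1)   # True entries pushed to the left
out = np.full(a.shape, np.nan)
out[left_mask] = a[mask]                             # row-major order keeps each row's values in sequence
print(out)
# [[ 1.  2. nan]
#  [ 3.  4. nan]]
```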

Answer 1 (score: 1)

This solution takes the data into numpy territory, runs some computations with numpy `isnan` and numpy `compress`, creates individual DataFrames, and then combines them into a single DataFrame with pandas `concat`:

from io import StringIO

import numpy as np
import pandas as pd

data = """    0    1    2   3    4     5    6
              3    74   None  None  None    None   None
              4    2    None  None  None    None   None
              None   None  -9   None  None    None   None
              None   None  None   -1   2    -16   -21
              None   None  1    None  None    None   None
              None   None  28   None  None    None   None"""

df = pd.read_csv(StringIO(data), sep=r'\s{2,}', engine='python', na_values=["None"])
df
df

     0     1     2     3    4      5      6
0  3.0  74.0   NaN   NaN  NaN    NaN    NaN
1  4.0   2.0   NaN   NaN  NaN    NaN    NaN
2  NaN   NaN  -9.0   NaN  NaN    NaN    NaN
3  NaN   NaN   NaN  -1.0  2.0  -16.0  -21.0
4  NaN   NaN   1.0   NaN  NaN    NaN    NaN
5  NaN   NaN  28.0   NaN  NaN    NaN    NaN

#convert to numpy array
M = df.to_numpy()

#get True or False depending on the null status of each entry
condition = ~np.isnan(M)

#for each array, get entries that are not null
step1 = [np.compress(ent,arr) for ent,arr in zip(condition,M)]
step1

#wrap each array in a one-row DataFrame, then concatenate them
step2 = pd.concat([pd.DataFrame(ent).T for ent in step1], ignore_index=True)
print(step2)

      0     1     2     3
0   3.0  74.0   NaN   NaN
1   4.0   2.0   NaN   NaN
2  -9.0   NaN   NaN   NaN
3  -1.0   2.0 -16.0 -21.0
4   1.0   NaN   NaN   NaN
5  28.0   NaN   NaN   NaN

#alternatively, from step1 we could find the longest array and use its length to resize all the other arrays:
reshape = max(len(arr) for arr in step1)
#resize happens in place and pads the tail with zeros
[arr.resize(reshape, refcheck=False) for arr in step1]
#turn the zero padding back into NaN (caveat: this would also blank out genuine zeros in the data)
outcome = pd.DataFrame(step1).where(lambda x: x.ne(0), np.nan)
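Put together, that resize variant can be run end to end. In this sketch the literal arrays stand in for what `step1` holds for the example data, so the snippet is self-contained; the zero-to-NaN caveat from above still applies:

```python
import numpy as np
import pandas as pd

# per-row arrays of non-null values, as np.compress produces for the example data
step1 = [np.array([3.0, 74.0]), np.array([4.0, 2.0]), np.array([-9.0]),
         np.array([-1.0, 2.0, -16.0, -21.0]), np.array([1.0]), np.array([28.0])]

width = max(len(arr) for arr in step1)
for arr in step1:
    arr.resize(width, refcheck=False)   # in place; pads the tail with zeros

# replace the zero padding with NaN (would also hit genuine zeros)
outcome = pd.DataFrame(step1).where(lambda x: x.ne(0), np.nan)
print(outcome)
```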