How to refactor simple DataFrame parsing code using Pandas

Time: 2016-11-21 12:51:18

Tags: python parsing pandas dataframe refactoring

I am using Pandas to parse a DataFrame that I created:

    # Initial DF
        A    B    C
    0  -1  qqq  XXX
    1  20  www  CCC
    2  30  eee  VVV
    3  -1  rrr  BBB
    4  50  ttt  NNN
    5  60  yyy  MMM
    6  70  uuu  LLL
    7  -1  iii  KKK
    8  -1  ooo  JJJ

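My goal is to analyze column A and apply the following conditions to the DataFrame:

  1. Examine every row.
  2. Determine whether df['A'].iloc[index] == -1.
  3. If true and index=0, mark the first row for removal.
  4. If true and index=N, mark the last row for removal.
  5. If 0<index<N and df['A'].iloc[index] == -1, and the previous or following row also contains -1 (df['A'].iloc[index+1] == -1 or df['A'].iloc[index-1] == -1), mark the row for removal; otherwise, replace -1 with the average of the previous and following values.
  6. The final DataFrame should look like this: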

    # Final DF    
        A    B    C
    0  20  www  CCC
    1  30  eee  VVV
    2  40  rrr  BBB
    3  50  ttt  NNN
    4  60  yyy  MMM
    5  70  uuu  LLL
    

I was able to achieve my goal by writing simple code that applies the conditions above:

    import pandas as pd

    # create dataframe
    data = {'A':[-1,20,30,-1,50,60,70,-1,-1],
            'B':['qqq','www','eee','rrr','ttt','yyy','uuu','iii','ooo'],
            'C':['XXX','CCC','VVV','BBB','NNN','MMM','LLL','KKK','JJJ']}
    df = pd.DataFrame(data)
    
    # If df['A'].iloc[index]==-1:
    #   - option 1: remove the row if it is the first or last row
    #   - option 2: remove the row if the previous or following row contains -1 (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
    #   - option 3: otherwise replace df['A'].iloc[index] with the average of the previous and following values
    N = len(df.index) # number of rows
    index_vect = []   # store indexes of rows to be deleted
    for index in range(0,N):
    
        # option 1 (first row)
        if index==0 and df['A'].iloc[index]==-1:
            index_vect.append(index)
        elif 0<index<N-1 and df['A'].iloc[index]==-1:
    
            # option 2
            if df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1:
                index_vect.append(index)
    
            # option 3
            else:
                df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)
    
        # option 1 (last row)
        elif index==N-1 and df['A'].iloc[index]==-1:
            index_vect.append(index)
    
    # remove rows to be deleted
    df = df.drop(index_vect).reset_index(drop = True)
    

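As you can see, the code is fairly long, and I would like to know whether you can suggest a smarter, more efficient way to get the same result. I also noticed that my code produces a warning caused by the line df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2). Do you know how I could improve that line?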
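
The warning is most likely pandas' `SettingWithCopyWarning`, which chained indexing such as `df['A'].iloc[index] = ...` tends to trigger: the assignment goes through an intermediate Series, so pandas cannot guarantee that it reaches `df`. A minimal sketch of one way around it, assuming the default integer index (so labels and positions coincide) and using a small made-up frame for illustration:

```python
import pandas as pd

# Tiny illustrative frame; row 1 holds the -1 placeholder.
df = pd.DataFrame({'A': [20, -1, 40]})

index = 1
# Address row and column in a single .loc call on df itself,
# instead of the chained df['A'].iloc[index] = ...
df.loc[index, 'A'] = int((df.loc[index + 1, 'A'] + df.loc[index - 1, 'A']) / 2)

print(df['A'].tolist())  # [20, 30, 40]
```

If the index is not the default one, the positional equivalent `df.iloc[index, df.columns.get_loc('A')]` should behave the same way.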

1 answer:

Answer 0 (score: 3)

Here is a solution:

import numpy as np

# Replace -1 with Not a Number (NaN)
df['A'] = df['A'].replace(-1, np.nan)

# If df.A is NaN and either the previous or next is also NaN, we don't select it
# This takes care of the condition on the first and last row too
df = df[~(df.A.isnull() & (df.A.shift(1).isnull() | df.A.shift(-1).isnull()))]

# Use interpolate to fill with the average of previous and next
df.A = df.A.interpolate(method='linear', limit=1)
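
To see why the filter covers the first and last rows without a special case, it can help to inspect the mask on its own. The sketch below rebuilds column A as a standalone Series; the names `a`, `prev_missing`, `next_missing` and `drop_mask` are illustrative and not part of the answer.

```python
import pandas as pd

# Column A from the question, with -1 turned into NaN.
a = pd.Series([-1, 20, 30, -1, 50, 60, 70, -1, -1], dtype=float)
a = a.mask(a == -1)

# shift(1) has no value to move into row 0, so it is NaN there;
# shift(-1) likewise leaves NaN in the last row.
prev_missing = a.shift(1).isnull()
next_missing = a.shift(-1).isnull()

# A row is dropped when it is NaN and at least one neighbor is missing.
drop_mask = a.isnull() & (prev_missing | next_missing)

print(drop_mask.tolist())
# [True, False, False, False, False, False, False, True, True]
```

Rows 0, 7 and 8 are flagged, while row 3 is kept because both of its neighbors hold real values.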

Here is the resulting df:
    A       B       C
1   20.0    www     CCC
2   30.0    eee     VVV
3   40.0    rrr     BBB
4   50.0    ttt     NNN
5   60.0    yyy     MMM
6   70.0    uuu     LLL

You can then reset the index if needed.
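
For reference, here is a self-contained sketch of the same approach that also resets the index and casts A back to integers, so the output matches the final DataFrame in the question. The names `a`, `keep` and `result` are illustrative, and the code works on a separate Series and an explicit copy, which is one way (an assumption, not the answer's own code) to avoid assigning into a filtered frame.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [-1, 20, 30, -1, 50, 60, 70, -1, -1],
    'B': ['qqq', 'www', 'eee', 'rrr', 'ttt', 'yyy', 'uuu', 'iii', 'ooo'],
    'C': ['XXX', 'CCC', 'VVV', 'BBB', 'NNN', 'MMM', 'LLL', 'KKK', 'JJJ'],
})

# Work on a float copy of A where -1 becomes NaN.
a = df['A'].astype(float).mask(df['A'] == -1)

# Keep a row unless it is NaN and has a missing neighbor
# (the first and last rows always have one missing neighbor).
keep = ~(a.isnull() & (a.shift(1).isnull() | a.shift(-1).isnull()))

result = df.loc[keep].copy()
# Remaining NaNs are isolated, so linear interpolation gives the
# average of the previous and next values.
result['A'] = a[keep].interpolate(method='linear', limit=1).astype(int)
result = result.reset_index(drop=True)

print(result)
```

Note that `astype(int)` truncates a non-integer average toward zero, just as `int(...)` does in the question's loop.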