Get the trend/streak in each row of a Pandas DataFrame

Date: 2016-02-10 01:47:16

Tags: python numpy pandas

I have a Pandas DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0],
                   ['D', -0.1, -1.0, -4.0, -3.3, -1.0],
                   ['E', np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['F', 4.0, np.nan, np.nan, np.nan, np.nan]
                  ], columns=['Group', '1', '2', '3', '4', '5'])


  Group    1    2    3    4    5  
0     A  0.1  2.0  1.0  0.5  0.3  
1     B -0.3 -0.4  0.1  0.2 -1.0  
2     C  0.1 -1.0  4.0 -3.3  1.0  
3     D -0.1 -1.0 -4.0 -3.3 -1.0  
4     E  NaN  NaN  NaN  NaN  NaN  
5     F  4.0  NaN  NaN  NaN  NaN  

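For each row, I want to return the streak of consecutive positive or negative values, counted from left to right. So the final DataFrame should be: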

  Group    1    2    3    4    5  Streak  
0     A  0.1  2.0  1.0  0.5  0.3       5   
1     B -0.3 -0.4  0.1  0.2 -1.0      -2   
2     C  0.1 -1.0  4.0 -3.3  1.0       1   
3     D -0.1 -1.0 -4.0 -3.3 -1.0      -5   
4     E  NaN  NaN  NaN  NaN  NaN       0    
5     F  4.0  NaN  NaN  NaN  NaN       1 

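The first row has a streak of +5 because every value from left to right is positive. The second row has a streak of -2 because the first two columns are negative and the streak ends at the positive value in column 3. The third row has a streak of +1 because the second column has the opposite sign from the first column. Row E (index 4) is all NaN, so its streak is zero.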
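The rule described above can be sketched in plain Python for a single row. This is only an illustration of the expected behaviour, not an efficient solution. The helper name `leading_streak` is made up for this example, and treating a leading zero or NaN as "no streak" is an assumption.

```python
import math

def leading_streak(values):
    """Signed length of the run of same-sign values at the start of `values`."""
    def sign(x):
        # Assumption: NaN and 0 carry no sign and therefore end a streak.
        if math.isnan(x) or x == 0:
            return 0
        return 1 if x > 0 else -1

    if not values:
        return 0
    first = sign(values[0])
    if first == 0:
        return 0
    count = 0
    for v in values:
        if sign(v) != first:
            break
        count += 1
    return first * count

nan = float('nan')
print(leading_streak([0.1, 2.0, 1.0, 0.5, 0.3]))    # 5
print(leading_streak([-0.3, -0.4, 0.1, 0.2, -1.0]))  # -2
print(leading_streak([nan, nan, nan, nan, nan]))     # 0
print(leading_streak([4.0, nan, nan, nan, nan]))     # 1
```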

3 answers:

Answer 0 (score: 0)

This is a bit verbose, but it seems to do everything you need:

def streak(row):

    cols = row.keys()    
    n_cols = len(cols)

    neg_streak = 0
    pos_streak = 0
    i_neg_streak = n_cols
    i_pos_streak = n_cols

    # Try every start column, including the last one
    for icol_1 in range(n_cols):
        for icol_2 in range(icol_1, n_cols):
            # .iloc replaces the removed .ix indexer for positional slicing
            if (row.iloc[icol_1: icol_2 + 1] < 0).all():
                streak = icol_1 - icol_2 - 1
                if streak < neg_streak:
                    neg_streak = streak
                    i_neg_streak = icol_1
            elif (row.iloc[icol_1: icol_2 + 1] > 0).all():
                streak = 1 + icol_2 - icol_1
                if streak > pos_streak:
                    pos_streak = streak
                    i_pos_streak = icol_1

    if pos_streak == abs(neg_streak):
        if i_pos_streak < i_neg_streak:
            return pos_streak
        else:
            return neg_streak
    elif pos_streak > abs(neg_streak):
        return pos_streak
    else:
        return neg_streak

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0]
                   ], columns=['Group', '1', '2', '3', '4', '5'])

df = df.set_index('Group')
df['Streak'] = df.apply(lambda row: streak(row), axis = 1)
df = df.reset_index()

print(df)

Answer 1 (score: 0)

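I am assuming you want the longest streak. I can't make any promises about how ties are handled. This answer uses itertools.groupby. First, a look under the hood at what groupby does: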

In [4]: from itertools import groupby
        b = [-0.3, -0.4, 0.1, 0.2, -1.0]
        for k, g in groupby(b, key=lambda x: x > 0.0):
            print(k, list(g))

False [-0.3, -0.4]
True [0.1, 0.2]
False [-1.0]

Now wrap this in a function that takes advantage of the grouping:

def streak(dfrow):
    longest = 0
    # Key each value as True (positive), False (negative) or NaN (zero or missing)
    for k, g in groupby(dfrow, key=lambda x: False if x < 0 else True if x > 0 else np.nan):
        cur_streak = len(list(g))
        if np.isnan(k):  # skip runs of NaN/zero
            continue
        if cur_streak > abs(longest):  # strict '>' keeps the earliest run on ties
            longest = cur_streak if k else -cur_streak  # negative runs get a minus sign
    return longest

Apply the function to each row with df.apply:

In [6]: df.set_index('Group',inplace=True)
        df['LongestStreak'] = df.apply(streak, axis=1)

Result:

In [281]: df
Out[281]:
         1    2    3    4    5  LongestStreak
Group
A      0.1  2.0  1.0  0.5  0.3              5
B     -0.3 -0.4  0.1  0.2 -1.0             -2
C      0.1 -1.0  4.0 -3.3  1.0              1
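Note that a longest-streak function and the leading streak asked for in the question agree on the sample rows, but they can differ in general. A minimal sketch of a variant that only looks at the first group produced by groupby follows. The names `sign_key` and `first_streak` are hypothetical, and counting zeros together with NaN as "no sign" is an assumption.

```python
from itertools import groupby
import math

def sign_key(x):
    # Hypothetical helper: 1 for positive, -1 for negative, 0 for NaN or zero.
    if math.isnan(x) or x == 0:
        return 0
    return 1 if x > 0 else -1

def first_streak(row):
    # groupby yields runs in order, so the first run is the leading streak.
    for key, group in groupby(row, key=sign_key):
        return key * len(list(group))
    return 0  # empty row

print(first_streak([-0.3, -0.4, 0.1, 0.2, -1.0]))  # -2
print(first_streak([0.1, -1.0, -2.0, -3.0, 1.0]))  # 1 (the longest streak would be -3)
```

With 'Group' set as the index, this could be applied the same way, for example `df.apply(first_streak, axis=1)`.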

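Edit

Updated to handle your new DataFrame, and added benchmarks. Your approach may scale better, but I don't know how to modify the code to generate the results.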

Results:

%%timeit
df['LongestStreak'] = df.apply(streak, axis=1)

1000 loops, best of 3: 473 µs per loop


%%timeit
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:]
false_col = np.zeros((a.shape[0], 1), dtype=bool)  # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)
df['Streak'] = np.argmin(diff, axis=1) + 1
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)

100 loops, best of 3: 2.94 ms per loop

Answer 2 (score: 0)

This does it, and is more intuitive/vectorized:

a = (df[['1', '2', '3', '4', '5']] >= 0).values  # Get True/False values
diff = a[:, :-1] == a[:, 1:]  # Compare values from neighboring columns

So diff looks like this (output shown for rows A through D):

[[ True  True  True  True]
 [ True False  True False]
 [False False False False]
 [ True  True  True  True]]

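Then append a column of False so that every row contains at least one False (output again shown for rows A through D):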

false_col = np.zeros((a.shape[0], 1), dtype=bool)  # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)  # Add False column to end of diff

[[ True  True  True  True False]
 [ True False  True False False]
 [False False False False False]
 [ True  True  True  True False]]

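Next, we find the streak length by locating the first occurrence of False in each row. np.argmin returns the index of the first minimum, and False sorts below True: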

df['Streak'] = np.argmin(diff, axis=1) + 1  # Add 1 to the index to get the streak
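As a small standalone check of the argmin step, the sketch below applies it to a hand-written boolean array. The three rows are illustrative stand-ins for padded diff rows, not output taken from the DataFrame.

```python
import numpy as np

# Each row ends with the padding False column.
diff = np.array([[True, True, True, True, False],
                 [True, False, True, False, False],
                 [False, False, False, False, False]])

# For booleans, the minimum is False, so argmin gives the index of the first False.
first_false = np.argmin(diff, axis=1)
print(first_false.tolist())        # [4, 1, 0]
print((first_false + 1).tolist())  # [5, 2, 1]
```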

Finally, we adjust the sign of the streak according to the sign of the first column:

df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)

The final DataFrame looks like this:

  Group    1    2    3    4    5  Streak  
0     A  0.1  2.0  1.0  0.5  0.3       5  
1     B -0.3 -0.4  0.1  0.2 -1.0      -2  
2     C  0.1 -1.0  4.0 -3.3  1.0       1  
3     D -0.1 -1.0 -4.0 -3.3 -1.0      -5  
4     E  NaN  NaN  NaN  NaN  NaN       0  
5     F  4.0  NaN  NaN  NaN  NaN       1
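One caveat worth noting: `>= 0` maps NaN to False, so a NaN in the middle of a row compares equal to a neighbouring negative value and can extend a negative streak. Below is a sketch of a NaN-aware variant that compares every column's sign with the first column's sign instead. It assumes a plain NumPy array standing in for `df[['1', '2', '3', '4', '5']].values`, and the last row is an extra illustration that is not part of the question's data.

```python
import numpy as np

vals = np.array([[0.1, 2.0, 1.0, 0.5, 0.3],
                 [-0.3, -0.4, 0.1, 0.2, -1.0],
                 [0.1, -1.0, 4.0, -3.3, 1.0],
                 [-0.1, -1.0, -4.0, -3.3, -1.0],
                 [np.nan, np.nan, np.nan, np.nan, np.nan],
                 [4.0, np.nan, np.nan, np.nan, np.nan],
                 [-2.0, np.nan, -1.0, -1.0, -1.0]])  # extra row (assumption)

signs = np.nan_to_num(np.sign(vals))  # 1, -1, or 0 (NaN becomes 0)
same = signs == signs[:, [0]]         # True where the sign matches column 1
pad = np.zeros((len(vals), 1), dtype=bool)
same = np.concatenate((same, pad), axis=1)  # guarantee a False in every row
length = np.argmin(same, axis=1)      # index of first False = length of leading run
streak = (length * signs[:, 0]).astype(int)
print(streak.tolist())  # [5, -2, 1, -5, 0, 1, -1]
```

For the extra row, the `>= 0` comparison would report -5, because the NaN in column 2 is treated like a negative value.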