How do I create a new column with a conditional cumulative sum using pandas?

Time: 2017-03-15 09:25:03

Tags: python pandas numpy

The following code creates a random DataFrame whose values are -1, 0 or 1:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-1, 2, size=(100, 1)), columns=['val'])

print(df['val'].value_counts())

Let's see what it contains:

-1    36
 0    35
 1    29
Name: val, dtype: int64

I am then trying to create a new column named mysum, a conditional cumulative sum that follows these rules:

  • If val = 1 and mysum >= 0, then mysum = mysum + 1.
  • If val = 1 and mysum < 0, then mysum = mysum + 2.
  • If val = -1 and mysum <= 0, then mysum = mysum - 1.
  • If val = -1 and mysum > 0, then mysum = mysum - 2.
  • If val = 0 and mysum < 0, then mysum = mysum + 1.
  • If val = 0 and mysum > 0, then mysum = mysum - 1.
  • If val = 0 and mysum = 0, then mysum = mysum.
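
To make the rules concrete, here is a minimal sketch that applies them by hand to a short, fixed sequence. The function name step and the sample values are illustrative assumptions, not part of the original question; mysum is assumed to start at 0.

```python
def step(mysum, val):
    """Apply one update of the rules and return the new running total."""
    if val == 1:
        return mysum + 1 if mysum >= 0 else mysum + 2
    if val == -1:
        return mysum - 1 if mysum <= 0 else mysum - 2
    # val == 0: move one unit toward zero, or stay at zero
    if mysum > 0:
        return mysum - 1
    if mysum < 0:
        return mysum + 1
    return mysum


vals = [1, 1, 0, -1, -1, 1]  # hypothetical sample input
running = 0                  # assumed starting value of mysum
trace = []
for v in vals:
    running = step(running, v)
    trace.append(running)

print(trace)  # [1, 2, 1, -1, -2, 0]
```

Each new value depends on the previous result, not only on the previous inputs, which is why a plain cumsum() cannot express it.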

So I am afraid it will not be as simple as:

df['mysum'] = df['val'].cumsum()

So I tried the following:

df['mysum'] = 0

df['mysum'] = np.where((df['val'] == 1) & (df['mysum'].cumsum() >= 0), (df['mysum'].cumsum() + 1), df['mysum'].cumsum())
df['mysum'] = np.where((df['val'] == 1) & (df['mysum'].cumsum() < 0), (df['mysum'].cumsum() + 2), df['mysum'].cumsum())

df['mysum'] = np.where((df['val'] == -1) & (df['mysum'].cumsum() <= 0), (df['mysum'].cumsum() - 1), df['mysum'].cumsum())
df['mysum'] = np.where((df['val'] == -1) & (df['mysum'].cumsum() > 0), (df['mysum'].cumsum() - 2), df['mysum'].cumsum())

df['mysum'] = np.where((df['val'] == 0) & (df['mysum'].cumsum() > 0), (df['mysum'].cumsum() - 1), df['mysum'].cumsum())
df['mysum'] = np.where((df['val'] == 0) & (df['mysum'].cumsum() < 0), (df['mysum'].cumsum() + 1), df['mysum'].cumsum())


print(df['mysum'].value_counts())
print(df)

But the mysum column does not accumulate!

Here is a fiddle where you can try it: https://repl.it/FaXZ/8

2 answers:

Answer 0: (score: 1)

There may be a more concise solution, but you can iterate over the DataFrame and set the values according to your conditions.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-1, 2, size=(100, 1)), columns=['val'])

df['mysum'] = 0

# relies on the default RangeIndex, so each index label is also the row position
for index, row in df.iterrows():

    # get the current value of mysum = mysum one row above current index
    # mysum at beginning is 0
    mysum = df.iat[index - 1, 1] if index > 0 else 0

    # set values at current index according to conditions
    # (df.iat replaces get_value/set_value, which were removed in pandas 1.0)
    val = row['val']
    if val == 0 and mysum < 0:
        df.iat[index, 1] = mysum + 1
    if val == 1 and mysum < 0:
        df.iat[index, 1] = mysum + 2
    if val == -1 and mysum <= 0:
        df.iat[index, 1] = mysum - 1
    if val == 0 and mysum > 0:
        df.iat[index, 1] = mysum - 1
    if val == -1 and mysum > 0:
        df.iat[index, 1] = mysum - 2
    if val == 1 and mysum >= 0:
        df.iat[index, 1] = mysum + 1
    if val == 0 and mysum == 0:
        df.iat[index, 1] = mysum

print(df)
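
A variant of the same loop, sketched under the assumption that the column fits comfortably in memory: iterate over the underlying NumPy array and assign the whole column once, instead of writing into the DataFrame row by row. Since iterrows() builds a Series for every row, this will typically be faster, though it is still a Python-level loop. The helper name conditional_cumsum and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def conditional_cumsum(vals):
    """Return the running totals for a 1-D sequence of -1/0/1 values."""
    out = np.empty(len(vals), dtype=np.int64)
    total = 0  # assumed starting value of mysum
    for i, v in enumerate(vals):
        if v == 1:
            total += 1 if total >= 0 else 2
        elif v == -1:
            total -= 1 if total <= 0 else 2
        elif total > 0:   # v == 0: step toward zero
            total -= 1
        elif total < 0:
            total += 1
        out[i] = total
    return out


df = pd.DataFrame({'val': [1, 1, 0, -1, -1, 1]})  # hypothetical sample input
df['mysum'] = conditional_cumsum(df['val'].to_numpy())
print(df)
```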

Answer 1: (score: 0)

A more efficient solution; see also "generalized cumulative functions in NumPy/SciPy?"

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(-1, 2, size=(100, 1)), columns=['val'])
def my_sum(acc,x):
    if x == 0 and acc < 0:
        return acc + 1
    if x == 1 and acc < 0:
        return acc + 2
    if x == -1 and acc <= 0:
        return acc - 1
    if x == 0 and acc > 0:
        return acc - 1
    if x == -1 and acc > 0:
        return acc - 2
    if x == 1 and acc >= 0:
        return acc + 1
    if x == 0 and acc == 0:
        return acc
u_my_sum = np.frompyfunc(my_sum, 2, 1)
# np.object was removed in NumPy 1.24; the builtin object is the equivalent dtype
df['mysum'] = u_my_sum.accumulate(df['val'].to_numpy(), dtype=object).astype(np.int64)
print(df)
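
Note that accumulate uses the first element of val as the initial accumulator rather than applying my_sum to it. That happens to be correct here, because starting from 0 the rules map -1, 0 and 1 to themselves. If the starting value should be explicit, one possible alternative is the standard library's itertools.accumulate with its initial argument (Python 3.8 or later). This is a sketch on hypothetical sample values, with my_sum condensed but intended to behave the same as above:

```python
from itertools import accumulate


def my_sum(acc, x):
    """One step of the rules: acc is the running total, x the next value."""
    if x == 1:
        return acc + 1 if acc >= 0 else acc + 2
    if x == -1:
        return acc - 1 if acc <= 0 else acc - 2
    # x == 0: move one unit toward zero, or stay at zero
    return acc - 1 if acc > 0 else acc + 1 if acc < 0 else acc


vals = [1, 1, 0, -1, -1, 1]  # hypothetical sample input

# initial=0 seeds the accumulator; the seed is also emitted first, so drop it
result = list(accumulate(vals, my_sum, initial=0))[1:]
print(result)  # [1, 2, 1, -1, -2, 0]
```

The resulting list has the same length as the input, so it can be assigned to a DataFrame column directly.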