Python DataFrame group by multiple columns with a conditional sum

Time: 2017-02-20 17:03:11

Tags: python pandas dataframe group-by

I have a df that looks like this:

col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2
 B        0      2-1-2015     2-5-2014       0.33
 A        0      3-1-2013     3-9-2011       0.1
 A        1      1-1-2014     4-9-2011       1.7
 A        1      31-12-2014   4-9-2014       1.9

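I am grouping the df by col1 and col2, and for each member of each group I want to sum the target values of only those other group members whose now date is earlier than the current member's previous date.

For example, for: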

col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2

I want to sum the target values of:

col1    col2       now        previous      target
 A        0      3-1-2013     3-9-2011       0.1
 A        1      1-1-2014     4-9-2011       1.7

to end up with:

col1    col2       now        previous      target    sum
 A        1      1-1-2015     4-1-2014       0.2      1.8

1 Answer:

Answer 0: (score: 0)

Interesting problem. I have something that I think works, although its time complexity is slow: worst case O(n**3), best case O(n**2).

Setting up the data

import io

import numpy as np
import pandas as pd

datastring = io.StringIO(
"""
col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2
 B        0      2-1-2015     2-5-2014       0.33
 A        0      3-1-2013     3-9-2011       0.1
 A        1      1-1-2014     4-9-2011       1.7
 A        1      31-12-2014   4-9-2014       1.9
 C        1      31-12-2014   4-9-2014       1.9
""")
# arguments for pandas.read_csv
kwargs = {
    "sep": r"\s+", # specifies that it's a whitespace-separated file
    "parse_dates": [2, 3], # parse "now" and "previous" as dates
    "dayfirst": True, # assumption: dates are day-first, as 31-12-2014 suggests
    }
# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)

Pseudocode for the algorithm

For each row:
    For each *other* row:
        If "now" of *other* row comes before "previous" of row
            Then add *other* row's "target" to "sum" of row

Running the algorithm

First set up a function f() that will be applied to every group computed by df.groupby(["col1","col2"]). All f() does is try to implement the pseudocode above.

def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy object
    data = df[["now","previous","target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip iteration if outer loop row is equal to the inner loop row
            if i == j:
                continue
            # get the dates from rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer loop is greater than
            # the "now" datetime of the inner loop, then add "target" to
            # the cumulative sum
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # return a copy with the new "sum" column instead of mutating the group
    return df.assign(sum=_sum)

Now just apply f() to the grouped data.

done = df.groupby(["col1","col2"]).apply(f)

Output

done holds each group's rows with the new "sum" column added.
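As a possible alternative to the nested loops, the same per-group comparison can be sketched with NumPy broadcasting. This is only a sketch under stated assumptions: the helper name conditional_sums is hypothetical (it is not part of the answer above), the dates are assumed to be day-first, and the extra C row is left out so the data matches the question.

```python
import io

import numpy as np
import pandas as pd

raw = io.StringIO("""
col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2
 B        0      2-1-2015     2-5-2014       0.33
 A        0      3-1-2013     3-9-2011       0.1
 A        1      1-1-2014     4-9-2011       1.7
 A        1      31-12-2014   4-9-2014       1.9
""")
df = pd.read_csv(raw, sep=r"\s+")
# Assumption: dates are day-first (31-12-2014 cannot be month-first).
for col in ("now", "previous"):
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")


def conditional_sums(now, previous, target):
    """Hypothetical helper: for each row i, sum target[j] over the
    other rows j whose now[j] is earlier than previous[i]."""
    # mask[i, j] is True when row j's "now" precedes row i's "previous"
    mask = now[None, :] < previous[:, None]
    # a row never contributes to its own sum
    np.fill_diagonal(mask, False)
    return np.where(mask, target[None, :], 0.0).sum(axis=1)


df["sum"] = 0.0
# .groups maps each (col1, col2) key to the index labels of its rows
for _, idx in df.groupby(["col1", "col2"]).groups.items():
    sub = df.loc[idx]
    df.loc[idx, "sum"] = conditional_sums(
        sub["now"].to_numpy(),
        sub["previous"].to_numpy(),
        sub["target"].to_numpy(),
    )

print(df)
```

This still does quadratic work within each group, but the comparisons run inside NumPy rather than in Python loops. Note that when grouping on both col1 and col2, the first row gets 1.7; the 1.8 shown in the question only results if the A/0 row is counted as well, which corresponds to grouping on col1 alone.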