I have a df that looks like this:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
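I am grouping df by col1 and col2. For each member of each group, I want to sum the target values of only those other group members whose now date is earlier than the current member's previous date.
For example, for: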
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
I want to sum the target values of:
col1 col2 now previous target
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
to end up with:
col1 col2 now previous target sum
A 1 1-1-2015 4-1-2014 0.2 1.8
Answer 0 (score: 0)
Interesting problem. I have an approach that I think works, although it is slow, with the following time complexity:
Best case: O(n**2)
Worst case: O(n**3)
import pandas as pd
import numpy as np
import io
datastring = io.StringIO(
"""
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
C 1 31-12-2014 4-9-2014 1.9
""")
# arguments for pandas.read_csv
kwargs = {
    "sep": r"\s+",  # the file is whitespace-separated
    "parse_dates": [2, 3],  # parse "now" and "previous" as dates
    "dayfirst": True,  # assumption: dates are day-month-year (e.g. 31-12-2014)
}
# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)
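A small check on date parsing, since the comparison between "now" and "previous" depends on it. This sketch assumes the dates in the question are written day-month-year (suggested by 31-12-2014); the variable names are illustrative only.

```python
import pandas as pd

# "4-1-2014" is ambiguous: pandas reads it month-first unless told otherwise.
month_first = pd.to_datetime("4-1-2014")
day_first = pd.to_datetime("4-1-2014", dayfirst=True)

# An explicit format removes the ambiguity entirely.
explicit = pd.to_datetime("31-12-2014", format="%d-%m-%Y")

print(month_first.date(), day_first.date(), explicit.date())
```

If the dates are in fact month-first, the day-first assumption should be dropped.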
The idea, in pseudocode:

For each row:
    For each *other* row:
        If "now" of *other* row comes before "previous" of row
            Then add *other* row's "target" to "sum" of row

First, set up a function f() that will be applied to every group computed by df.groupby(["col1","col2"]). All f() does is implement the pseudocode above.

def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy object
    data = df[["now", "previous", "target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip the iteration when the inner row is the outer row itself
            if i == j:
                continue
            # get the dates from the rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer row is later than
            # the "now" datetime of the inner row, add the inner row's
            # "target" to the running sum for the outer row
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # add a new column for the "sum" that we calculated
    df["sum"] = _sum
    return df

Now just apply f() to the grouped data.
done = df.groupby(["col1","col2"]).apply(f)
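For comparison, the inner double loop can be replaced with NumPy broadcasting. This is only a sketch, not part of the approach above: the helper name earlier_target_sum is hypothetical, the dates are assumed to be day-month-year, and the groups are iterated explicitly rather than passed through apply.

```python
import io

import numpy as np
import pandas as pd

raw = io.StringIO("""
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
""")
df = pd.read_csv(raw, sep=r"\s+")
# Assumption: dates are day-month-year, so parse them with an explicit format.
for col in ("now", "previous"):
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")


def earlier_target_sum(group):
    """Hypothetical helper: per-row sum of other rows' target with now < previous."""
    now = group["now"].to_numpy()
    prev = group["previous"].to_numpy()
    target = group["target"].to_numpy()
    # mask[i, j] is True when row j's "now" is strictly before row i's "previous"
    mask = now[None, :] < prev[:, None]
    # a row never contributes to its own sum
    np.fill_diagonal(mask, False)
    return pd.Series(mask @ target, index=group.index)


df["sum"] = 0.0
for _, group in df.groupby(["col1", "col2"]):
    df.loc[group.index, "sum"] = earlier_target_sum(group)

print(df)
```

The comparison matrix still costs O(n**2) time and memory per group, so this only moves the loop into NumPy. Note that when grouping by both col1 and col2, the first row gets 1.7; the 1.8 shown in the question also counts the (A, 0) row, which corresponds to grouping by col1 alone.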