I have some data like this:
import pandas as pd

# Simulate some data
d = {
    "id": [1,1,1,1,1,2,2,2,2],
    "action_order": [1,2,3,4,5,1,2,3,4],
    "n_actions": [5,5,5,5,5,4,4,4,4],
    "seed": ['1','2','3','4','5','10','11','12','13'],
    "time_spent": [0.3,0.4,0.5,0.6,0.7,10.1,11.1,12.1,13.1]
}
data = pd.DataFrame(d)
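I need a function that returns, for each row, a dictionary built from two columns (seed and time_spent) of that row and of all previous rows in the same group. I tried using apply as follows, but the result is not quite what I need.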
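# Attempt 1 (column names changed to match the sample data above):
# this yields one dict per group, not one per row
data \
    .groupby("id")[["seed", "time_spent"]] \
    .apply(lambda x: dict(zip(x["seed"], x["time_spent"]))) \
    .tolist()
# Attempt 2: same idea, still one dict per group
data \
    .groupby("id")[["seed", "time_spent", "action_order"]] \
    .apply(lambda x: dict(zip(list(x["seed"]), list(x["time_spent"]))))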
The new DataFrame should look like this:
id new_col
0 1 {u'1': 0.3}
1 1 {u'1': 0.3, u'2': 0.4}
2 1 {u'1': 0.3, u'3': 0.5, u'2': 0.4}
...
Answer 0 (score: 0)
You can keep a running dict for each group and apply a function row by row that updates it and returns a copy of the latest version:
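def wrapper(g):
    # Each group starts with its own empty running dict
    cumdict = {}
    return g.apply(update_cumdict, args=(cumdict,), axis=1)

def update_cumdict(row, cd):
    # Record this row, then return a copy so later updates leave earlier rows unchanged
    cd[row.seed] = row.time_spent
    return cd.copy()

data["new_col"] = data.groupby("id").apply(wrapper).reset_index()[0]
data.new_col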
0 {'1': 0.3}
1 {'1': 0.3, '2': 0.4}
2 {'1': 0.3, '2': 0.4, '3': 0.5}
3 {'1': 0.3, '2': 0.4, '3': 0.5, '4': 0.6}
4 {'1': 0.3, '2': 0.4, '3': 0.5, '4': 0.6, '5': ...
5 {'10': 10.1}
6 {'10': 10.1, '11': 11.1}
7 {'10': 10.1, '11': 11.1, '12': 12.1}
8 {'10': 10.1, '11': 11.1, '12': 12.1, '13': 13.1}
Name: new_col, dtype: object
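The same running-dict idea can also be written as an explicit loop over the groups. The following is only a sketch under the sample data's column names; `running_dicts` is a hypothetical helper name, not something from pandas. Each per-group Series keeps the group's original index, so the assignment back to `data` aligns by label instead of by position.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "seed": ["1", "2", "3", "4", "5", "10", "11", "12", "13"],
    "time_spent": [0.3, 0.4, 0.5, 0.6, 0.7, 10.1, 11.1, 12.1, 13.1],
})

def running_dicts(group):
    """Hypothetical helper: one cumulative {seed: time_spent} dict per row."""
    current = {}
    out = []
    for seed, spent in zip(group["seed"], group["time_spent"]):
        current[seed] = spent
        out.append(dict(current))  # snapshot, so later updates do not leak back
    # Reuse the group's own index so the result lines up with the original rows
    return pd.Series(out, index=group.index)

pieces = [running_dicts(g) for _, g in data.groupby("id")]
data["new_col"] = pd.concat(pieces)
```

Because the result is aligned on the index, this variant should not depend on the rows already being sorted by `id`.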
Answer 1 (score: 0)
How about this?
In [15]: data.groupby(['id']).apply(lambda d: pd.Series(np.arange(len(d))).apply(lambda x: d[['seed', 'time_spent']].iloc[:x+1].to_dict()))
Out[15]:
id
1 0 {'seed': {0: '1'}, 'time_spent': {0: 0.3}}
1 {'seed': {0: '1', 1: '2'}, 'time_spent': {0: 0...
2 {'seed': {0: '1', 1: '2', 2: '3'}, 'time_spent...
3 {'seed': {0: '1', 1: '2', 2: '3', 3: '4'}, 'ti...
4 {'seed': {0: '1', 1: '2', 2: '3', 3: '4', 4: '...
2 0 {'seed': {5: '10'}, 'time_spent': {5: 10.1}}
1 {'seed': {5: '10', 6: '11'}, 'time_spent': {5:...
2 {'seed': {5: '10', 6: '11', 7: '12'}, 'time_sp...
3 {'seed': {5: '10', 6: '11', 7: '12', 8: '13'},...
dtype: object
You can also change the arguments of the .to_dict() method to get a different dict layout, see: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
Or maybe this is what you want:
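To illustrate how the `orient` argument of `to_dict()` changes the layout, here is a small sketch on a two-row frame. The frame `df` is made up for the example; `orient="list"` and `orient="records"` are two of the documented options.

```python
import pandas as pd

# Small made-up frame, just to compare layouts
df = pd.DataFrame({"seed": ["1", "2"], "time_spent": [0.3, 0.4]})

# Default orient="dict": {column: {index: value}}
as_columns = df.to_dict()

# orient="list": {column: [values]}
as_lists = df.to_dict(orient="list")

# orient="records": one {column: value} dict per row
as_records = df.to_dict(orient="records")
```

None of these layouts maps seed to time_spent directly, which is why the asker's target format needs `dict(zip(...))`.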
In [18]: data.groupby(['id']).apply(lambda d: pd.Series(np.arange(len(d))).apply(lambda x: dict(zip(d['seed'].iloc[:x+1], d['time_spent'].iloc[:x+1]))))
Out[18]:
id
1 0 {'1': 0.3}
1 {'1': 0.3, '2': 0.4}
2 {'1': 0.3, '2': 0.4, '3': 0.5}
3 {'1': 0.3, '2': 0.4, '3': 0.5, '4': 0.6}
4 {'1': 0.3, '2': 0.4, '3': 0.5, '4': 0.6, '5': ...
2 0 {'10': 10.1}
1 {'10': 10.1, '11': 11.1}
2 {'10': 10.1, '11': 11.1, '12': 12.1}
3 {'10': 10.1, '11': 11.1, '12': 12.1, '13': 13.1}
dtype: object