My data looks like this:
df =
contact_id date answer
1 1-May 9.00-10.00
1 2-May 9.00-10.00
1 3-May 11.00-12.00
1 4-May 14.00-15.00
1 5-May 10.00-11.00
2 2-May 9.00-10.00
2 3-May 9.00-10.00
3 1-May 14.00-15.00
3 2-May 10.00-11.00
I would like the data to look like this:
df =
contact_id date 9.00-10.00 10.00-11.00 11.00-12.00 14.00-15.00 answer
1 1-May 0 0 0 0 9.00-10.00
1 2-May 1 0 0 0 9.00-10.00
1 3-May 2 0 0 0 11.00-12.00
1 4-May 2 0 1 0 14.00-15.00
1 5-May 2 0 1 1 10.00-11.00
2 2-May 0 0 0 0 9.00-10.00
2 3-May 1 0 0 0 9.00-10.00
3 1-May 0 0 0 0 14.00-15.00
3 2-May 0 0 0 1 10.00-11.00
[UPDATE] I have come up with some logic that seems to work, but not entirely correctly. It would be great if you could advise me here. This is my code:
df1 = pd.concat([df, pd.get_dummies(df["answer"])], axis=1)
df1["target"] = df1["answer"]
df1.drop("answer", inplace=True, axis=1)
tar_list = df1["target"].unique().tolist()
def arith(x):
    NinetoTenCounter = 0
    EleventoTwelveCounter = 0
    TwotoThreeCounter = 0
    TentoElevenCounter = 0
    for i in range(len(x["target"])):
        print(i)
        if x.iloc[i, -1] == "9.00-10.00":
            NinetoTenCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter - 1
            x.loc[i, '10.00-11.00'] = TentoElevenCounter
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter
        elif x.iloc[i, -1] == "11.00-12.00":
            EleventoTwelveCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter
            x.loc[i, '10.00-11.00'] = TentoElevenCounter
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter - 1
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter
        elif x.iloc[i, -1] == "14.00-15.00":
            TwotoThreeCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter
            x.loc[i, '10.00-11.00'] = TentoElevenCounter
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter - 1
        else:
            TentoElevenCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter
            x.loc[i, '10.00-11.00'] = TentoElevenCounter - 1
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter
    return x
df1_arith = df1.groupby("contact_id").apply(arith)
The output I get is as follows:
contact_id date 10.00-11.00 11.00-12.00 14.00-15.00 \
contact_id
1 0 1.0 1-may 0.0 0.0 0.0
1 1.0 2-may 0.0 0.0 0.0
2 1.0 3-may 0.0 0.0 0.0
3 1.0 4-may 0.0 1.0 0.0
4 1.0 5-may 0.0 1.0 1.0
2 5 2.0 2-may 0.0 0.0 0.0
6 2.0 3-may 0.0 0.0 0.0
0 NaN NaN 0.0 0.0 0.0
1 NaN NaN 0.0 0.0 0.0
3 7 3.0 1-may 0.0 0.0 1.0
8 3.0 2-may 1.0 0.0 0.0
0 NaN NaN 0.0 0.0 0.0
1 NaN NaN 0.0 0.0 1.0
9.00-10.00 target
contact_id
1 0 0.0 9.00-10.00
1 1.0 9.00-10.00
2 2.0 11.00-12.00
3 2.0 14.00-15.00
4 2.0 10.00-11.00
2 5 1.0 9.00-10.00
6 1.0 9.00-10.00
0 0.0 NaN
1 1.0 NaN
3 7 0.0 14.00-15.00
8 0.0 10.00-11.00
0 0.0 NaN
1 0.0 NaN
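The function seems to work for the first group, i.e. contact_id 1, but not entirely for the other contact_ids. So my questions are:
1) Why are NaN rows being inserted, and how can I prevent that?
2) Is this the right way to apply a function to each group?
3) If not, what is the right way to apply a function to each group?
Please let me know if I have missed anything. Many thanks, as always.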
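On question 1, a likely cause is that arith reads rows by position (x.iloc[i, -1]) but writes them by label (x.loc[i, ...]). Each group keeps its original row labels, so for contact_id 2 the labels are 5 and 6, and writing to labels 0 and 1 makes pandas append new rows whose other columns are NaN. For contact_id 1 the labels happen to equal the positions, which would explain why only that group looks right. Below is a minimal sketch of that behavior, assuming one hypothetical group shaped like contact_id 2; the names group, by_label and by_position are illustrative only.

```python
import pandas as pd

# One group as groupby would pass it to the function: the original row
# labels (5 and 6) are kept, they are not renumbered from 0.
group = pd.DataFrame(
    {
        "contact_id": [2, 2],
        "target": ["9.00-10.00", "9.00-10.00"],
        "9.00-10.00": [1, 1],  # indicator column that already exists
    },
    index=[5, 6],
)

by_label = group.copy()
for i in range(len(group)):
    # .loc looks up the *label* i. Labels 0 and 1 do not exist here,
    # so pandas appends new rows and leaves the other columns as NaN.
    by_label.loc[i, "9.00-10.00"] = i

by_position = group.copy()
col = by_position.columns.get_loc("9.00-10.00")
for i in range(len(group)):
    # .iloc addresses the i-th row by position, so no rows are added.
    by_position.iloc[i, col] = i

print(by_label)      # four rows: 5, 6 and the appended 0, 1
print(by_position)   # still two rows: 5, 6
```

Writing by position, or calling reset_index(drop=True) on the group before using .loc with a range counter, should avoid the extra rows.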
Answer 0 (score: 2)
IIUC:
In [25]: (df.assign(col=df['answer'])
            .pivot_table(index=['contact_id','date','answer'],
                         columns='col', aggfunc='size')
            .fillna(0)
            .reset_index())
Out[25]:
col contact_id date answer 10.00-11.00 11.00-12.00 14.00-15.00 9.00-10.00
0 1 1-May 9.00-10.00 0.0 0.0 0.0 1.0
1 1 2-May 9.00-10.00 0.0 0.0 0.0 1.0
2 1 3-May 11.00-12.00 0.0 1.0 0.0 0.0
3 1 4-May 14.00-15.00 0.0 0.0 1.0 0.0
4 1 5-May 10.00-11.00 1.0 0.0 0.0 0.0
5 2 2-May 9.00-10.00 0.0 0.0 0.0 1.0
6 2 3-May 9.00-10.00 0.0 0.0 0.0 1.0
7 3 1-May 14.00-15.00 0.0 0.0 1.0 0.0
8 3 2-May 10.00-11.00 1.0 0.0 0.0 0.0
Answer 1 (score: 2)
pd.concat([df, pd.get_dummies(df.answer)], axis=1)
Out[1272]:
contact_id date answer 10.00-11.00 11.00-12.00 14.00-15.00 9.00-10.00
0 1 1-May 9.00-10.00 0 0 0 1
1 1 2-May 9.00-10.00 0 0 0 1
2 1 3-May 11.00-12.00 0 1 0 0
3 1 4-May 14.00-15.00 0 0 1 0
4 1 5-May 10.00-11.00 1 0 0 0
5 2 2-May 9.00-10.00 0 0 0 1
6 2 3-May 9.00-10.00 0 0 0 1
7 3 1-May 14.00-15.00 0 0 1 0
8 3 2-May 10.00-11.00 1 0 0 0
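The indicator columns above mark each row's own answer, whereas the question asks for the number of earlier rows per contact_id with each answer. One possible way to get from one to the other is to take a per-group cumulative sum of the indicators and subtract the current row. This is only a sketch: it assumes the rows are already in date order within each contact_id, and the names dummies, prior_counts, slots and result are illustrative rather than part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "contact_id": [1, 1, 1, 1, 1, 2, 2, 3, 3],
    "date": ["1-May", "2-May", "3-May", "4-May", "5-May",
             "2-May", "3-May", "1-May", "2-May"],
    "answer": ["9.00-10.00", "9.00-10.00", "11.00-12.00", "14.00-15.00",
               "10.00-11.00", "9.00-10.00", "9.00-10.00",
               "14.00-15.00", "10.00-11.00"],
})

# One indicator column per answer. Cast to int so the arithmetic below
# works whether get_dummies returns bool or integer columns.
dummies = pd.get_dummies(df["answer"]).astype(int)

# Running total per contact_id, minus the current row's own indicator,
# leaves the number of earlier rows with each answer.
prior_counts = dummies.groupby(df["contact_id"]).cumsum() - dummies

# get_dummies sorts the labels as strings, so restore the time order.
slots = ["9.00-10.00", "10.00-11.00", "11.00-12.00", "14.00-15.00"]
result = pd.concat(
    [df[["contact_id", "date"]], prior_counts[slots], df["answer"]],
    axis=1,
)
print(result)
```

Because cumsum is computed per group on the whole frame, there is no row-by-row loop and no label versus position mismatch to worry about.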
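Answer 2 (score: 0)
I managed to solve my problem in a very elegant way. If anyone needs it for reference, please comment here and I will share my solution. Thanks.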