I have a DataFrame with three columns (CUST_ID, TOPIC, VALUE):
data = pd.DataFrame({"CUST_ID":["C1", "C1", "C2", "C3", "C3"],
"TOPIC":["TOPIC1", "TOPIC2", "TOPIC2", "TOPIC1", "TOPIC2"],
"VALUE":[10, 15, 8, 5, 20]})
I want to group by CUST_ID and turn the TOPIC column into two columns, TOPIC_a_VALUE and TOPIC_b_VALUE.
I know how to do this in SQL, but how can I do it in pandas?
SELECT CUST_ID,
MAX(CASE WHEN TOPIC = "TOPIC1" THEN VALUE ELSE 0 END) AS TOPIC_a_VALUE,
MAX(CASE WHEN TOPIC = "TOPIC2" THEN VALUE ELSE 0 END) AS TOPIC_b_VALUE
FROM TABLE
GROUP BY CUST_ID
The result I want looks like this:
result = pd.DataFrame({"CUST_ID":["C1", "C2", "C3"],
"TOPIC_a_VALUE":[10, np.nan, 5],
"TOPIC_b_VALUE":[15, 8, 20]})
Answer 0 (score: 1)
IIUC, you need something like this:
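# Pivot TOPIC into columns; keeping TOPIC in the index leaves one row per (CUST_ID, TOPIC)
df = data.pivot_table(index=['CUST_ID', 'TOPIC'], columns=['TOPIC']).reset_index()
# Flatten the two-level column names, e.g. ('VALUE', 'TOPIC1') -> 'VALUETOPIC1'
df.columns = [''.join(col) for col in df.columns.values]
# Back-fill the rows of customers that appear more than once, then keep the first row per customer
dup = df.CUST_ID.duplicated(keep=False)
df.loc[dup] = df.loc[dup].bfill()
df = df.drop_duplicates('CUST_ID')
df = df.drop([col for col in df.columns if 'Key' in col], axis=1).reset_index(drop=True)
print(df)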
  CUST_ID   TOPIC  VALUETOPIC1  VALUETOPIC2
0      C1  TOPIC1         10.0         15.0
1      C2  TOPIC2          NaN          8.0
2      C3  TOPIC1          5.0         20.0
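The column-flattening step is the least obvious part, so here is a minimal sketch of it in isolation. The `demo` frame and its hand-built two-level columns are assumptions made up for illustration; they only imitate the shape that the pivot followed by `reset_index()` produces. The second variant, which joins with an underscore, is not part of the answer above and is shown only as a possible alternative.

```python
import pandas as pd

# Hypothetical two-level columns, imitating the shape left by the pivot above.
columns = pd.MultiIndex.from_tuples(
    [("CUST_ID", ""), ("VALUE", "TOPIC1"), ("VALUE", "TOPIC2")]
)
demo = pd.DataFrame([["C1", 10.0, 15.0]], columns=columns)

# Each column label is a tuple such as ('VALUE', 'TOPIC1');
# joining its parts gives a single flat string.
flat_names = ["".join(col) for col in demo.columns.values]

# Alternative: join with an underscore and skip empty parts for more readable names.
underscored = ["_".join(part for part in col if part) for col in demo.columns.values]

demo.columns = flat_names
print(flat_names)
print(underscored)
```

With the flat names in place, the later steps can refer to `VALUETOPIC1` and `VALUETOPIC2` as ordinary columns.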
Answer 1 (score: 1)
Perhaps more readable than the other suggested answers, I would go with:
data.groupby(['CUST_ID', 'TOPIC'])['VALUE'].max().unstack()
# Output
# TOPIC    TOPIC1  TOPIC2
# CUST_ID
# C1         10.0    15.0
# C2          NaN     8.0
# C3          5.0    20.0
You can of course rename the columns if you like:
.rename(columns={'TOPIC1': 'TOPIC_a_VALUE', 'TOPIC2': 'TOPIC_b_VALUE'})
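Put together, the two snippets might look like the sketch below. It reuses the sample data from the question; the `rename_axis` and `reset_index` calls are my own additions (not strictly needed) to drop the leftover `TOPIC` axis label and turn `CUST_ID` back into a regular column, so the frame has the layout the question asks for.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "CUST_ID": ["C1", "C1", "C2", "C3", "C3"],
    "TOPIC": ["TOPIC1", "TOPIC2", "TOPIC2", "TOPIC1", "TOPIC2"],
    "VALUE": [10, 15, 8, 5, 20],
})

wide = (
    data.groupby(["CUST_ID", "TOPIC"])["VALUE"]
    .max()                       # one value per (CUST_ID, TOPIC) pair
    .unstack()                   # TOPIC values become columns; missing pairs become NaN
    .rename(columns={"TOPIC1": "TOPIC_a_VALUE", "TOPIC2": "TOPIC_b_VALUE"})
    .rename_axis(columns=None)   # assumption: drop the leftover 'TOPIC' label on the column axis
    .reset_index()               # assumption: CUST_ID as a regular column
)
print(wide)
```

Customers without a given topic (here C2 and TOPIC1) end up with NaN rather than 0, which is what the desired result in the question shows.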
Answer 2 (score: 0)
Your query doesn't make sense in SQL. I think you intended this:
SELECT CUST_ID,
MAX(CASE WHEN TOPIC = 'a' THEN VALUE ELSE 0 END) AS TOPIC_a_VALUE,
MAX(CASE WHEN TOPIC = 'b' THEN VALUE ELSE 0 END) AS TOPIC_b_VALUE
FROM TABLE
GROUP BY CUST_ID;
This doesn't directly help with a pandas solution, but at least the query makes sense.
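For reference, here is a rough sketch of how the shape of that query might map onto pandas. It assumes the TOPIC values are the question's `TOPIC1` and `TOPIC2` rather than `'a'` and `'b'`, and that all values are non-negative, so that `ELSE 0` only matters when a customer has no row for a topic. `fill_value=0` plays the role of `ELSE 0` here; the name `sql_like` is just illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "CUST_ID": ["C1", "C1", "C2", "C3", "C3"],
    "TOPIC": ["TOPIC1", "TOPIC2", "TOPIC2", "TOPIC1", "TOPIC2"],
    "VALUE": [10, 15, 8, 5, 20],
})

sql_like = (
    data.pivot_table(
        index="CUST_ID",    # GROUP BY CUST_ID
        columns="TOPIC",    # one output column per TOPIC value
        values="VALUE",
        aggfunc="max",      # MAX(...)
        fill_value=0,       # stands in for ELSE 0 when a customer lacks a topic
    )
    .rename(columns={"TOPIC1": "TOPIC_a_VALUE", "TOPIC2": "TOPIC_b_VALUE"})
    .reset_index()
)
print(sql_like)
```

Note that this gives C2 a 0 for TOPIC_a_VALUE, as the SQL would, whereas the result requested in the question has NaN there; leaving out `fill_value` keeps the NaN.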
Answer 3 (score: 0)
You can use groupby as follows:
df=data.pivot_table(index=['CUST_ID','TOPIC'],columns=['TOPIC']).reset_index()
df.columns=[''.join(col) for col in df.columns.values]
df1 = df.groupby('CUST_ID').ffill()\
.groupby('CUST_ID').last()\
.reset_index()
Clean up the DataFrame:
df1 = df1.drop(columns=['TOPIC'])\
.rename(columns={'VALUETOPIC1': 'TOPIC_a_VALUE', 'VALUETOPIC2': 'TOPIC_b_VALUE'})