How do I calculate daily differences per user and reshape a pandas DataFrame?

Time: 2020-09-19 23:17:03

Tags: python pandas date dataset transpose

I am working with a pandas DataFrame in Python, and it currently has the following structure:

>>> import pandas as pd
>>> d = {'date': ['15-Sep','16-Sep','17-Sep','18-Sep','15-Sep','16-Sep','17-Sep','18-Sep','15-Sep','16-Sep','17-Sep','18-Sep'],
...      'user': ['A','A','A','A','B','B','B','B','C','C','C','C'],
...      'sales': [5,8,6,7,9,12,11,11,11,15,8,6]}
>>> df = pd.DataFrame(data=d)
>>> df
      date user  sales
0   15-Sep    A      5
1   16-Sep    A      8
2   17-Sep    A      6
3   18-Sep    A      7
4   15-Sep    B      9
5   16-Sep    B     12
6   17-Sep    B     11
7   18-Sep    B     11
8   15-Sep    C     11
9   16-Sep    C     15
10  17-Sep    C      8
11  18-Sep    C      6


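I would like to transform (transpose?) it to get each user's daily difference relative to the previous day. For the example above, the result would be: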

>>> d = {'user': ['A','B','C'],
...      '16-Sep': [3,3,4],
...      '17-Sep': [-2,-1,-7],
...      '18-Sep': [1,0,-2]}
>>> df = pd.DataFrame(data=d)
>>> df
  user  16-Sep  17-Sep  18-Sep
0    A       3      -2       1
1    B       3      -1       0
2    C       4      -7      -2


In this goal table, for example, user A's value for 17-Sep is -2, meaning A sold 2 fewer items on 17-Sep than on 16-Sep.

What is the best way to do this? Are there any examples of how to do it? I could not find a similar question.

2 answers:

Answer 0: (score: 2)

  • .sort_values the dataframe by user and date.
  • .groupby('user', as_index=False) and compute the difference with .diff().
    • If as_index=False is not included, TypeError: rename() got an unexpected keyword argument 'columns' occurs.
  • df.join the .groupby result.
  • .pivot the dataframe and .dropna.

import pandas as pd

# setup test dataframe
d = {'date': ['15-Sep','16-Sep','17-Sep','18-Sep','15-Sep','16-Sep','17-Sep','18-Sep','15-Sep','16-Sep','17-Sep','18-Sep'], 'user': ['A','A','A','A','B','B','B','B','C','C','C','C'], 'sales': [5,8,6,7,9,12,11,11,11,15,8,6]}
df = pd.DataFrame(data=d)

# groupby and join to df
dfg = df.sort_values(['user', 'date']).join(df.groupby('user', as_index=False)['sales'].diff().rename(columns={'sales': 'sales_diff'}))

# pivot the dataframe into the correct shape
dfp = dfg.pivot(columns='date', index='user', values='sales_diff').reset_index().dropna(axis=1)

# remove the name of the columns (e.g. date)
dfp.columns.name = None

# display(dfp)
  user  16-Sep  17-Sep  18-Sep
0    A     3.0    -2.0     1.0
1    B     3.0    -1.0     0.0
2    C     4.0    -7.0    -2.0
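
As a possible variation on the steps above, the same table can also be built by pivoting first and then differencing across the date columns. This is only a sketch and not part of the original answer: the names `wide`, `wide_diff` and `result` are made up for illustration, and it assumes every user has exactly one row per date and that the date labels sort correctly as strings (true for this sample).

```python
import pandas as pd

# same test data as in the question
d = {'date': ['15-Sep', '16-Sep', '17-Sep', '18-Sep'] * 3,
     'user': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
     'sales': [5, 8, 6, 7, 9, 12, 11, 11, 11, 15, 8, 6]}
df = pd.DataFrame(data=d)

# one row per user, one column per date (pivot sorts the column labels)
wide = df.pivot(index='user', columns='date', values='sales')

# difference from the previous date column; the first date has no
# predecessor, so its all-NaN column is dropped before casting to int
wide_diff = wide.diff(axis=1).iloc[:, 1:].astype(int)

# remove the name of the columns (e.g. date) and restore 'user' as a column
wide_diff.columns.name = None
result = wide_diff.reset_index()
print(result)
```

Because the all-NaN first column is dropped before the cast, the values can be kept as integers, which matches the goal table in the question more closely than the float output above.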

Answer 1: (score: 1)

Following Trenton's valid answer, I got an error from the .rename() function, so I added an extra step to work around it.

The following code worked for me:

import pandas as pd

d = {'date': ['15-Sep','16-Sep','17-Sep','18-Sep','15-Sep','16-Sep','17-Sep','18-Sep','15-Sep','16-Sep','17-Sep','18-Sep'],
     'user': ['A','A','A','A','B','B','B','B','C','C','C','C'],
     'sales': [5,8,6,7,9,12,11,11,11,15,8,6]}
df = pd.DataFrame(data=d)
#print("Original dataset\n",df,"\n")

# Sort values on user and date (to obtain proper differences)
df = df.sort_values(['user', 'date'])

# Add a sales_diff column: group by user and take .diff() of sales
df['sales_diff'] = df.groupby(['user'])['sales'].diff()
#print("Added difference sales column\n",df,"\n")

# Pivot to one row per user and one column per date, with sales_diff as the values
dfp = df.pivot(columns='date', index='user', values='sales_diff').reset_index().dropna(axis=1)
#print("Pivot dataset on user\n",dfp,"\n")

# Remove the name of the columns (e.g. date)
dfp.columns.name = None

#print("Goal dataset obtained\n")
print(dfp)
  user  16-Sep  17-Sep  18-Sep
0    A     3.0    -2.0     1.0
1    B     3.0    -1.0     0.0
2    C     4.0    -7.0    -2.0
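
One caveat that seems worth noting: the code above sorts the dates as strings, which works for this sample but would put '01-Oct' before '30-Sep'. A minimal sketch of one way around that is to parse the dates before sorting. The data below is hypothetical, the year 2020 is an assumption (the original data has no year), and the '%b' month names assume an English locale.

```python
import pandas as pd

# hypothetical data that crosses a month boundary (illustration only)
d = {'date': ['30-Sep', '01-Oct', '02-Oct', '30-Sep', '01-Oct', '02-Oct'],
     'user': ['A', 'A', 'A', 'B', 'B', 'B'],
     'sales': [5, 8, 6, 9, 12, 11]}
df = pd.DataFrame(data=d)

# parse the day-month strings; the year is assumed, since the data has none
df['day'] = pd.to_datetime(df['date'] + '-2020', format='%d-%b-%Y')

# sort chronologically, then take the per-user difference
df = df.sort_values(['user', 'day'])
df['sales_diff'] = df.groupby('user')['sales'].diff()

# pivot on the parsed dates so the columns stay in chronological order,
# and drop the first date, which is all NaN
dfp = df.pivot(index='user', columns='day', values='sales_diff').dropna(axis=1)

# turn the column labels back into the original day-month style
dfp.columns = dfp.columns.strftime('%d-%b')
dfp.columns.name = None
dfp = dfp.reset_index()
print(dfp)
```

Pivoting on the parsed column rather than the original strings is what keeps the columns in calendar order, since pivot sorts its column labels.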