Performing calculations on subsets of subsets of a DataFrame in Python

Time: 2015-01-19 04:05:20

Tags: python pandas dataframe

user_id   char_id   rating
100          33          3
100          44          2
100          33          1
100          44          4
111          55          5
111          44          4
111          55          5

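My DataFrame is formatted like this, and I am trying to perform calculations on rating after grouping by user_id and char_id. It isn't working, but I need to do something like data.groupby('user_id', 'char_id') and then, for each user_id, compute a moving average for each char_id. Any help? I have several thousand user_ids, so I can't go through and select them one at a time.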

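I need to somehow iterate over the user_id column, group all rows with the same user_id together, and keep that form so that each user_id stands alone. Then I need to do the same thing within each user_id, iterating over the char_id subsets and keeping that form, so that I can finally perform calculations on the subsets of subsets of ratings. None of my attempts so far have worked. The closest I have come is: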

def divide_by_user(data):
    for user in data['user_id']:
        user_data = data.where(data['user_id'] == user)
        return user_data
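
For reference, the attempt above returns inside the loop, so only the first user is ever processed, and DataFrame.where keeps the non-matching rows as NaN instead of dropping them. A minimal sketch of a version that collects one filtered frame per user, assuming the sample data shown earlier (the name split_by_user is made up for illustration):

```python
import pandas as pd

# Sample data copied from the question.
data = pd.DataFrame({
    'user_id': [100, 100, 100, 100, 111, 111, 111],
    'char_id': [33, 44, 33, 44, 55, 44, 55],
    'rating':  [3, 2, 1, 4, 5, 4, 5],
})

def split_by_user(data):
    """Return a dict mapping each user_id to that user's rows (hypothetical helper)."""
    subsets = {}
    for user in data['user_id'].unique():
        # Boolean indexing drops non-matching rows; where() would keep them as NaN.
        subsets[user] = data[data['user_id'] == user]
    return subsets

user_frames = split_by_user(data)
print(user_frames[100])
```

This works for a handful of users, but it scans the whole frame once per user, which is why the groupby approach is usually preferable for thousands of user_ids.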

2 Answers:

Answer 0: (score: 2)

There is no need to do this by hand; creating and aggregating subsets like these is exactly what DataFrame.groupby() is for. Create your groupby:

grouped = df.groupby(['user_id', 'char_id'])
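
To see what the groupby object holds, it can be iterated directly. The sketch below assumes the sample data from the question; each key is a (user_id, char_id) tuple and each subset is an ordinary DataFrame:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'user_id': [100, 100, 100, 100, 111, 111, 111],
    'char_id': [33, 44, 33, 44, 55, 44, 55],
    'rating':  [3, 2, 1, 4, 5, 4, 5],
})
grouped = df.groupby(['user_id', 'char_id'])

# Iterating yields one (key, subset) pair per distinct combination.
for (user_id, char_id), subset in grouped:
    print(user_id, char_id, subset['rating'].tolist())

# Number of rows in each (user_id, char_id) subset.
group_sizes = grouped.size()
print(group_sizes)
```

Note that the grouping keys are passed as a single list; groupby('user_id', 'char_id') would treat the second string as a different positional argument.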

You can then apply a function to each subset. It sounds like you want rolling_mean or expanding_mean, both of which are already available in pandas:

df['cum_average'] = grouped['rating'].apply(pd.expanding_mean)
# New column now contains the average rating for each subset,
#   including all values that have been seen so far.
df
Out[43]: 
   user_id  char_id  rating  cum_average
0      100       33       3            3
1      100       44       2            2
2      100       33       1            2
3      100       44       4            3
4      111       55       5            5
5      111       44       4            4
6      111       55       5            5
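
The top-level pd.expanding_mean function used above was deprecated and later removed from pandas. On a recent release, an equivalent can be sketched with the .expanding() method inside transform; this assumes the sample data from the question and is not part of the original answer:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'user_id': [100, 100, 100, 100, 111, 111, 111],
    'char_id': [33, 44, 33, 44, 55, 44, 55],
    'rating':  [3, 2, 1, 4, 5, 4, 5],
})
grouped = df.groupby(['user_id', 'char_id'])

# transform() returns a result aligned with the original index, so it can be
# assigned straight back as a new column.
df['cum_average'] = grouped['rating'].transform(lambda s: s.expanding().mean())
print(df)
```

The values match the output shown above, although current pandas prints them as floats (3.0, 2.0, and so on).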

Demonstrating rolling_mean() with a larger, randomly generated dataset:

import random
import pandas as pd

n_rows = 100  # assumed; the original post does not show this value
df = pd.DataFrame({
    'user_id': [random.choice([100, 111, 112]) for n in range(n_rows)],
    'char_id': [random.choice([33, 44, 55]) for n in range(n_rows)],
    'rating': [random.choice([1, 2, 3, 4, 5]) for n in range(n_rows)]
})
grouped = df.groupby(['user_id', 'char_id'])
df['cum_average'] = grouped['rating'].apply(pd.rolling_mean, window=7)
# Output. The rolling average will be NaN until enough values have been
#   observed for that subset; you can change this using the
#   min_periods argument to rolling_mean
df.sort(columns=['user_id', 'char_id'])
     char_id  rating  user_id  cum_average
3         33       1      100          NaN
19        33       2      100          NaN
22        33       5      100          NaN
34        33       1      100          NaN
47        33       1      100          NaN
48        33       1      100          NaN
49        33       1      100     1.714286
51        33       4      100     2.142857
55        33       2      100     2.142857
60        33       2      100     1.714286
66        33       2      100     1.857143
...
etc.
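
On recent pandas releases, pd.rolling_mean and DataFrame.sort are no longer available; the usual replacements are the .rolling() method and sort_values(). A minimal sketch of the same demonstration follows. The seed, the value of n_rows, the column name rolling_average, and the choice of min_periods=1 are all assumptions made for illustration:

```python
import random
import pandas as pd

random.seed(0)  # assumption: fixed seed so the run is repeatable
n_rows = 100    # assumption: the original post does not show this value

df = pd.DataFrame({
    'user_id': [random.choice([100, 111, 112]) for _ in range(n_rows)],
    'char_id': [random.choice([33, 44, 55]) for _ in range(n_rows)],
    'rating': [random.choice([1, 2, 3, 4, 5]) for _ in range(n_rows)],
})
grouped = df.groupby(['user_id', 'char_id'])

# min_periods=1 yields an average from the first observation onward,
# instead of NaN until 7 values have been seen in that subset.
df['rolling_average'] = grouped['rating'].transform(
    lambda s: s.rolling(window=7, min_periods=1).mean()
)
print(df.sort_values(by=['user_id', 'char_id']).head(10))
```

Dropping min_periods=1 restores the behaviour shown in the output above, where the first six rows of each subset are NaN.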

Answer 1: (score: 0)

Try this, where "df" is the DataFrame:

mean = pd.rolling_mean(df.rating,7)
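
A rolling mean taken over the whole rating column, as in the line above, does not separate users or characters, so each window mixes ratings from different subsets. The sketch below compares it with a grouped version on the sample data from the question, using the current .rolling() method; the window of 2 is an assumption chosen only so that the small sample produces visible values:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'user_id': [100, 100, 100, 100, 111, 111, 111],
    'char_id': [33, 44, 33, 44, 55, 44, 55],
    'rating':  [3, 2, 1, 4, 5, 4, 5],
})

# Ungrouped: each window spans adjacent rows regardless of user or character.
overall = df['rating'].rolling(window=2).mean()

# Grouped: each window only sees ratings from one (user_id, char_id) subset.
per_group = df.groupby(['user_id', 'char_id'])['rating'].transform(
    lambda s: s.rolling(window=2).mean()
)

print(pd.DataFrame({'overall': overall, 'per_group': per_group}))
```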