Stratified sampling by interval in pandas

Time: 2020-10-21 10:53:38

Tags: python pandas dataframe pandas-groupby sample

I have a DataFrame containing users' movie ratings. The .describe() method gives the following information:

         userId         ratings_per_user
count   137658.000000   137658.000000
mean    69247.463068    65.745514
std     39977.471244    67.071719
min     1.000000        1.000000
25%     34628.250000    22.000000
50%     69249.500000    41.000000
75%     103868.750000   84.000000
max     138493.000000   462.000000

Now I want to draw a sample of X users from each quartile:

X users with number of votes between min and 25%
X users with number of votes between 25% and 50%
X users with number of votes between 50% and 75%
X users with number of votes between 75% and max

The end result should be a list of userId values of size 4X.

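What I have so far is clumsy code that splits the DataFrame into 4 separate DataFrames by quartile, samples X users from each, and then concatenates the results. I am hoping for a simpler (and faster) solution.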

Edit:

A better solution:

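import numpy as np

# Total sample size desired across all groups
N = 100

# Count ratings per user (assumes `ratings` has one row per rating)
ratings_user = ratings.groupby(['userId']).size().reset_index(name='ratings_per_user')

# Assign each user to a group using the quartile boundaries from .describe()
ratings_user['categ'] = np.where(ratings_user['ratings_per_user'] >= 84.0, 'A',
                        np.where(ratings_user['ratings_per_user'] >= 41.0, 'B',
                        np.where(ratings_user['ratings_per_user'] > 22.0, 'C', 'D')))

# Sample from each group in proportion to its size, then shuffle the rows.
# Note: this overwrites the original `ratings` DataFrame.
ratings = (ratings_user.groupby('categ', group_keys=False)
           .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(ratings_user)))))
           .sample(frac=1)
           .reset_index(drop=True))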
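The snippet above allocates the N samples in proportion to each group's size, whereas the question as first stated asks for a fixed X users per quartile. A minimal sketch of that fixed-size variant follows, assuming pandas 1.1 or later (for the groupby `sample` method). The synthetic `ratings_user` frame, the value of `X`, the `Q1` to `Q4` labels, and the `random_state` seeds are all illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ratings_user: one row per user (synthetic data).
rng = np.random.default_rng(0)
ratings_user = pd.DataFrame({
    "userId": np.arange(1, 1001),
    "ratings_per_user": rng.integers(1, 463, size=1000),
})

X = 25  # users to draw from each quartile (assumed value)

# Label each user with the quartile of ratings_per_user.
# qcut bins are right-inclusive, so a user exactly on the 25% boundary falls in Q1.
ratings_user["quartile"] = pd.qcut(
    ratings_user["ratings_per_user"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Draw X users from every quartile, shuffle, and collect the userIds.
sampled = (
    ratings_user.groupby("quartile", observed=True)
    .sample(n=X, random_state=0)
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
user_ids = sampled["userId"].tolist()
```

Two caveats with this sketch: `sample(n=X)` raises an error if any quartile holds fewer than X users, and `pd.qcut` raises an error when heavy ties make two quartile boundaries coincide, so skewed data may need explicit bin edges with `pd.cut` instead.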

0 answers:

No answers