Question

I have a DataFrame containing users' movie ratings. The .describe() method gives the following information:

         userId         ratings_per_user
count   137658.000000   137658.000000
mean    69247.463068    65.745514
std     39977.471244    67.071719
min     1.000000        1.000000
25%     34628.250000    22.000000
50%     69249.500000    41.000000
75%     103868.750000   84.000000
max     138493.000000   462.000000

Now I want to sample X users from each quartile:

X users with number of votes between min and 25%
X users with number of votes between 25% and 50%
X users with number of votes between 50% and 75%
X users with number of votes between 75% and max

The end result should be a list of userIds of size 4X.
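For reference, the four groups listed above can be sketched with pd.qcut and GroupBy.sample (available in pandas 1.1 and later). This is only a minimal sketch: the ratings_user frame below is made-up stand-in data, and the "quartile" column and its Q1..Q4 labels are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the per-user rating counts described above.
rng = np.random.default_rng(0)
ratings_user = pd.DataFrame({
    "userId": np.arange(1, 1001),
    "ratings_per_user": rng.integers(1, 463, size=1000),
})

X = 5  # number of users to draw from each quartile

# Label every user with the quartile of ratings_per_user it falls into.
# "quartile" and the Q1..Q4 labels are illustrative names, not from the question.
ratings_user["quartile"] = pd.qcut(
    ratings_user["ratings_per_user"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Draw X users from each quartile without replacement.
sampled = ratings_user.groupby("quartile", observed=True).sample(n=X, random_state=0)

# Flat list of 4 * X userIds.
user_ids = sampled["userId"].tolist()
```

Note that pd.qcut raises an error when heavy ties make two quantile edges coincide, so real data with many identical counts may need explicit bin edges instead.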

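What I have so far is clumsy code that splits the DataFrame into 4 separate DataFrames by quartile, samples X users from each one, and then concatenates the results. I am hoping there is a simpler (and faster) solution.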

Edit:

A better solution:

import numpy as np
import pandas as pd

# total sample size desired
N = 100

# number of ratings per user
ratings_user = ratings.groupby('userId').size().reset_index(name='ratings_per_user')

# bucket users using the quartile boundaries from .describe() (22, 41, 84)
ratings_user['categ'] = np.where(ratings_user['ratings_per_user'] >= 84.0, 'A',
                        np.where(ratings_user['ratings_per_user'] >= 41.0, 'B',
                        np.where(ratings_user['ratings_per_user'] > 22.0, 'C', 'D')))

# sample each category in proportion to its size, then shuffle the rows
ratings = (ratings_user.groupby('categ', group_keys=False)
           .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(ratings_user)))))
           .sample(frac=1)
           .reset_index(drop=True))
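The snippet above draws from each category in proportion to its size, so the total is roughly N rather than X per category. A rough sketch of the same proportional idea using pd.cut and GroupBy.sample(frac=...) follows. The data here is made up, and the bins are right-closed, which is an assumption that treats the boundary values 41 and 84 slightly differently from the np.where thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the per-user rating counts.
rng = np.random.default_rng(0)
ratings_user = pd.DataFrame({
    "userId": np.arange(1, 2001),
    "ratings_per_user": rng.integers(1, 463, size=2000),
})

N = 100  # total sample size desired

# Quartile boundaries taken from the .describe() output (22, 41, 84).
# Right-closed bins: D = (0, 22], C = (22, 41], B = (41, 84], A = (84, inf).
ratings_user["categ"] = pd.cut(
    ratings_user["ratings_per_user"],
    bins=[0, 22, 41, 84, np.inf],
    labels=["D", "C", "B", "A"],
)

# Using the same fraction in every category gives proportional allocation.
# Per-category rounding means the total can differ from N by a row or two.
frac = N / len(ratings_user)
sample = (
    ratings_user.groupby("categ", observed=True)
    .sample(frac=frac, random_state=0)
    .sample(frac=1, random_state=0)  # shuffle the combined rows
    .reset_index(drop=True)
)
```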

Stratified sampling by interval in pandas

0 answers: