
Question

I have a pandas DataFrame with roughly 200 million rows that looks like this:

UserID  MovieID  Rating
1       455      5
2       411      4
1       288      2
2       300      3
2       137      5
1       300      3

...

I want to get the top N movies for each user, sorted by rating in descending order, so for N = 2 the output should look like this:

UserID  MovieID  Rating
1       455      5
1       300      3
2       137      5
2       411      4

When I try to do it like this, I get a "memory error" caused by the "groupby" (I have 8 GB of RAM on my machine):

df.sort_values(by=['rating']).groupby('userID').head(2)

Any suggestions?

Answer 1

Quick and dirty answer

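Given that the sort works, you might be able to squeak by with the following, which uses a Numpy-based, memory-efficient alternative to the Pandas groupby: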

import io

import numpy as np
import pandas as pd

d = '''UserID  MovieID  Rating
1       455      5
2       411      4
3       207      5
1       288      2
3        69      2
2       300      3
3       410      4
3       108      3
2       137      5
3       308      3
1       300      3'''
# io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide
df = pd.read_csv(io.StringIO(d), sep=r'\s+', index_col='UserID')

# sort so that each user's rows are contiguous, with the highest ratings last
df = df.sort_values(['UserID', 'Rating'])

# carefully handle the construction of ix to ensure no copies are made
ix = np.zeros(df.shape[0], np.int8)
np.subtract(df.index.values[1:], df.index.values[:-1], out=ix[:-1])

# the above assumes that UserID is the index of df. If it's just a column, use this instead
#np.subtract(df['UserID'].values[1:], df['UserID'].values[:-1], out=ix[:-1])

# mark the last two rows of each user (this is what hard-codes N = 2)
ix[:-1] += ix[1:]
ix[-2:] = 1
# np.bool_ replaces the np.bool alias, which was removed in NumPy 1.24
ix = ix.view(np.bool_)
print(df.iloc[ix])

Output:

        MovieID  Rating
UserID                 
1           300       3
1           455       5
2           411       4
2           137       5
3           410       4
3           207       5
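
The mask above is hard-wired to N = 2. As a rough sketch only (it is not part of the original answer), the same boundary idea can be generalized to any N by measuring each row's distance from the last row of its group. The helper name `top_n_mask` is hypothetical, and the sketch assumes the IDs are already sorted so that each user's rows are contiguous with the best ratings last. It also gives up some of the memory savings: it allocates a few integer temporaries of 8 bytes per row, which may be too much for 200 million rows on an 8 GB machine.

```python
import numpy as np


def top_n_mask(sorted_ids, n):
    """Boolean mask selecting the last n rows of every run of equal ids.

    Hypothetical helper; assumes sorted_ids is sorted so that equal ids
    are contiguous.
    """
    sorted_ids = np.asarray(sorted_ids)
    size = sorted_ids.shape[0]
    # True where a row is the last row of its group
    is_last = np.ones(size, dtype=bool)
    is_last[:-1] = sorted_ids[1:] != sorted_ids[:-1]
    # position of the last row of each group, in group order
    last_pos = np.flatnonzero(is_last)
    # zero-based group number of every row
    group_id = np.cumsum(is_last) - is_last
    # distance of every row from the end of its own group
    dist_from_end = last_pos[group_id] - np.arange(size)
    return dist_from_end < n


# UserID values of the sample data after sorting
ids = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
mask = top_n_mask(ids, 2)
print(np.flatnonzero(mask))
```

With the sorted sample frame, `df.iloc[top_n_mask(df.index.values, 2)]` should select the same rows as the hand-built `ix`.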

More memory-efficient answer

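For something this large, instead of a Pandas DataFrame you should just use Numpy arrays (which Pandas uses under the hood to store its data). If you use an appropriate structured array, you should be able to fit all of your data into a single array of roughly this size: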

2 * 10**8 * (4 + 2 + 1)
1,400,000,000 bytes
or ~1.304 GiB

This means that it (plus a few temporary objects used during the calculation) should fit easily into your 8 GB of system memory.
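
The estimate above can be checked directly from the dtype, since a packed structured dtype reports its per-row size through `itemsize`. This is only a small sketch of the arithmetic; the row count of 2 * 10**8 is the figure from the question, and the field types are the ones proposed in this answer.

```python
import numpy as np

# Packed (unaligned) record: 4 + 2 + 1 = 7 bytes per row
dfdtype = np.dtype([('UserID', np.uint32),
                    ('MovieID', np.uint16),
                    ('Rating', np.uint8)])

rows = 2 * 10**8
total_bytes = rows * dfdtype.itemsize
total_gib = total_bytes / 2**30

print(dfdtype.itemsize, total_bytes, round(total_gib, 3))
```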

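Here are some details:

The trickiest part will be initializing the structured array. You may be able to get away with initializing the array manually and then copying the data over: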

dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
arr = np.empty(df.shape[0], dtype=dfdtype)
arr['UserID'] = df.index.values
for n in dfdtype.names[1:]:
    arr[n] = df[n].values

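If the above causes an out-of-memory error, you will have to build and populate a structured array instead of a dataframe, right from the start of your program: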

arr = np.empty(rowcount, dtype=dfdtype)
...
adapt the code you use to populate the df and put it here
...

Once you have arr, here is how you would do the groupby you are aiming for:

arr.sort(order=['UserID', 'Rating'])

ix = np.zeros(arr.shape[0], np.int8)
np.subtract(arr['UserID'][1:], arr['UserID'][:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(np.bool_)
print(arr[ix])

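The above size calculation and dtype assume that no UserID is larger than 4,294,967,295, no MovieID is larger than 65535, and no rating is larger than 255. This means that the columns of your dataframe can be (np.uint32, np.uint16, np.uint8) without losing any data.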
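
Those range assumptions are cheap to verify before downcasting. The following is a minimal sketch, not part of the original answer: `fits_dtype` is a hypothetical helper, and the three small arrays are made-up stand-ins for the real columns (the MovieID 70000 is deliberately out of range for np.uint16).

```python
import numpy as np


def fits_dtype(values, dtype):
    """Return True if every value lies within the range of the integer dtype."""
    values = np.asarray(values)
    if values.size == 0:
        return True
    info = np.iinfo(dtype)
    return bool(values.min() >= info.min and values.max() <= info.max)


# Made-up stand-ins for the real columns
user_ids = np.array([1, 2, 3], dtype=np.int64)
movie_ids = np.array([455, 411, 70000], dtype=np.int64)
ratings = np.array([5, 4, 2], dtype=np.int64)

checks = {
    'UserID': fits_dtype(user_ids, np.uint32),
    'MovieID': fits_dtype(movie_ids, np.uint16),
    'Rating': fits_dtype(ratings, np.uint8),
}
print(checks)
```

If any check fails, the corresponding field needs a wider type and the size estimate grows accordingly.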

Answer 2

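If you want to keep using pandas, you can split your data into batches, for example 10K rows at a time. You can split the data after loading the source data into a DF or, even better, load the data in batches.
You can save the result of each iteration (batch) into a dictionary, keeping only the number of movies you are interested in: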

{userID: {MovieID_1: score1, MovieID_2: s2, ... MovieID_N: sN}, ...}

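and update the nested dictionary on each iteration, keeping only the best N movies per user.

This way you will be able to analyze data much larger than your computer's memory.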
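
One possible shape for this batching approach is sketched below. It is an illustration under stated assumptions rather than the answerer's code: the function name `top_n_per_user` is hypothetical, the per-user container is a small heap of (Rating, MovieID) tuples instead of the nested dictionary shown above, ties are broken arbitrarily by MovieID, and the chunk size of 4 is only there so the sample data spans several batches. Each batch is first reduced with pandas, so only a handful of rows per user reach the Python loop.

```python
import heapq
import io

import pandas as pd


def top_n_per_user(chunks, n):
    """Keep the n best (Rating, MovieID) pairs per user across all chunks."""
    best = {}  # UserID -> min-heap of (Rating, MovieID), at most n entries
    for chunk in chunks:
        # reduce the batch first: at most n rows per user survive
        reduced = (chunk.sort_values('Rating', ascending=False)
                        .groupby('UserID')
                        .head(n))
        rows = reduced[['UserID', 'MovieID', 'Rating']].itertuples(index=False, name=None)
        for user, movie, rating in rows:
            heap = best.setdefault(user, [])
            if len(heap) < n:
                heapq.heappush(heap, (rating, movie))
            elif (rating, movie) > heap[0]:
                # better than the worst entry kept so far: swap it in
                heapq.heapreplace(heap, (rating, movie))
    # best rating first for every user
    return {user: sorted(heap, reverse=True) for user, heap in best.items()}


d = '''UserID  MovieID  Rating
1       455      5
2       411      4
3       207      5
1       288      2
3        69      2
2       300      3
3       410      4
3       108      3
2       137      5
3       308      3
1       300      3'''

# chunksize makes read_csv yield DataFrames of at most 4 rows each
chunks = pd.read_csv(io.StringIO(d), sep=r'\s+', chunksize=4)
result = top_n_per_user(chunks, 2)
print(result)
```

For real data the same function could be fed from `pd.read_csv(path, chunksize=...)` with a much larger chunk size; only one batch plus the per-user heaps are held in memory at a time.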

Finding the top N values in each group, 200 million rows
