Question

I have this DataFrame:

import pandas as pd

df = pd.DataFrame({'userId': [10,20,10,20,10,20,60,90,60,90,60,90,30,40,30,40,30,40,50,60,50,60,50,60],
                   'movieId': [500,500,800,800,700,700,1100,1100,1900,1900,2000,2000,1600,1600,1901,1901,3000,3000,3025,3025,4000,4000,500,500],  
                   'ratings': [3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5,3.5,4.5,2.0,5.0,4.0,1.5,3.5,4.5]})

df
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
10      60     2000      2.0
11      90     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
16      30     3000      3.5
17      40     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
22      50      500      3.5
23      60      500      4.5

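In this DataFrame, the two users in each pair have movies in common.
For this purpose the userIds can be read as pairs, e.g. [(10,20),(60,90),(30,40),(50,60)],
because the users in each of these pairs rated the same movies. A new pair starts every 6 rows.
In addition, a user can appear in more than one pair; for example, userId = 60 appears in two pairs.
I want to select, for example, the first 4 rows from each pair.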

**Expected Outcome**

    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0

6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5

12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5

18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5

Answer 1

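You can use Series.map to give each row the tuple of userIds that rated its movieId, then call GroupBy.head on that key. Note that the output below appears to come from a different version of the sample data (rows 8 to 11 show users 30 and 40, where the question lists 60 and 90), so it will not match the DataFrame above: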

s = df['movieId'].map(df.groupby('movieId')['userId'].apply(tuple))

df = df.groupby(s).head(6)
print (df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
4       10      700      4.0
5       20      700      1.5
8       30     1900      3.5
9       40     1900      4.5
10      30     2000      2.0
11      40     2000      5.0
12      30     1600      4.0
13      40     1600      1.5
16      50     3000      3.5
17      60     3000      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
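
To see which key the snippet above actually groups on, it can help to print the mapped tuples. The following is a minimal sketch, assuming the sample data exactly as posted in the question; the names `users_per_movie` and `first_four` are illustrative and not part of the original answer. With that data, movieId 500 is rated by users 10 and 20 and again by users 50 and 60, so its key is a four-element tuple.

```python
import pandas as pd

# Sample data as posted in the question.
df = pd.DataFrame({
    'userId': [10, 20] * 3 + [60, 90] * 3 + [30, 40] * 3 + [50, 60] * 3,
    'movieId': [500, 500, 800, 800, 700, 700, 1100, 1100, 1900, 1900, 2000, 2000,
                1600, 1600, 1901, 1901, 3000, 3000, 3025, 3025, 4000, 4000, 500, 500],
    'ratings': [3.5, 4.5, 2.0, 5.0, 4.0, 1.5, 3.5, 4.5, 3.5, 4.5, 2.0, 5.0,
                4.0, 1.5, 3.5, 4.5, 3.5, 4.5, 2.0, 5.0, 4.0, 1.5, 3.5, 4.5],
})

# One tuple per movieId, holding every userId that rated that movie.
users_per_movie = df.groupby('movieId')['userId'].apply(tuple)

# Broadcast the tuple back onto each row so it can serve as a group key.
s = df['movieId'].map(users_per_movie)

print(users_per_movie.loc[500])  # key for the movie shared by two pairs
print(s.nunique())               # number of distinct group keys

# Keep at most 4 rows per key, in original row order.
first_four = df.groupby(s).head(4)
print(first_four.index.tolist())
```

Because rows 22 and 23 share their key with rows 0 and 1 rather than with the rest of the (50, 60) block, grouping on movieId alone does not reproduce the expected outcome for this data, which is what the run-based variant in the edit addresses.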

Edit:

If you need to group by runs of consecutive movieId values instead:

tmp = df['movieId'].ne(df['movieId'].shift()).cumsum()
s = tmp.map(df.groupby(tmp)['userId'].apply(tuple))
df = df.groupby(s).head(4)
print (df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
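
The run-labelling step is easier to follow with its intermediate values printed. This is a small sketch under the same assumption (the question's sample data, with the ratings column omitted for brevity); `run_id` and `pair_key` are just descriptive names for what the answer calls `tmp` and `s`.

```python
import pandas as pd

# Sample data from the question (ratings omitted; they do not affect the grouping).
df = pd.DataFrame({
    'userId': [10, 20] * 3 + [60, 90] * 3 + [30, 40] * 3 + [50, 60] * 3,
    'movieId': [500, 500, 800, 800, 700, 700, 1100, 1100, 1900, 1900, 2000, 2000,
                1600, 1600, 1901, 1901, 3000, 3000, 3025, 3025, 4000, 4000, 500, 500],
})

# A new run starts wherever movieId differs from the previous row;
# the cumulative sum turns those boundaries into run numbers 1, 2, 3, ...
run_id = df['movieId'].ne(df['movieId'].shift()).cumsum()
print(run_id.tolist())

# Each run holds the two users of one pair, so the tuple of its userIds
# identifies the pair even when the same movieId shows up again later.
pair_key = run_id.map(df.groupby(run_id)['userId'].apply(tuple))
print(pair_key.unique())

# First 4 rows of every pair.
result = df.groupby(pair_key).head(4)
print(result.index.tolist())
```

Labelling runs rather than movieIds keeps the second occurrence of movieId 500 (rows 22 and 23) inside the (50, 60) pair.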

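Edit:

Would it be better to drop every 2 rows after selecting the first 4? That would do the job. Any suggestions? I mean, select 4 rows, drop the next 2, select another 4, drop the next 2, and so on.

You can take the index values modulo 6 and then filter on that condition with boolean indexing: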

# This assumes the default RangeIndex; if the index is different, reset it first:
# df = df.reset_index(drop=True)
df = df[df.index % 6 < 4]
print (df)
    userId  movieId  ratings
0       10      500      3.5
1       20      500      4.5
2       10      800      2.0
3       20      800      5.0
6       60     1100      3.5
7       90     1100      4.5
8       60     1900      3.5
9       90     1900      4.5
12      30     1600      4.0
13      40     1600      1.5
14      30     1901      3.5
15      40     1901      4.5
18      50     3025      2.0
19      60     3025      5.0
20      50     4000      4.0
21      60     4000      1.5
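
If the index is not the default RangeIndex and resetting it is undesirable, the same modulo test can be applied to row positions instead of index labels. This is a sketch that assumes, as the question states, that every pair occupies exactly 6 consecutive rows; the index starting at 100 is made up purely to show a case where `df.index % 6` would pick the wrong rows.

```python
import numpy as np
import pandas as pd

# Sample data from the question (ratings omitted), with a hypothetical
# index that does not start at 0.
df = pd.DataFrame({
    'userId': [10, 20] * 3 + [60, 90] * 3 + [30, 40] * 3 + [50, 60] * 3,
    'movieId': [500, 500, 800, 800, 700, 700, 1100, 1100, 1900, 1900, 2000, 2000,
                1600, 1600, 1901, 1901, 3000, 3000, 3025, 3025, 4000, 4000, 500, 500],
}, index=range(100, 124))

# Row positions 0..n-1, independent of the index labels.
positions = np.arange(len(df))

# Keep positions 0-3 of every block of 6 and drop positions 4-5.
result = df[positions % 6 < 4]
print(result.index.tolist())
```

Working on positions keeps the original index labels in the result, which can matter if they are needed to join back to other data.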
