I have a data frame with two columns, movieid and the tag applied to that movie, in the following format:
movieid tag
1 animation
1 pixar
1 animation
2 comedy
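For each movie ID, I want to count how many times each tag was applied, and also count the total number of tags applied to each movie. I am new to this.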
Answer 0 (score: 0)
Here is how to do it in PySpark.
Create the df:
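from pyspark.sql import SQLContext

# Assumes sc is an existing SparkContext, as provided by the pyspark shell
sqlContext = SQLContext(sc)
data = [(1,'animation'),(1,'pixar'),(1,'animation'),(2,'comedy')]
RDD = sc.parallelize(data)
orders_df = sqlContext.createDataFrame(RDD,["movieid","tag"])
orders_df.show()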
+-------+---------+
|movieid|      tag|
+-------+---------+
|      1|animation|
|      1|    pixar|
|      1|animation|
|      2|   comedy|
+-------+---------+
Compute the counts:
orders_df.groupBy(['movieid','tag']).count().show() # for each movie ID, how many times each tag is applied
+-------+---------+-----+
|movieid|      tag|count|
+-------+---------+-----+
|      1|    pixar|    1|
|      1|animation|    2|
|      2|   comedy|    1|
+-------+---------+-----+
orders_df.groupBy(['movieid']).count().show() #number of tags applied to each movie
+-------+-----+
|movieid|count|
+-------+-----+
|      1|    3|
|      2|    1|
+-------+-----+
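If you are new to this, it may help to see what the two groupBy(...).count() calls compute without running Spark at all. The following is only a plain-Python sketch, not PySpark code. It uses collections.Counter on the same four sample rows, and the names tag_counts and movie_counts are made up for illustration.

```python
from collections import Counter

# Same sample rows as in the DataFrame above.
data = [(1, 'animation'), (1, 'pixar'), (1, 'animation'), (2, 'comedy')]

# Mirrors groupBy(['movieid', 'tag']).count():
# one entry per distinct (movieid, tag) pair.
tag_counts = Counter(data)

# Mirrors groupBy(['movieid']).count():
# one entry per movieid, counting every tag row (duplicates included).
movie_counts = Counter(movieid for movieid, _ in data)

print(dict(tag_counts))
print(dict(movie_counts))
```

Note that the per-movie count includes repeated tags, so movie 1 gets 3 rather than 2, which matches the Spark output above.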