通过将其他列分组来求和

时间:2019-08-06 20:03:21

标签: dataframe pyspark aggregate

我的数据框如下

Occupation, Genre, Rating

我将所有评分的总和作为totalRating。现在,我想创建特定于职业,流派组合的neeew列w_rating,该列需要(rating> 3)/ totalRating。我的数据框名称为joinRDD,所以我在查询下面进行了书写

resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).withColumn(wa_rating, sum(Rating>3)/totalRating).collect()

但显示错误

 AttributeError: 'GroupedData' object has no attribute 'withColumn'

因此从错误中很明显,我们不能将withColumn与groupby一起使用

所以我的问题是怎么做?

下面是我更新的代码。

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField,StructType,IntegerType,StringType)
from pyspark.sql import Row
from pyspark.sql.functions import sum
import pyspark.sql.functions as F
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("Movielens Analysis").getOrCreate()




def refineMovieDF(row):
    genre=[]
    movieData =row[0].split("|")
    for i in range(len(movieData)-5):
        if int(movieData[i+5]) ==1:
            genre.append((int(movieData[0]),i))
    return genre



ratingSchema =StructType(fields=[StructField("UserId",IntegerType(),True),StructField("MovieId",IntegerType(),True),StructField("Rating",IntegerType(),True),StructField("TimeStamp",IntegerType(),True)])
ratingsDF = spark.read.load("ml-100k/u.data", format="csv",sep="\t", inferSchema=True, header=False,schema=ratingSchema)


genreSchema =StructType(fields=[StructField("Genre",StringType(),True),StructField("GenreId",IntegerType(),True)])
genreDF = spark.read.load("ml-100k/u.genre",format="csv",sep="|",inferSchema=True, header=False,schema=genreSchema)


userSchema =StructType(fields=[StructField("UserId",IntegerType(),True),StructField("Age",IntegerType(),True),StructField("Gender",StringType(),True),StructField("Occupation",StringType(),True),StructField("ZipCode",IntegerType(),True)])
usersDF = spark.read.load("ml-100k/u.user",format="csv",sep="|",inferSchema=True, header=False,schema=userSchema)

movieSchema =StructType(fields=[StructField("MovieRow",StringType(),True)])

movieDF = spark.read.load("ml-100k/u.item",format="csv",inferSchema=True, header=False,schema=movieSchema)



movieRefinedRDD = movieDF.rdd.flatMap(refineMovieDF)
movieSchema =StructType(fields=[StructField("MovieId",IntegerType(),True),StructField("GenreId",IntegerType(),True)])
movieRefinedDf = spark.createDataFrame(movieRefinedRDD, movieSchema)


joinedDF1 = ratingsDF.join(usersDF,ratingsDF.UserId==usersDF.UserId).select(usersDF["Occupation"],ratingsDF["Rating"],ratingsDF["MovieId"])

joinedDF3 = joinedDF1.join(joinedDF2,joinedDF1.MovieId == joinedDF2.MovieId).select(joinedDF1["Occupation"],joinedDF1["Rating"],joinedDF2["Genre"])
totalRating = joinedDF3.groupBy().sum("Rating").collect()
resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).agg((sum(joinedDF3["Rating"]>3)/totalRating).alias(wa_rating)).collect()
print(resultDF)

现在我遇到错误了。

2019-08-06 22:24:20 INFO  BlockManagerInfo:54 - Removed broadcast_11_piece0 on 10.0.2.15:58903 in memory (size: 4.3 KB, free: 413.8 MB)
Traceback (most recent call last):
  File "/home/cloudera/workspace/MovielensAnalysis.py", line 59, in <module>
resultDF = joinedDF3.groupby([joinedDF3["Occupation"],joinedDF3["Genre"]]).agg((sum(joinedDF3["Rating"]>3)/totalRating).alias(wa_rating)).collect()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 116, in _
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o129.divide.: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[572536]]

0 个答案:

没有答案