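I am using Apache Spark to extract the top 10 movies and their genres from the MovieLens dataset. I am able to extract the ratings, but each movie's genres come in this format: Action|War|Crime. I split the string on | and then try to insert the pieces into an ArrayBuffer or ListBuffer. This works fine inside the loop, but when I try to read the values outside the loop I get an empty result.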
import scala.collection.mutable.ListBuffer

// read the ratings file
val rating_data = spark.read.format("csv").option("header", "true").load("data/ratings.csv")
// read the movie name file
val movie_name = spark.read.format("csv").option("header", "true").load("data/movies.csv")
// keep the ratings equal to 5, count them per movie id, and take the 10 movies with the highest count of 5s
val movie_id = rating_data.filter(rating_data("rating").===(5)).groupBy("movieId").count().orderBy(org.apache.spark.sql.functions.col("count").desc).take(10)
var genre_list = Array[String]()
val mylist = ListBuffer[String]()
movie_id.foreach(a => {
  // match the movie id from the ratings file against the movies file
  val mv = movie_name.filter(movie_name("movieId").===(a(0)))
  mv.foreach(b => {
    genre_list = b(2).toString.split('|').map(_.trim)
    genre_list.foreach(g => {
      mylist += g
      // mylist is not empty here, it has elements
    })
  })
})
println(mylist) // empty
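A likely explanation for the empty result: `mv.foreach` is a Spark action, so its closure is serialized and run by the executors, and `mylist += g` mutates a copy of the buffer there instead of the one that `println(mylist)` reads on the driver. The following is a minimal sketch of one way around this, not a definitive implementation: bring the ten genre strings back to the driver with `collect()` and split them there. It assumes the third column of `movies.csv` is named `genres` (as in the standard MovieLens header) and that a local master is acceptable. The names `TopGenres` and `splitGenres` are hypothetical, introduced only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

object TopGenres {
  // Pure helper: "Action|War|Crime" -> Seq("Action", "War", "Crime").
  // split(Char) treats '|' literally; split(String) would interpret it as a regex.
  def splitGenres(raw: String): Seq[String] =
    if (raw == null) Seq.empty
    else raw.split('|').map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: running locally; adjust the master for a real cluster.
    val spark = SparkSession.builder().appName("TopGenres").master("local[*]").getOrCreate()

    val ratings = spark.read.option("header", "true").csv("data/ratings.csv")
    val movies  = spark.read.option("header", "true").csv("data/movies.csv")

    // The 10 movies with the most ratings equal to 5.
    val top10 = ratings
      .filter(col("rating").cast("double") === 5.0)
      .groupBy("movieId")
      .count()
      .orderBy(desc("count"))
      .limit(10)

    // One join instead of one filter per movie, then collect() so the
    // genre strings live on the driver (assumes a column named "genres").
    val genreStrings: Array[String] = top10
      .join(movies, "movieId")
      .select(col("genres"))
      .collect()
      .map(_.getString(0))

    // Plain Scala from here on, so nothing is lost to executor-side copies.
    val mylist: Seq[String] = genreStrings.toSeq.flatMap(splitGenres)
    println(mylist.mkString(", "))

    spark.stop()
  }
}
```

Collecting is reasonable here because only ten rows reach the driver. The join does not promise to keep the rows in rank order, and `mylist.distinct` would remove repeated genres if each genre should appear only once.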