I have the following ds:
|-- created_at: timestamp (nullable = true)
|-- channel_source_id: integer (nullable = true)
|-- movie_id: integer (nullable = true)
I want to classify each movie_id based on a few conditions:
How many times it has been played;
# Count occurrences by movie_id
SELECT movie_id, COUNT(created_at) FROM logs GROUP BY movie_id
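In the DataFrame API, I believe the same count would look something like this (an untested sketch; df is the DataFrame loaded in the code at the end of the question, and play_count is a name I made up):

import org.apache.spark.sql.functions._

// Play count per movie: DataFrame equivalent of the GROUP BY query above
val playCounts = df.groupBy("movie_id")
  .agg(count("created_at").as("play_count"))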
What the range of playback times (created_at) for each movie is;
# Returns the distinct movie_id values
SELECT DISTINCT movie_id FROM logs
# For each movie_id, retrieve the hours at which it has been played
# With that result, I can apply a filter on the df to extract the intervals
SELECT created_at FROM logs WHERE movie_id = ?
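Since I only care about hour-of-day intervals, maybe I could derive the hour up front with the DataFrame API instead (again an untested sketch; play_hour is a name I made up):

import org.apache.spark.sql.functions._

// Add an hour-of-day column so interval filters become simple numeric comparisons
val withHour = df.withColumn("play_hour", hour(col("created_at")))
// e.g. hours for one movie (42 is just a placeholder id)
withHour.filter(col("movie_id") === 42).select("play_hour").show()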
How many distinct channel_source_id values have played the movie;
# Count the number of distinct channels that have played the movie
SELECT COUNT(DISTINCT channel_source_id) FROM logs WHERE movie_id = ?
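Or, to get the distinct channel count for every movie in one pass (a sketch, untested):

import org.apache.spark.sql.functions._

// Distinct channel count per movie, all movies at once
val channelCounts = df.groupBy("movie_id")
  .agg(countDistinct("channel_source_id").as("channel_count"))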
I put together a simple table to help me with the classification:
Played 1 to 5 times, range between 00:00:00 - 03:59:59, 1 to 3 different channels >> Movie Type A
Played 6 to 10 times, range between 04:00:00 - 07:59:59, 4 to 5 different channels >> Movie Type B
etc
I am using Spark to import the file, but I am lost on how to implement the classification. Can anyone help me figure out where to start?
import org.apache.spark.sql.SQLContext

def run() = {
  val sqlContext = new SQLContext(sc)

  // Read the CSV with a header row, letting Spark infer column types
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "inferSchema" -> "true"))
    .load("/home/plc/Desktop/movies.csv")

  // Verify the inferred schema
  df.printSchema()
}
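My rough idea is to compute all three aggregates per movie_id and then encode the table rows as when/otherwise rules, along the lines of this untested sketch (thresholds taken from the first two rows of my table; column names like play_count and min_hour are mine), but I am not sure this is the right approach:

import org.apache.spark.sql.functions._

// Step 1: one row per movie with all three metrics
val stats = df
  .withColumn("play_hour", hour(col("created_at")))
  .groupBy("movie_id")
  .agg(
    count("created_at").as("play_count"),
    min("play_hour").as("min_hour"),
    max("play_hour").as("max_hour"),
    countDistinct("channel_source_id").as("channel_count")
  )

// Step 2: apply the classification table as chained conditions
val classified = stats.withColumn("movie_type",
  when(col("play_count").between(1, 5)
    && col("min_hour").between(0, 3) && col("max_hour").between(0, 3)
    && col("channel_count").between(1, 3), "Movie Type A")
  .when(col("play_count").between(6, 10)
    && col("min_hour").between(4, 7) && col("max_hour").between(4, 7)
    && col("channel_count").between(4, 5), "Movie Type B")
  .otherwise("Unclassified"))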