Classifying data using Apache Spark

Date: 2016-08-01 14:27:33

Tags: scala apache-spark apache-spark-sql

I have the following dataset (ds):

|-- created_at: timestamp (nullable = true)
|-- channel_source_id: integer (nullable = true)
|-- movie_id: integer (nullable = true)

I want to classify each movie_id based on a few criteria (a DataFrame sketch of these aggregations follows the list):

  • The number of times it has been played;

    #Count occurrences by ID
    SELECT movie_id, COUNT(created_at) FROM logs GROUP BY movie_id
    
  • The range of times ( created_at ) at which the movie has been played;

    #Returns distinct movie_ids
    SELECT DISTINCT(movie_id) FROM logs

    #For each movie_id, retrieve the hours at which it has been played
    #With that result, I could apply a filter on the df to extract the intervals
    SELECT created_at FROM logs WHERE movie_id = ?
    
  • The number of distinct channel_source_id that have played the movie;

    #Count the number of distinct channels that have played the movie
    SELECT COUNT(DISTINCT(channel_source_id)) FROM logs WHERE movie_id = ? GROUP BY movie_id
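
As a rough starting point, the three queries above can also be expressed with the DataFrame API as a single aggregation per movie_id. This is only a sketch: it assumes df is the DataFrame loaded from the CSV (as in run() below), it interprets the "range" as the earliest and latest hour of day a movie was played, and the output column names (play_count, first_hour, last_hour, channel_count) are made up for illustration.

import org.apache.spark.sql.functions._

// One row per movie_id with the three metrics needed for the classification.
// Column names are illustrative, not required by anything in the question.
val stats = df.groupBy("movie_id").agg(
  count("created_at").as("play_count"),                   // how many times it was played
  min(hour(col("created_at"))).as("first_hour"),          // earliest hour of day it was played
  max(hour(col("created_at"))).as("last_hour"),           // latest hour of day it was played
  countDistinct("channel_source_id").as("channel_count")  // number of distinct channels
)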
    

I put together a simple table to help me with the classification:

Played 1 to 5 times, range between 00:00:00 - 03:59:59, 1 to 3 different channels >> Movie Type A
Played 6 to 10 times, range between 04:00:00 - 07:59:59, 4 to 5 different channels >> Movie Type B
etc
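
Assuming the stats DataFrame from the sketch above, one way to encode this table is a when/otherwise chain that assigns a type per row. This is a sketch of the idea, not a definitive implementation; only the two rows shown above are covered, and the remaining types would extend the chain in the same way.

import org.apache.spark.sql.functions._

// Sketch: map the aggregated metrics onto the lookup table above.
val classified = stats.withColumn("movie_type",
  when(col("play_count").between(1, 5) &&
       col("first_hour").between(0, 3) && col("last_hour").between(0, 3) &&
       col("channel_count").between(1, 3), "Movie Type A")
  .when(col("play_count").between(6, 10) &&
        col("first_hour").between(4, 7) && col("last_hour").between(4, 7) &&
        col("channel_count").between(4, 5), "Movie Type B")
  .otherwise("Unclassified"))  // fallback label for rows that match no rule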

I am using Spark to import the file, but I am lost as to how to perform the classification. Can anyone point me to where I should start?

import org.apache.spark.sql.SQLContext

def run() = {
  val sqlContext = new SQLContext(sc)
  // Load the CSV with spark-csv, using the header row and inferring the column types
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "inferSchema" -> "true"))
    .load("/home/plc/Desktop/movies.csv")
  df.printSchema()
}
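
If you would rather keep the SQL snippets above, one option (again just a sketch, with illustrative variable names) is to register the loaded DataFrame as a temporary table named logs inside run(), so those queries can be executed directly with sqlContext.sql:

// Register the DataFrame under the name "logs" and run the SQL from the question.
df.registerTempTable("logs")

val playCounts = sqlContext.sql(
  "SELECT movie_id, COUNT(created_at) AS play_count FROM logs GROUP BY movie_id")
playCounts.show()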

0 Answers:

No answers