Classifying data using Apache Spark

Date: 2016-08-01 14:27:33

Tags: scala apache-spark apache-spark-sql

I have the following dataset (ds):

|-- created_at: timestamp (nullable = true)
|-- channel_source_id: integer (nullable = true)
|-- movie_id: integer (nullable = true)

I want to classify each movie_id based on a few criteria (a DataFrame sketch of these aggregations follows the list):

  • The number of times it has been played;

    #Count occurrences by ID
    SELECT movie_id, COUNT(created_at) FROM logs GROUP BY movie_id
    
  • The range of times ( created_at ) at which the movie has been played;

    #Returns distinct movie_ids
    SELECT DISTINCT(movie_id) FROM logs

    #For each movie_id, retrieve the hours at which it has been played
    #With that result, I could apply a filter on the df to extract the intervals
    SELECT created_at FROM logs WHERE movie_id = ?
    
  • The number of distinct channel_source_id that have played the movie;

    #Count the number of distinct channels that have played the movie
    SELECT COUNT(DISTINCT(channel_source_id)) FROM logs WHERE movie_id = ? GROUP BY movie_id
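
As a rough starting point, the three queries above can also be expressed with the DataFrame API as a single aggregation per movie_id. This is only a sketch: it assumes df is the DataFrame loaded from the CSV (as in run() below), it interprets the "range" as the earliest and latest hour of day a movie was played, and the output column names (play_count, first_hour, last_hour, channel_count) are made up for illustration.

import org.apache.spark.sql.functions._

// One row per movie_id with the three metrics needed for the classification.
// Column names are illustrative, not required by anything in the question.
val stats = df.groupBy("movie_id").agg(
  count("created_at").as("play_count"),                   // how many times it was played
  min(hour(col("created_at"))).as("first_hour"),          // earliest hour of day it was played
  max(hour(col("created_at"))).as("last_hour"),           // latest hour of day it was played
  countDistinct("channel_source_id").as("channel_count")  // number of distinct channels
)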
    

I put together a simple table to help me with the classification:

Played 1 to 5 times, range between 00:00:00 - 03:59:59, 1 to 3 different channels >> Movie Type A
Played 6 to 10 times, range between 04:00:00 - 07:59:59, 4 to 5 different channels >> Movie Type B
etc
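
Assuming the stats DataFrame from the sketch above, one way to encode this table is a when/otherwise chain that assigns a type per row. This is a sketch of the idea, not a definitive implementation; only the two rows shown above are covered, and the remaining types would extend the chain in the same way.

import org.apache.spark.sql.functions._

// Sketch: map the aggregated metrics onto the lookup table above.
val classified = stats.withColumn("movie_type",
  when(col("play_count").between(1, 5) &&
       col("first_hour").between(0, 3) && col("last_hour").between(0, 3) &&
       col("channel_count").between(1, 3), "Movie Type A")
  .when(col("play_count").between(6, 10) &&
        col("first_hour").between(4, 7) && col("last_hour").between(4, 7) &&
        col("channel_count").between(4, 5), "Movie Type B")
  .otherwise("Unclassified"))  // fallback label for rows that match no rule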

I am using Spark to import the file, but I am lost as to how to perform the classification. Can anyone point me to where I should start?

import org.apache.spark.sql.SQLContext

def run() = {
  val sqlContext = new SQLContext(sc)
  // Load the CSV with spark-csv, using the header row and inferring the column types
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "inferSchema" -> "true"))
    .load("/home/plc/Desktop/movies.csv")
  df.printSchema()
}
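
If you would rather keep the SQL snippets above, one option (again just a sketch, with illustrative variable names) is to register the loaded DataFrame as a temporary table named logs inside run(), so those queries can be executed directly with sqlContext.sql:

// Register the DataFrame under the name "logs" and run the SQL from the question.
df.registerTempTable("logs")

val playCounts = sqlContext.sql(
  "SELECT movie_id, COUNT(created_at) AS play_count FROM logs GROUP BY movie_id")
playCounts.show()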

0 Answers:

No answers