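Outer join on 2 DataFrames: Spark Scala SQLContext

Time: 2016-07-01 15:51:24

Tags: scala join apache-spark apache-spark-sql spark-dataframe

I get an error when doing an outer join on two DataFrames. I am trying to compute a percentage per place (the "percentile" table in the code).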

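    import org.apache.spark.sql.functions.{count, explode}

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._  // enables the $"..." column syntax

    val df = sqlContext.jsonFile("temp.txt")
    // One row per element of the visited array
    val res = df.withColumn("visited", explode($"visited"))

    // Number of visits per customer and place
    val result1 = res
      .groupBy($"customerId", $"visited.placeName")
      .agg(count("*").alias("total"))

    // Number of visits rated below 4 per customer and place
    val result2 = res
      .filter($"visited.rating" < 4)
      .groupBy($"customerId", $"visited.placeName")
      .agg(count("*").alias("top"))

    result1.show()
    result2.show()

    // This is the line that fails to compile on Spark 1.5
    val temp = result1.join(result2, List("placeName", "customerId"), "outer")
    temp.registerTempTable("percentile")

    sqlContext.sql("SELECT placeName, SUM(top) / SUM(total) * 100 AS Percentage FROM percentile GROUP BY placeName")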

The error I get:

    <console>:43: error: type mismatch; found : List[String] required: org.apache.spark.sql.Column

I am using Spark 1.5.

Can anyone tell me what I am doing wrong here? I also tried this:

    val temp = result1.join(result2, Seq("placeName", "customerId"), "outer")

but I still get:

found : Seq[String] required: org.apache.spark.sql.Column
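
A possible explanation, offered with some caution: as far as I can tell from the API docs, the join overload that takes a Seq[String] of column names together with a join type was only added in Spark 1.6. In 1.5 the only three-argument join expects a Column expression as its second argument, which would match the type mismatch above. The sketch below is one way to express the same outer join on 1.5. The helper name outerJoinCounts and the aliases "l" and "r" are made up for illustration; only join, as, col and coalesce are Spark calls.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

// Hypothetical helper (not a Spark API): full outer join of the two
// aggregates on (customerId, placeName) via the Column-based join overload.
def outerJoinCounts(totals: DataFrame, tops: DataFrame): DataFrame = {
  // Alias both sides so the key columns can be told apart after the join.
  val l = totals.as("l")
  val r = tops.as("r")

  val sameKey =
    col("l.customerId") === col("r.customerId") &&
      col("l.placeName") === col("r.placeName")

  l.join(r, sameKey, "outer")
    .select(
      // In an outer join a key can be null on either side, so take whichever is set.
      coalesce(col("l.customerId"), col("r.customerId")).as("customerId"),
      coalesce(col("l.placeName"), col("r.placeName")).as("placeName"),
      col("l.total").as("total"),
      col("r.top").as("top"))
}

// Usage with the DataFrames from the question:
// val temp = outerJoinCounts(result1, result2)
// temp.registerTempTable("percentile")
```

Rows that exist on only one side keep a null in the missing count, so any later arithmetic on top and total has to tolerate nulls.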

With this instead:

    val res = df.withColumn("visited", explode($"visited"))

    val file1 = res.groupBy($"visited.placeName").agg(count("*").alias("total"))

    val file2 = res
      .filter($"visited.rating" > 3)
      .groupBy($"visited.placeName")
      .agg(count("*").alias("top"))

    file1.show()
    file2.show()

    file1.join(file2)

I get duplicate columns.
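
The duplicates are probably expected here: join with no condition is a Cartesian product, and both inputs carry their own placeName column. A minimal sketch of an alternative, assuming an inner join is acceptable, uses the single-column "using" form of join, which is documented to keep the join column only once. The helper name joinOnPlace is hypothetical.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): inner join on the shared
// placeName column using the join(right, usingColumn) overload.
def joinOnPlace(totals: DataFrame, tops: DataFrame): DataFrame =
  totals.join(tops, "placeName")

// Usage with the DataFrames from the question:
// val joined = joinOnPlace(file1, file2)
// joined.show()
```

Because this is an inner join, a place that has no row in file2 (no rating above 3) drops out of the result, so it is not a replacement for the outer join.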

My sample data:

 {
        "country": "France",
        "customerId": "France001",
        "visited": [
            {
                "placeName": "US",
                "rating": "2",
                "famousRest": "N/A",
                "placeId": "AVBS34"

            },
              {
                "placeName": "US",
                "rating": "3",
                "famousRest": "SeriousPie",
                "placeId": "VBSs34"

            },
              {
                "placeName": "Canada",
                "rating": "3",
                "famousRest": "TimHortons",
                "placeId": "AVBv4d"

            }        
    ]
}

US top = 1 count = 3
Canada top = 1 count = 3


{
        "country": "Canada",
        "customerId": "Canada012",
        "visited": [
            {
                "placeName": "UK",
                "rating": "3",
                "famousRest": "N/A",
                "placeId": "XSdce2"

            }
    ]
}
UK top = 1 count = 1


{
        "country": "France",
        "customerId": "France001",
        "visited": [
            {
                "placeName": "US",
                "rating": "4.3",
                "famousRest": "N/A",
                "placeId": "AVBS34"

            },
              {
                "placeName": "US",
                "rating": "3.3",
                "famousRest": "SeriousPie",
                "placeId": "VBSs34"

            },
              {
                "placeName": "Canada",
                "rating": "4.3",
                "famousRest": "TimHortons",
                "placeId": "AVBv4d"

            }        
    ]
}

US top = 2 count = 3
Canada top = 1 count = 3

So in the end I need something like this:

PlaceName  Percentage
US         57.14            (1+1+2)/(3+1+3) *100
Canada     33.33            (1+1)/(3+3) *100
UK         100               1*100
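
The arithmetic in the table can be checked without Spark. The snippet below is only a sketch of the formula (sum of top counts divided by sum of total counts, times 100), fed with the exact terms written in the Percentage column. The helper names percentage and round2 are made up for illustration.

```scala
// Hypothetical helper: share of "top" visits among all visits, in percent.
def percentage(tops: Seq[Int], totals: Seq[Int]): Double =
  100.0 * tops.sum / totals.sum

// Round to two decimals, as in the table.
def round2(x: Double): Double = math.round(x * 100) / 100.0

val us     = round2(percentage(Seq(1, 1, 2), Seq(3, 1, 3))) // 57.14
val canada = round2(percentage(Seq(1, 1), Seq(3, 3)))       // 33.33
val uk     = round2(percentage(Seq(1), Seq(1)))             // 100.0
```

In SQL terms this corresponds to SUM(top) / SUM(total) * 100 grouped by placeName, rather than top/total computed row by row.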

Schema:

root
|-- country: string (nullable = true)
|-- customerId: string (nullable = true)
|-- visited: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |   |-- placeId: string (nullable = true)
|    |   |-- placeName: string (nullable = true) 
|    |   |-- famousRest: string (nullable = true)
|    |   |-- rating: string (nullable = true)

0 answers:

No answers yet