How do I calculate a percentage in a Spark SQL DataFrame?

Asked: 2018-07-04 20:54:21

Tags: scala apache-spark apache-spark-sql

I have an Aadhaar card dataset. I need to find the top 3 states with the highest percentage of Aadhaar cards generated by males. The dataset contains the following data:

Date,Registrar,Private_Agency,State,District,Sub_District,PinCode,Gender,Age,AadharGenerated,EnrolmentRejected,MobileNumProvided
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Ferrargunj,744105,F,91,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,4,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,5,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,8,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,11,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,12,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,17,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,28,2,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,30,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,31,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,34,2,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,39,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,44,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,29,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,38,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,45,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,64,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,66,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,75,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,9,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,44,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,54,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,59,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,M,27,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,M,29,1,0,0
20150420,Bank Of India,Frontech Systems Pvt Ltd,Andhra Pradesh,Krishna,Kanchikacherla,521185,M,40,1,0,0
20150420,CSC e-Governance Services India Limited,BASIX,Andhra Pradesh,Srikakulam,Veeraghattam,532460,F,24,1,0,0

I tried the following, but it gives an error:

sqlC.sql("SELECT STATE,
          (MALEADHAR/ADHAARDATA*100) AS PERCENTMALE 
         FROM 
                (SELECT STATE,SUM(ADHAARDATA) AS MALEADHAR 
                 FROM 
                       (SELECT State, SUM(AadharGenerated) AS ADHAARDATA
                         FROM data Group By State)
                         where Gender==='M') AS MALEADHAR 
                          GROUP BY STATE") 
SELECT STATE, SUM(AadharGenerated) AS MALEADAHAR FROM data where Gender='M' GROUP BY STATE

Please correct the query.
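For reference, a corrected form of the attempted query might look like the sketch below. It assumes the dataset is registered as a temp view named `data` (as in the attempt) and that "percentage" means male-generated cards as a share of all generated cards per state; the subquery aliases are illustrative:

```scala
// Sketch of a corrected query: per-state male percentage, top 3 states.
// Assumes a SQLContext `sqlC` and a temp view "data" with the question's columns.
val top3 = sqlC.sql("""
  SELECT t.State,
         m.MALEADHAR / t.ADHAARDATA * 100 AS PERCENTMALE
  FROM   (SELECT State, SUM(AadharGenerated) AS ADHAARDATA
          FROM data GROUP BY State) t
  JOIN   (SELECT State, SUM(AadharGenerated) AS MALEADHAR
          FROM data WHERE Gender = 'M' GROUP BY State) m
    ON   t.State = m.State
  ORDER BY PERCENTMALE DESC
  LIMIT 3
""")
top3.show()
```

Note the two fixes versus the attempt: the `Gender = 'M'` filter must sit inside its own aggregation (it cannot be applied after grouping away the `Gender` column), and the male and overall sums are combined with a join rather than nested incorrectly.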

Thanks, Anchi

6 answers:

Answer 0 (score: 1)

Instead of using an SQL query, you can simply use Spark's built-in functions. To use them, first create a DataFrame from the data:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.functions._

// Schema: one field per CSV column
val schema = new StructType(
  Array(
    StructField("date", IntegerType, true),
    StructField("registrar", StringType, true),
    StructField("private_agency", StringType, true),
    StructField("state", StringType, true),
    StructField("district", StringType, true),
    StructField("sub_district", StringType, true),
    StructField("pincode", IntegerType, true),
    StructField("gender", StringType, true),
    StructField("age", IntegerType, true),
    StructField("aadhar_generated", IntegerType, true),
    StructField("rejected", IntegerType, true),
    StructField("mobile_number", IntegerType, true)
  )
)

// Loading data (the file shown in the question has a header row)
val data = spark.read.option("header", "true").schema(schema).csv("aadhaar_data.csv")

// Query: Aadhaar cards generated by males, per state, descending
data.groupBy("state", "gender")
  .agg(sum("aadhar_generated"))
  .filter(col("gender") === "M")
  .orderBy(desc("sum(aadhar_generated)"))
  .show
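The snippet above gives per-state male totals, but not the percentage the question asks for. One hedged way to extend it (column names follow the schema above; `percent_male` is an illustrative name) is to join male totals against overall totals:

```scala
// Sketch: extend the built-in-functions approach to a percentage, then take the top 3.
// Assumes `data` is the DataFrame built with the schema above.
import org.apache.spark.sql.functions._

val totals = data.groupBy("state").agg(sum("aadhar_generated").alias("total"))
val males = data.filter(col("gender") === "M")
  .groupBy("state").agg(sum("aadhar_generated").alias("male_total"))

males.join(totals, "state")
  .withColumn("percent_male", col("male_total") / col("total") * 100)
  .orderBy(desc("percent_male"))
  .limit(3)
  .show()
```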

Answer 1 (score: 0)

Here is a simple, relevant approach I applied after some research. Other methods could also be used, but this one is straightforward, and you can adapt it accordingly with filtering, grouping, and so on.

import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
   ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
   ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
    )).toDF("c1", "c2", "Val1", "Val2")

val total = df.select(col("Val1")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
// Or val total2: Long = df.agg(sum("Val1").cast("long")).first.getLong(0)

val df2 = df.groupBy($"c1").sum("Val1")
val df3 = df2.withColumn("perc_total", ($"sum(val1)" / total))

df3.show

This gives:

+---+---------+----------+
| c1|sum(Val1)|perc_total|
+---+---------+----------+
|  E|       30|       0.3|
|  B|       10|       0.1|
|  D|       50|       0.5|
|  C|        1|      0.01|
|  A|        9|      0.09|
+---+---------+----------+

Answer 2 (score: 0)

Following on from the above, I remembered a better way!

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
   ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
   ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
    )).toDF("c1", "c2", "Val1", "Val2")

val df2 = df
  .groupBy("c1")
  .agg(sum("Val1").alias("sum"))
  .withColumn("fraction", col("sum") /  sum("sum").over())

df2.show

Answer 3 (score: 0)

Additionally, the same thing in SQL; just add any extra filtering and so on as needed. This is where it keeps things cheap.

df.createOrReplaceTempView("SOQTV")

spark.sql(" SELECT c1, SUM(Val1) / (SELECT SUM(Val1) FROM SOQTV) as Perc_Total_for_SO_Question  " +
      " FROM SOQTV " + 
      " GROUP BY c1 ").show()

This gives the same answer.

Answer 4 (score: 0)

A simpler approach: nested SQL could of course also be used, but here is a more step-by-step approach using both SQL and DataFrames.

Note that the absence of a given combination (of c1, in this case) implies 0%; this could be handled separately.

You can now adapt this in the same way, since I have used similar variable names. You can sort, drop, and rename columns.

import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", "X", 2, 100, "M", "Y"), ("F", "X", 7, 100, "M", "Y"), ("B", "X", 10, 100, "F", "Y"),
  ("C", "X", 1, 100, "F", "N"), ("D", "X", 50, 100, "M", "N"), ("E", "X", 30, 100, "M", "Y"),
  ("D", "X", 1, 100, "F", "N"), ("A", "X", 50, 100, "M", "N"), ("A", "X", 30, 100, "M", "Y"),
  ("D", "X", 1, 100, "M", "N"), ("X", "X", 50, 100, "M", "Y"), ("A", "X", 30, 100, "F", "Y"),
  ("K", "X", 1, 100, "M", "N"), ("K", "X", 50, 100, "M", "Y")
)).toDF("c1", "c2", "Val1", "Val2", "male_Female_Flag", "has_This")

df.createOrReplaceTempView("SOQTV")

spark.sql(
   "select * " +
   "from SOQTV " +
   "where 1 = 1 order by 1,5,6 ").show()

val dfA = spark.sql(" SELECT c1, count(*) " +
      " FROM SOQTV " + 
      " WHERE male_Female_Flag = 'M' " +
      " GROUP BY c1 ")

 val dfB = spark.sql(" SELECT c1, count(*) " +
      " FROM SOQTV " + 
      " WHERE male_Female_Flag = 'M' AND has_This = 'Y' " +
      " GROUP BY c1 ")

 val dfC = dfB.join(dfA, dfA("c1") === dfB("c1"), "inner")
 val colNames = Seq("c1", "Male_Has_Something", "c1_Again", "Male")
 val dfD = dfC.toDF(colNames: _*)

 dfC.show
 dfD.show
 dfD.withColumn("Percentage", (col("Male_Has_Something") / col("Male")) * 100 ).show

This gives:

 +---+------------------+--------+----+-----------------+
 | c1|Male_Has_Something|c1_Again|Male|       Percentage|
 +---+------------------+--------+----+-----------------+
 |  K|                 1|       K|   2|             50.0|
 |  F|                 1|       F|   1|            100.0|
 |  E|                 1|       E|   1|            100.0|
 |  A|                 2|       A|   3|66.66666666666666|
 |  X|                 1|       X|   1|            100.0|
 +---+------------------+--------+----+-----------------+
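As noted above, a `c1` value with no matching rows in the filtered set simply disappears from the inner join (e.g. `D` here, which has males but none with `has_This = 'Y'`). A hedged sketch of reporting such groups as 0% instead, using a `LEFT JOIN` over the same `SOQTV` view with `COALESCE` to replace the missing count:

```scala
// Sketch: LEFT JOIN version that keeps c1 groups with no qualifying rows and shows them as 0%.
// Assumes the temp view "SOQTV" registered above; subquery aliases are illustrative.
val dfPct = spark.sql("""
  SELECT a.c1,
         COALESCE(b.cnt, 0) / a.cnt * 100 AS Percentage
  FROM (SELECT c1, COUNT(*) AS cnt FROM SOQTV
        WHERE male_Female_Flag = 'M' GROUP BY c1) a
  LEFT JOIN (SELECT c1, COUNT(*) AS cnt FROM SOQTV
             WHERE male_Female_Flag = 'M' AND has_This = 'Y' GROUP BY c1) b
    ON a.c1 = b.c1
""")
dfPct.show
```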

Answer 5 (score: 0)

Another version of the accepted answer, using window functions:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
   ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
   ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
    )).toDF("c1", "c2", "Val1", "Val2")
df.show
println("using group by agg")
val df2 = df
  .groupBy("c1")
  .agg(sum("Val1").alias("sum"))
  .withColumn("fraction", col("sum") /  sum("sum").over())
df2.show
println("using window function sum")
 df.createOrReplaceTempView("test")
 spark.sql("select  distinct c1, concat( sum(Val1) over(Partition by c1)/(sum(Val1) over()) * 100 , '%') as percent from test ").show

 

Result:

+---+---+----+----+
| c1| c2|Val1|Val2|
+---+---+----+----+
|  A|  X|   2| 100|
|  A|  X|   7| 100|
|  B|  X|  10| 100|
|  C|  X|   1| 100|
|  D|  X|  50| 100|
|  E|  X|  30| 100|
+---+---+----+----+

using group by agg
+---+---+--------+
| c1|sum|fraction|
+---+---+--------+
|  A|  9|    0.09|
|  B| 10|     0.1|
|  C|  1|    0.01|
|  D| 50|     0.5|
|  E| 30|     0.3|
+---+---+--------+

using window function sum
+---+-------+
| c1|percent|
+---+-------+
|  A|   9.0%|
|  B|  10.0%|
|  C|   1.0%|
|  D|  50.0%|
|  E|  30.0%|
+---+-------+
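Applied back to the original question, the same window-function pattern could be sketched like this (assuming a SparkSession `spark`, a temp view `data` with the question's column names, and `percent_male` as an illustrative alias):

```scala
// Sketch: per-state male percentage via window sums, top 3 states.
spark.sql("""
  SELECT DISTINCT State,
         SUM(CASE WHEN Gender = 'M' THEN AadharGenerated ELSE 0 END)
           OVER (PARTITION BY State) * 100.0
         / SUM(AadharGenerated) OVER (PARTITION BY State) AS percent_male
  FROM data
  ORDER BY percent_male DESC
  LIMIT 3
""").show
```

The `CASE WHEN` inside the windowed sum avoids the trap in the original attempt: the male total and the overall total are computed over the same partition, so no join is needed.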