I have an Aadhaar card dataset. I need to find the top 3 states with the highest percentage of Aadhaar cards generated by males. The dataset contains the following data:
Date,Registrar,Private_Agency,State,District,Sub_District,PinCode,Gender,Age,AadharGenerated,EnrolmentRejected,MobileNumProvided
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Ferrargunj,744105,F,91,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,4,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,5,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,8,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,11,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,12,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,17,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,28,2,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,30,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,31,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,34,2,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,39,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,F,44,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,29,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,38,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,45,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,64,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,66,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744101,M,75,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,9,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,44,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,54,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,F,59,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,M,27,1,0,0
20150420,Civil Supplies - A&N Islands,India Computer Technology,Andaman and Nicobar Islands,South Andaman,Port Blair,744103,M,29,1,0,0
20150420,Bank Of India,Frontech Systems Pvt Ltd,Andhra Pradesh,Krishna,Kanchikacherla,521185,M,40,1,0,0
20150420,CSC e-Governance Services India Limited,BASIX,Andhra Pradesh,Srikakulam,Veeraghattam,532460,F,24,1,0,0
I tried the following, but I get an error:
sqlC.sql("SELECT STATE,
(MALEADHAR/ADHAARDATA*100) AS PERCENTMALE
FROM
(SELECT STATE,SUM(ADHAARDATA) AS MALEADHAR
FROM
(SELECT State, SUM(AadharGenerated) AS ADHAARDATA
FROM data Group By State)
where Gender==='M') AS MALEADHAR
GROUP BY STATE")
SELECT STATE, SUM(AadharGenerated) AS MALEADAHAR FROM data where Gender='M' GROUP BY STATE")
Please correct the query.
Thanks, Anji
Answer 0 (score: 1)
Instead of using a SQL query, you can simply use Spark's built-in functions. To use them, you first need to create a DataFrame from your data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Schema matching the 12 columns of the dataset
val schema = new StructType(
  Array(
    StructField("date", IntegerType, true),
    StructField("registrar", StringType, true),
    StructField("private_agency", StringType, true),
    StructField("state", StringType, true),
    StructField("district", StringType, true),
    StructField("sub_district", StringType, true),
    StructField("pincode", IntegerType, true),
    StructField("gender", StringType, true),
    StructField("age", IntegerType, true),
    StructField("aadhar_generated", IntegerType, true),
    StructField("rejected", IntegerType, true),
    StructField("mobile_number", IntegerType, true)
  )
)

// Loading the data (the sample above starts with a header row)
val data = spark.read.option("header", "true").schema(schema).csv("aadhaar_data.csv")

// Query: total cards generated by males per state, highest first
data
  .groupBy("state", "gender")
  .agg(sum("aadhar_generated"))
  .filter(col("gender") === "M")
  .orderBy(desc("sum(aadhar_generated)"))
  .show()
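To get all the way to the asker's top 3 states, a minimal sketch of a possible continuation is shown below. The when/otherwise conditional sum and the column names male_generated, total_generated, and percent_male are my additions, not part of the original answer:

// Sketch, assuming the "data" DataFrame defined above:
// per state, sum the male-generated cards and the overall total, then divide
val byState = data
  .groupBy("state")
  .agg(
    sum(when(col("gender") === "M", col("aadhar_generated")).otherwise(0)).alias("male_generated"),
    sum("aadhar_generated").alias("total_generated")
  )
  .withColumn("percent_male", col("male_generated") / col("total_generated") * 100)

// Highest male percentage first; show(3) keeps the top 3 states
byState.orderBy(desc("percent_male")).show(3)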
Answer 1 (score: 0)
After looking into this, the following simple approach applies. Other methods are possible as well, but this is a straightforward one that you can adapt to your case with filtering, grouping, and so on.
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
  ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")

// Grand total of Val1 across the whole DataFrame
val total = df.select(col("Val1")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
// Or: val total2: Long = df.agg(sum("Val1").cast("long")).first.getLong(0)

// Per-group sums, then each group's share of the grand total
val df2 = df.groupBy($"c1").sum("Val1")
val df3 = df2.withColumn("perc_total", $"sum(Val1)" / total)
df3.show
This gives:
+---+---------+----------+
| c1|sum(Val1)|perc_total|
+---+---------+----------+
| E| 30| 0.3|
| B| 10| 0.1|
| D| 50| 0.5|
| C| 1| 0.01|
| A| 9| 0.09|
+---+---------+----------+
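The same pattern can be adapted to the Aadhaar question. The following is a sketch, not the original answer's code: it assumes the data DataFrame and lower-case column names from answer 0, and replaces the grand total with a per-state total:

// Sketch: male cards per state joined against that state's overall total
val maleByState = data.filter(col("gender") === "M")
  .groupBy("state").agg(sum("aadhar_generated").alias("male_sum"))
val allByState = data
  .groupBy("state").agg(sum("aadhar_generated").alias("total_sum"))
maleByState.join(allByState, "state")
  .withColumn("perc_total", col("male_sum") / col("total_sum") * 100)
  .orderBy(desc("perc_total"))
  .show(3)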
Answer 2 (score: 0)
Moving on, I remembered a better way!
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
  ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")

val df2 = df
  .groupBy("c1")
  .agg(sum("Val1").alias("sum"))
  // an empty window spec makes sum("sum") range over the entire result
  .withColumn("fraction", col("sum") / sum("sum").over())
df2.show
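Applied to the Aadhaar data, this pattern would give each state's share of all male-generated cards. A sketch, assuming the data DataFrame from answer 0:

// Sketch: each state's share of all male-generated cards
data.filter(col("gender") === "M")
  .groupBy("state")
  .agg(sum("aadhar_generated").alias("male_sum"))
  .withColumn("fraction", col("male_sum") / sum("male_sum").over())
  .orderBy(desc("fraction"))
  .show(3)

Note that this is the share across states, not the within-state male percentage computed in answer 0's continuation.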
Answer 3 (score: 0)
In addition, the same can be done in SQL; just add the extra filtering and so on as needed. This is where it keeps things simple and cheap.
df.createOrReplaceTempView("SOQTV")
spark.sql(" SELECT c1, SUM(Val1) / (SELECT SUM(Val1) FROM SOQTV) as Perc_Total_for_SO_Question " +
" FROM SOQTV " +
" GROUP BY c1 ").show()
This gives the same answer.
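In the same style, a corrected version of the asker's query might look like the sketch below; the temp view name aadhaar and the lower-case column names (taken from answer 0) are assumptions:

// Sketch, assuming the "data" DataFrame from answer 0
data.createOrReplaceTempView("aadhaar")
spark.sql(" SELECT state, " +
          "        SUM(CASE WHEN gender = 'M' THEN aadhar_generated ELSE 0 END) * 100.0 " +
          "          / SUM(aadhar_generated) AS percent_male " +
          " FROM aadhaar " +
          " GROUP BY state " +
          " ORDER BY percent_male DESC " +
          " LIMIT 3 ").show()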
Answer 4 (score: 0)
A simpler approach: nested SQL could of course be used as well, but this is a more step-by-step approach using both SQL and DataFrames.
Note that a combination missing for a given c1 value means 0%; that could be handled separately if needed.
You can adapt this in the same way, since I have used similar variable names; you can sort, drop, and rename columns as required.
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", "X", 2, 100, "M", "Y"), ("F", "X", 7, 100, "M", "Y"), ("B", "X", 10, 100, "F", "Y"),
  ("C", "X", 1, 100, "F", "N"), ("D", "X", 50, 100, "M", "N"), ("E", "X", 30, 100, "M", "Y"),
  ("D", "X", 1, 100, "F", "N"), ("A", "X", 50, 100, "M", "N"), ("A", "X", 30, 100, "M", "Y"),
  ("D", "X", 1, 100, "M", "N"), ("X", "X", 50, 100, "M", "Y"), ("A", "X", 30, 100, "F", "Y"),
  ("K", "X", 1, 100, "M", "N"), ("K", "X", 50, 100, "M", "Y")
)).toDF("c1", "c2", "Val1", "Val2", "male_Female_Flag", "has_This")

df.createOrReplaceTempView("SOQTV")

// Inspect the data first
spark.sql(
  "select * " +
  "from SOQTV " +
  "where 1 = 1 order by 1,5,6 ").show()

// Denominator: count of males per c1
val dfA = spark.sql(" SELECT c1, count(*) " +
                    " FROM SOQTV " +
                    " WHERE male_Female_Flag = 'M' " +
                    " GROUP BY c1 ")

// Numerator: count of males with has_This = 'Y' per c1
val dfB = spark.sql(" SELECT c1, count(*) " +
                    " FROM SOQTV " +
                    " WHERE male_Female_Flag = 'M' AND has_This = 'Y' " +
                    " GROUP BY c1 ")

// Join numerator and denominator, then rename the columns
val dfC = dfB.join(dfA, dfA("c1") === dfB("c1"), "inner")
val colNames = Seq("c1", "Male_Has_Something", "c1_Again", "Male")
val dfD = dfC.toDF(colNames: _*)
dfC.show
dfD.show
dfD.withColumn("Percentage", (col("Male_Has_Something") / col("Male")) * 100).show
This gives:
+---+------------------+--------+----+-----------------+
| c1|Male_Has_Something|c1_Again|Male| Percentage|
+---+------------------+--------+----+-----------------+
| K| 1| K| 2| 50.0|
| F| 1| F| 1| 100.0|
| E| 1| E| 1| 100.0|
| A| 2| A| 3|66.66666666666666|
| X| 1| X| 1| 100.0|
+---+------------------+--------+----+-----------------+
Answer 5 (score: 0)
Another version of the accepted answer, using a window function:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")
df.show
println("using group by agg")
val df2 = df
.groupBy("c1")
.agg(sum("Val1").alias("sum"))
.withColumn("fraction", col("sum") / sum("sum").over())
df2.show
println("using window function sum")
df.createOrReplaceTempView("test")
spark.sql("select distinct c1, concat( sum(Val1) over(Partition by c1)/(sum(Val1) over()) * 100 , '%') as percent from test ").show
Result:
+---+---+----+----+
| c1| c2|Val1|Val2|
+---+---+----+----+
| A| X| 2| 100|
| A| X| 7| 100|
| B| X| 10| 100|
| C| X| 1| 100|
| D| X| 50| 100|
| E| X| 30| 100|
+---+---+----+----+
using group by agg
+---+---+--------+
| c1|sum|fraction|
+---+---+--------+
| A| 9| 0.09|
| B| 10| 0.1|
| C| 1| 0.01|
| D| 50| 0.5|
| E| 30| 0.3|
+---+---+--------+
using window function sum
+---+-------+
| c1|percent|
+---+-------+
| A| 9.0%|
| B| 10.0%|
| C| 1.0%|
| D| 50.0%|
| E| 30.0%|
+---+-------+
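For the original question, the same window-function SQL could be written as the following sketch (again assuming the data DataFrame and lower-case column names from answer 0):

// Sketch: within-state male percentage via window sums, top 3 states
data.createOrReplaceTempView("aadhaar")
spark.sql("select distinct state, " +
          "  sum(case when gender = 'M' then aadhar_generated else 0 end) " +
          "    over (partition by state) * 100.0 " +
          "  / sum(aadhar_generated) over (partition by state) as percent_male " +
          "from aadhaar " +
          "order by percent_male desc ").show(3)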