I have the following data:
53,Male,11th,<=50K
53,Male,11th,<=50K
53,Male,11th,<=50K
20,Female,Masters,>50K
20,Female,Masters,>50K
33,Male,Bachelors,<=50K
Next, I need to group the above data using select and group by, so that it becomes:
53,Male,11th,<=50K,3
20,Female,Masters,>50K,2
33,Male,Bachelors,<=50K,1
where the last number shows the count of similar records. Now I need to filter for the records whose count is > 2 and store them in a separate file.
I grouped the data with an SQL query in Scala. To ungroup the data, I can create a table and add the grouped data to it row by row with INSERT statements. It works, but it is very, very slow; it takes about an hour. Any ideas on how to do this in Scala would be greatly appreciated. The commands look like this:
import spark.sqlContext.implicits._
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
case class Rating(age: Double,edu: String, sex: String, salary: String)
val Result = sc.textFile("hdfs://NameNode01:9000/input/adult.csv").map(_.split(",")).map(p => Rating(p(0).trim.toDouble,p(1),p(2),p(3))).toDF()
Result.registerTempTable("Start")
val sal1=spark.sqlContext.sql("SELECT age,edu,sex,salary,count(*) as cnt from Start group by age,edu,sex,salary")
sal1.registerTempTable("adult")
val sal2=spark.sqlContext.sql("SELECT age,edu,sex,salary,cnt from adult WHERE cnt>3")
sal2.registerTempTable("adult2")
var ag=sal2.map(age => ""+age(0)).collect()
var ed=sal2.map(edu => ""+edu(1)).collect()
var se=sal2.map(sex => ""+sex(2)).collect()
var sa=sal2.map(salary => ""+salary(3)).collect()
var cn=sal2.map(cnt => ""+cnt(4)).collect()
//convert age to double
val ages= ag.map(_.toDouble)
//convert the cnt to integer
val counts= cn.map(_.toInt)
//length of the array
var cnt_length=counts.size
//create a table and add the sal2 records in it
val adlt2=spark.sqlContext.sql("CREATE TABLE adult3 (age double, edu string, sex string, salary string)")
//loop and enter the number of cn
var sql_querys="query"
var i=0
var j=0
var loop_cnt=0
for(i <- 0 to cnt_length-1){
  loop_cnt=counts(i)
  for(j <- 0 to loop_cnt-1){
    sql_querys="INSERT into adult3 values ("+ages(i)+",'"+ed(i)+"','"+se(i)+"','"+sa(i)+"')"
    val adlt3=spark.sqlContext.sql("INSERT into adult3 values ("+ages(i)+",'"+ed(i)+"','"+se(i)+"','"+sa(i)+"')")
  }
}
The main part is the loop at the end of the code.
Answer 0 (score: 2)
Here is a short solution using only RDDs:
val result = sc
  .textFile("hdfs://NameNode01:9000/input/adult.csv")
  .map({ (line: String) =>
    val p = line.split(",")
    (Rating(p(0).trim.toDouble, p(1), p(2), p(3)), 1)
  })
  .reduceByKey(_ + _)
  .filter(_._2 > 2)
  .flatMap(rating => Array.fill(rating._2)(rating._1))
It works as follows:
- textFile loads the RDD from the file.
- map transforms each line into a pair of the form (rating, 1).
- reduceByKey groups the pairs by rating and sums the 1s (i.e. counts the occurrences of each rating).
- filter discards the ratings that occur fewer than 3 times.
- flatMap repeats each rating as many times as its count and then flattens all results into a single RDD.
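Since the question also asks to store the result in a separate file, a minimal sketch for writing this RDD out could look like the following (the output path is only a placeholder, not part of the original answer):

// Format each Rating back into a CSV line and write the RDD to HDFS
// (the output path below is an assumption; adjust it to your cluster)
result
  .map(r => s"${r.age},${r.edu},${r.sex},${r.salary}")
  .saveAsTextFile("hdfs://NameNode01:9000/output/adult_filtered")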
Here are some reasons why the initial approach does not perform well:
- collect is used to read the contents of the dataframe onto the local machine, which means you immediately lose all of Spark's parallelization and cluster benefits.
- The for loop performs single insertions into a dataframe. Spark's transformations (e.g. map, filter, reduce, single SQL queries) are highly optimized to perform such actions in a distributed way. By using a for loop to perform single-row actions you lose this advantage; in addition, you may incur extreme overhead from dataframes being copied during each iteration of the loop.

Answer 1 (score: 1)
You may want to consider ungrouping your dataframe with explode, based on the groupBy count:
import org.apache.spark.sql.functions._
case class Rating(age: Double, edu: String, sex: String, salary: String)
val Result = sc.textFile("/Users/leo/projects/spark/files/testfile.csv").
map(_.split(",")).
map(p => Rating(p(0).trim.toDouble, p(1).trim, p(2).trim, p(3).trim)).
toDF
val saDF1 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt")
val saDF2 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt").where($"cnt" > 2)
// Create a UDF to fill array of 1's to be later exploded
val fillArr = (n: Int) => Array.fill(n)(1)
val fillArrUDF = udf(fillArr)
val expandedDF1 = saDF1.withColumn("arr", fillArrUDF($"cnt"))
expandedDF1.show
+----+------+---------+------+---+---------+
| age| edu| sex|salary|cnt| arr|
+----+------+---------+------+---+---------+
|33.0| Male|Bachelors| <=50K| 1| [1]|
|20.0|Female| Masters| >50K| 2| [1, 1]|
|53.0| Male| 11th| <=50K| 3|[1, 1, 1]|
+----+------+---------+------+---+---------+
// Ungroup dataframe using explode
val ungroupedDF1 = expandedDF1.withColumn("a", explode($"arr")).
  select("age", "edu", "sex", "salary")
ungroupedDF1.show
+----+------+---------+------+
| age| edu| sex|salary|
+----+------+---------+------+
|33.0| Male|Bachelors| <=50K|
|20.0|Female| Masters| >50K|
|20.0|Female| Masters| >50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
+----+------+---------+------+
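To write this ungrouped dataframe to a separate file, as the question asks, something along these lines should work on Spark 2.x or later (the output path is just a placeholder, not part of the original answer):

// Persist the ungrouped rows as CSV; the path is an assumption
ungroupedDF1.write.
  option("header", "true").
  csv("/Users/leo/projects/spark/files/ungrouped_output")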
Answer 2 (score: 1)
From what I understand of your question, you want to filter out the similar records whose count is greater than 2 and write the result to a file. If so, the following could be your solution.
You must already have the original dataframe as
+----+------+---------+------+
|age |edu |sex |salary|
+----+------+---------+------+
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|20.0|Female|Masters |>50K |
|20.0|Female|Masters |>50K |
|33.0|Male |Bachelors|<=50K |
+----+------+---------+------+
You don't need to write a complex SQL query to find the count; you can just use the built-in functions:
val columnNames = Result.columns
val finalTemp = Result.groupBy(columnNames.map(col): _*).agg(count("salary").as("similar records"))
This should output
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|53.0|Male |11th |<=50K |3 |
+----+------+---------+------+---------------+
Now, to filter, you can simply use the filter function:
val finalTable = finalTemp.filter($"similar records" < 3)
The final output is
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
You can save it to a file with
finalTable.write.format("com.databricks.spark.csv").save("output path")
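(A side note, not from the original answer: if you are on Spark 2.x or later, the CSV writer is built in, so the databricks package is not needed. The path is again a placeholder.)

// Built-in CSV writer, available from Spark 2.0 onward
finalTable.write.option("header", "true").csv("output path")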
If you want the filtered original data, you can simply use a join, as in
Result.join(finalTable, Seq(columnNames: _*)).show(false)
Output:
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
You can save this to a file as above.
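For example, a minimal sketch (the output path is a placeholder, and dropping the helper count column is optional, not part of the original answer):

// Join back to the original rows, optionally drop the helper count column, and save as CSV
// (the output path and the decision to drop the column are assumptions)
Result.join(finalTable, Seq(columnNames: _*)).
  drop("similar records").
  write.format("com.databricks.spark.csv").
  save("another output path")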
Note: you need the following import for the functions used above
import org.apache.spark.sql.functions._