Create a column that counts the number of repetitions in a Spark DataFrame

Date: 2018-09-04 22:42:16

Tags: apache-spark dataframe count

I have a large DataFrame, 7 million rows long, and I need to add a column that counts how many times a given person (identified by an Integer) has appeared so far, for example:

| Reg | randomdata   |
|-----|--------------|
| 123 | yadayadayada |
| 246 | yedayedayeda |
| 123 | yadeyadeyade |
| 369 | adayeadayead |
| 123 | yadyadyadyad |

to ->

| Reg | randomdata   | count |
|-----|--------------|-------|
| 123 | yadayadayada | 1     |
| 246 | yedayedayeda | 1     |
| 123 | yadeyadeyade | 2     |
| 369 | adayeadayead | 1     |
| 123 | yadyadyadyad | 3     |

I have already done a groupBy to get the total number of repetitions, but for a machine-learning exercise I need the count up to each occurrence, so that I can derive the probability of a repetition from the number of previous occurrences.

2 answers:

Answer 0 (score: 0)

You can do it like this:

import org.apache.spark.sql.functions._

// Count how many randomdata values were collected for each Reg
def countrds = udf((rds: Seq[String]) => rds.length)

val df2 = df1.groupBy(col("Reg")).agg(collect_list(col("randomdata")).alias("rds"))
  .withColumn("count", countrds(col("rds")))
df2.select(col("Reg"), col("rds"), col("count")).show()
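
Note that df2 holds one row per Reg carrying the total count rather than a running count per original row. If you need that total on every row, a minimal sketch (assuming df1 and df2 from the snippet above; df3 is just an illustrative name) is to join back on Reg:

// Sketch: attach the per-Reg total back to every original row
val df3 = df1.join(df2.select(col("Reg"), col("count")), Seq("Reg"))
df3.show()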

Answer 1 (score: 0)

In the following we assume that randomness may mean the same random value occurring more than once; we use Spark SQL with a temp view, but the same can be done with DataFrame selects, as sketched after the output below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window._

case class xyz(k: Int, v: String)

val ds = Seq(
  xyz(1, "917799423934"),
  xyz(2, "019331224595"),
  xyz(3, "8981251522"),
  xyz(3, "8981251522"),
  xyz(4, "8981251522"),
  xyz(1, "8981251522"),
  xyz(1, "uuu4553")).toDS()

ds.createOrReplaceTempView("XYZ")

// The inner query fixes a global ordering via row_number(); the outer query
// then numbers each occurrence within a key k with dense_rank() over that ordering.
spark.sql("""select z.k, z.v, dense_rank() over (partition by z.k order by z.seq) as seq from (select k, v, row_number() over (order by k) as seq from XYZ) z""").show

This returns:

+---+------------+---+
|  k|           v|seq|
+---+------------+---+
|  1|917799423934|  1|
|  1|  8981251522|  2|
|  1|     uuu4553|  3|
|  2|019331224595|  1|
|  3|  8981251522|  1|
|  3|  8981251522|  2|
|  4|  8981251522|  1|
+---+------------+---+
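
For reference, here is a minimal DataFrame-API sketch of the same two-step logic, assuming the ds defined above; the intermediate column name ord is just illustrative:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Fix a global ordering, then rank occurrences within each k over it
val result = ds
  .withColumn("ord", row_number().over(Window.orderBy("k"))) // no partitionBy: all rows land in one partition, acceptable for a small example
  .withColumn("seq", dense_rank().over(Window.partitionBy("k").orderBy("ord")))
  .select("k", "v", "seq")
result.show()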