I have a large data frame, 7 million rows long, and I need to add a column that counts how many times a given person (identified by an Integer) has appeared so far, for example:
| Reg | randomdata   |
| 123 | yadayadayada |
| 246 | yedayedayeda |
| 123 | yadeyadeyade |
| 369 | adayeadayead |
| 123 | yadyadyadyad |
->
| Reg | randomdata   | count |
| 123 | yadayadayada | 1     |
| 246 | yedayedayeda | 1     |
| 123 | yadeyadeyade | 2     |
| 369 | adayeadayead | 1     |
| 123 | yadyadyadyad | 3     |
I have already done a groupBy to find out how many times each Reg repeats in total, but I need this running count for a machine learning exercise, to derive the probability of a repetition from the number of previous occurrences.
Answer 0 (score: 0)
You can do it like this:
import org.apache.spark.sql.functions._

val countrds = udf((rds: Seq[String]) => rds.length)  // length of the collected list
val df2 = df1.groupBy(col("Reg")).agg(collect_list(col("randomdata")).alias("rds"))
  .withColumn("count", countrds(col("rds")))
df2.select(col("Reg"), col("rds"), col("count")).show()
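Note that df2 has one row per Reg carrying the total count, not the per-row running count shown in the question's example. If the total per row is enough, a minimal sketch of joining it back onto the original rows (reusing df1 and df2 from above; column names follow the question):

// Attach each Reg's total occurrence count to every original row.
val df3 = df1.join(df2.select(col("Reg"), col("count")), Seq("Reg"))
df3.show()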
Answer 1 (score: 0)
In the following we assume that "randomness" means the same random value can occur more than once. We use Spark SQL with a temp view, but it could also be done with the DataFrame API and select (see the sketch after the output below):
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for .toDS()

case class xyz(k: Int, v: String)

val ds = Seq(
  xyz(1, "917799423934"),
  xyz(2, "019331224595"),
  xyz(3, "8981251522"),
  xyz(3, "8981251522"),
  xyz(4, "8981251522"),
  xyz(1, "8981251522"),
  xyz(1, "uuu4553")).toDS()

ds.createOrReplaceTempView("XYZ")

// The inner row_number fixes a stable ordering over all rows; dense_rank
// then numbers each occurrence within a key, yielding the running count.
spark.sql("""select z.k, z.v, dense_rank() over (partition by z.k order by z.seq) as seq
             from (select k, v, row_number() over (order by k) as seq from XYZ) z""").show
returning:
+---+------------+---+
| k| v|seq|
+---+------------+---+
| 1|917799423934| 1|
| 1| 8981251522| 2|
| 1| uuu4553| 3|
| 2|019331224595| 1|
| 3| 8981251522| 1|
| 3| 8981251522| 2|
| 4| 8981251522| 1|
+---+------------+---+
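As mentioned above, the same running count can also be written with the DataFrame API instead of SQL. A sketch, assuming the ds dataset from above is in scope (a window without a partition, like the global row_number here, pulls all rows into a single partition, mirroring the SQL version's behaviour):

import org.apache.spark.sql.functions.{row_number, dense_rank}
import org.apache.spark.sql.expressions.Window

// Fix a stable global order, then rank occurrences within each key.
val counted = ds
  .withColumn("seq", row_number().over(Window.orderBy("k")))
  .withColumn("count", dense_rank().over(Window.partitionBy("k").orderBy("seq")))
  .select("k", "v", "count")
counted.show()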