I have a dataset that contains a sequence flag column with values 0 and 1.
Category Value Sequences
1 10 0
1 11 1
1 13 1
1 16 1
1 20 0
1 21 0
1 22 1
1 25 1
1 27 1
1 29 1
1 30 0
1 32 1
1 34 1
1 35 1
1 38 0
Here a run of 1s in the Sequences column occurs three times. I need to sum the Value column separately for each run; for example, the first run (11 + 13 + 16) should sum to 40.
I am trying the following code:
%livy2.spark
import org.apache.spark.rdd.RDD
val df = df.select($"Category", $"Value", $"Sequences").rdd.groupBy(x =>
  x.getInt(0)
).map(x => {
  val Category = x(0).getInt(0)
  val Value = x(0).getInt(1)
  val Sequences = x(0).getInt(2)
  for (i <- x.indices) {
    val vi = x(i).getFloat(4)
    if (vi(0) > 0) {
      summing += Value
    }
    (Category, summing)
  }
})
df_new.take(10).foreach(println)
When I run this code, I get an error saying the statement is incomplete. The value df refers to the dataset given above.
The expected output is:
Category summing
1 40
1 103
1 101
I don't know where I am going wrong. It would be great if someone could help me learn this.
Answer (score: 1)
This can be done by assigning each row a unique id, then assigning each unit row to the group bounded by the unique ids of the surrounding zero rows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
(1, 10, 0),
(1, 11, 1),
(1, 13, 1),
(1, 16, 1),
(1, 20, 0),
(1, 21, 0),
(1, 22, 1),
(1, 25, 1),
(1, 27, 1),
(1, 29, 1),
(1, 30, 0),
(1, 32, 1),
(1, 34, 1),
(1, 35, 1),
(1, 38, 0)
).toDF("Category", "Value", "Sequences")
// Assign each row a unique id
val zipped = df.withColumn("zip", monotonically_increasing_id())
// Build a range from each zero row to the next zero row
val categoryWindow = Window.partitionBy("Category").orderBy($"zip")
val groups = zipped
.filter($"Sequences" === 0)
.withColumn("rangeEnd", lead($"zip", 1).over(categoryWindow))
.withColumnRenamed("zip", "rangeStart")
println("Groups:")
groups.show(false)
// Assign each unit row to its range
val joinCondition = ($"units.zip" > $"groups.rangeStart").and($"units.zip" < $"groups.rangeEnd")
val unitsByRange = zipped
.filter($"Sequences" === 1).alias("units")
.join(groups.alias("groups"), joinCondition, "left")
.select("units.Category", "units.Value", "groups.rangeStart")
println("Units in groups:")
unitsByRange.show(false)
// Group by range
val result = unitsByRange
.groupBy($"Category", $"rangeStart")
.agg(sum("Value").alias("summing"))
.orderBy("rangeStart")
.drop("rangeStart")
println("Result:")
result.show(false)
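With the sample data above, the final result.show(false) should reproduce the expected output from the question. The exact ids from monotonically_increasing_id can vary between runs, but the per-run sums (11+13+16 = 40, 22+25+27+29 = 103, 32+34+35 = 101) do not:

+--------+-------+
|Category|summing|
+--------+-------+
|1       |40     |
|1       |103    |
|1       |101    |
+--------+-------+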
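As a side note, the same grouping can be written more compactly with a running count of zeros as the group key. This is only a sketch, not part of the answer above; it reuses zipped and categoryWindow from the code above:

// Running sum of zeros: each unit row inherits the count of the zero that opened its run
val alt = zipped
  .withColumn("grp", sum(when($"Sequences" === 0, 1).otherwise(0)).over(categoryWindow))
  .filter($"Sequences" === 1)          // keep only the unit rows
  .groupBy($"Category", $"grp")        // one group per run of 1s
  .agg(sum("Value").alias("summing"))
  .orderBy("grp")
  .drop("grp")
alt.show(false)
// Consecutive zeros produce empty groups, which simply disappear after the
// filter, so the result matches the join-based version.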