In a Spark window, how do I fill with null when the selected rows are fewer than the window size?

Asked: 2017-08-17 01:04:55

Tags: scala apache-spark

Suppose I have a DataFrame like this:

machine_id | value
         1 |     5
         1 |     3
         1 |     4

I want to produce a final DataFrame like this:

machine_id | value |  sum
         1 |     5 | null
         1 |     3 |    8
         1 |     4 |    7

Basically I have to use a window of size 2, but for the first row we don't want the sum to be computed over an incomplete window; it should just be filled with null. This is what I tried:

var winSpec = Window.orderBy("machine_id").partitionBy("machine_id").rangeBetween(-1, 0)
df.withColumn("sum", sum("value").over(winSpec))

2 Answers:

Answer 0 (score: 1):

You can use the lag function, building the sum column as value + lag(value, 1):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, monotonically_increasing_id}
import spark.implicits._ // spark is the SparkSession; available by default in spark-shell

val df = Seq((1, 5), (1, 3), (1, 4)).toDF("machine_id", "value")

// order rows within each machine_id partition by a generated row id
val window = Window.partitionBy("machine_id").orderBy("id")

(df.withColumn("id", monotonically_increasing_id())
   .withColumn("sum", $"value" + lag($"value", 1).over(window)) // lag is null on the first row
   .drop("id").show())

+----------+-----+----+
|machine_id|value| sum|
+----------+-----+----+
|         1|    5|null|
|         1|    3|   8|
|         1|    4|   7|
+----------+-----+----+
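
Why this works: lag($"value", 1) returns null on the first row of each partition, and adding null to a value in Spark SQL propagates null, so the first row's sum comes out as null without any special handling.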

Answer 1 (score: 0):

You should be using the rowsBetween API rather than rangeBetween, as shown below, which should give you the expected result:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $ column syntax

// rowsBetween(-1, 0): the frame is the previous row plus the current row
val winSpec = Window.partitionBy("machine_id").orderBy("machine_id").rowsBetween(-1, 0)

df.withColumn("sum", sum("value").over(winSpec))
  .withColumn("sum", when($"sum" === $"value", null).otherwise($"sum")) // first row: frame sum equals own value, so null it
  .show(false)

I hope the answer is helpful.
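
As a further note, here is a minimal alternative sketch (not from either answer, reusing the df from answer 0): instead of comparing sum to value, which would also null out a row whose previous value happens to be 0, compare the window's row count to the intended frame size.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// frame of the previous row plus the current row, ordered by a generated id
val winSpec = Window.partitionBy("machine_id").orderBy("id").rowsBetween(-1, 0)

df.withColumn("id", monotonically_increasing_id())
  // when() without otherwise() yields null, so rows with an incomplete frame become null
  .withColumn("sum", when(count("value").over(winSpec) === 2, sum("value").over(winSpec)))
  .drop("id")
  .show(false)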