错误"窗口功能不支持"当在scala中的Spark中执行的数据帧列上操作时发生

时间:2017-07-04 15:26:25

标签: scala apache-spark dataframe

我有以下原始数据,我需要清理它:

 03:35:20.299037 IP 10.0.0.1 > 10.0.0.2: ICMP echo request, id 8321, seq 17, length 64
    03:35:20.327290 IP 10.0.0.1 > 10.0.0.3: ICMP echo reply, id 8321, seq 17, length 64
    03:35:20.330845 IP 10.0.0.1 > 10.0.0.3: ICMP echo request, id 8311, seq 19, length 64
 03:35:20.330892 IP 10.0.0.1 > 10.0.0.3: ICMP echo request, id 8321, seq 17, length 64
    03:35:20.330918 IP 10.0.0.1 > 10.0.0.3: ICMP echo reply, id 8321, seq 17, length 64
    03:35:20.330969 IP 10.0.0.1 > 10.0.0.4: ICMP echo request, id 8311, seq 19, length 64
  03:35:20.331041 IP 10.0.0.1 > 10.0.0.4: ICMP echo request, id 8311, seq 19, length 64

我需要输出以下内容:

+---------------+-----------+-------------+----+------+---+----+
|   time_stamp_0|sender_ip_1|receiver_ip_2|rank|count | xi|pi  |
+---------------+-----------+-------------+----+------+---+----+
|03:35:20.299037|   10.0.0.1|     10.0.0.2|   1|     7| 1 |0.14|  
|03:35:20.327290|   10.0.0.1|     10.0.0.3|   1|     7| 4 |0.57|
|03:35:20.330845|   10.0.0.1|     10.0.0.3|   2|     7| 4 |0.57|
|03:35:20.330892|   10.0.0.1|     10.0.0.3|   3|     7| 4 |0.57|
|03:35:20.330918|   10.0.0.1|     10.0.0.3|   4|     7| 4 |0.57|
|03:35:20.330969|   10.0.0.1|     10.0.0.4|   1|     7| 2 |0.28|
|03:35:20.331041|   10.0.0.1|     10.0.0.4|   2|     7| 2 |0.28|

根据上述数据框架,我需要将每个源IP和目标IP的重复总数设为" xi"列,总行数为" count"列和xi / count的划分为" pi"柱。当我想计算xi时我的问题就会出现,我得到以下错误:

   Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression 'count#11L' not supported within a window function.;;
Project [time_stamp_0#3, sender_ip_1#4, receiver_ip_2#5, count#11L, rank#20, xi#45, pi#90]
+- Project [time_stamp_0#3, sender_ip_1#4, receiver_ip_2#5, count#11L, rank#20, xi#45, _we0#109L, (cast(xi#45 as double) / cast(_we0#109L as double)) AS pi#90]
   +- Window [count#11L windowspecdefinition(sender_ip_1#4, receiver_ip_2#5, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS _we0#109L], [sender_ip_1#4, receiver_ip_2#5]

    ...

我有以下代码:

      val customSchema = StructType(Array(
      StructField("time_stamp_0", StringType, true),
      StructField("sender_ip_1", StringType, true),
      StructField("receiver_ip_2", StringType, true)))

    ///////////////////////////////////////////////////make train dataframe
    val Dstream_Train = sc.textFile("/Users/saeedtkh/Desktop/sharedsaeed/Test/dsetcopy.txt")

    val Row_Dstream_Train = Dstream_Train.map(line => line.split(",")).map(array => {

      val array1 = array(0).trim.split("IP")
      val array2 = array1(1).split(">")
      val array3 = array2(1).split(":")

      val first = Try(array1(0).trim) getOrElse ""
      val second = Try(array2(0).trim) getOrElse ""
      val third = Try(array3(0)) getOrElse ""

      Row.fromSeq(Seq(first, second, third))
    })

    val Frist_Dataframe = session.createDataFrame(Row_Dstream_Train, customSchema)
    val columns1and2_count = Window.partitionBy("sender_ip_1", "receiver_ip_2")
    val columns1and2_rank = Window.partitionBy("sender_ip_1", "receiver_ip_2").orderBy("time_stamp_0")
    // <-- matches groupBy
    //Add rank() to df
    val Dataframe_add_rank = Frist_Dataframe.withColumn("count", count($"receiver_ip_2") over columns1and2_count).distinct()
    //Add cout to df
    val Dataframe_add_rank_count = Dataframe_add_rank.withColumn("rank", rank() over columns1and2_rank).distinct()// Dataframe.show()
    //Add x(i) to df
    val Dataframe_add_rank_count_xi = Dataframe_add_rank_count.withColumn("xi", rank() over columns1and2_rank).distinct()// Dataframe.show()
    //Add p(i)=maxrank(x(i))/count
    val Dataframe_add_rank_count_xi_pi = Dataframe_add_rank_count_xi.withColumn("pi", $"xi" / ($"count").over(columns1and2_count))// Dataframe.show()
  val std_dev=Dataframe_add_rank_count.agg(stddev_pop($"rank"))
    val stdDevValue = std_dev.head.getDouble(0)
    std_dev.show()
    //Add attack status

    val final_add_count_rank_xi_attack = Dataframe_add_rank_count_xi_pi.withColumn("attack", when($"rank" < stdDevValue , 0).otherwise(1))


Dataframe_add_rank_count_xi_pi.show()

更新 根据答案一,我得到了以下输出:

+---------------+-----------+-------------+-----+----+---+---+------+
|   time_stamp_0|sender_ip_1|receiver_ip_2|count|rank| xi| pi|attack|
+---------------+-----------+-------------+-----+----+---+---+------+
|03:35:20.330969|   10.0.0.1|     10.0.0.4|    2|   1|  4|2.0|     0|
|03:35:20.331041|   10.0.0.1|     10.0.0.4|    2|   2|  4|2.0|     1|
|03:35:20.299037|   10.0.0.1|     10.0.0.2|    1|   1|  4|4.0|     0|
|03:35:20.327290|   10.0.0.1|     10.0.0.3|    4|   1|  4|1.0|     0|
|03:35:20.330845|   10.0.0.1|     10.0.0.3|    4|   2|  4|1.0|     1|
|03:35:20.330892|   10.0.0.1|     10.0.0.3|    4|   3|  4|1.0|     1|
|03:35:20.330918|   10.0.0.1|     10.0.0.3|    4|   4|  4|1.0|     1|
+---------------+-----------+-------------+-----+----+---+---+------+

所以xi不正确。 :( 你能帮助我吗?提前致谢。

1 个答案:

答案 0 :(得分:2)

如果您希望使用xi计数填充max列。如你所说

  
    

每个源和目标IP的最大重复次数为“xi”列

  

然后你应该做

val maxForXI = Dataframe_add_rank_count.agg(max("rank")).first.get(0)
val Dataframe_add_rank_count_xi = Dataframe_add_rank_count.withColumn("xi", lit(maxForXI))

你应该得到

+---------------+-----------+-------------+-----+----+---+
|time_stamp_0   |sender_ip_1|receiver_ip_2|count|rank|xi |
+---------------+-----------+-------------+-----+----+---+
|03:35:20.330969|10.0.0.1   | 10.0.0.4    |2    |1   |4  |
|03:35:20.331041|10.0.0.1   | 10.0.0.4    |2    |2   |4  |
|03:35:20.299037|10.0.0.1   | 10.0.0.2    |1    |1   |4  |
|03:35:20.327290|10.0.0.1   | 10.0.0.3    |4    |1   |4  |
|03:35:20.330845|10.0.0.1   | 10.0.0.3    |4    |2   |4  |
|03:35:20.330892|10.0.0.1   | 10.0.0.3    |4    |3   |4  |
|03:35:20.330918|10.0.0.1   | 10.0.0.3    |4    |4   |4  |
+---------------+-----------+-------------+-----+----+---+

您说的最后一栏pi

  
    

将xi / count划分为“pi”列

  

所以你应该这样做

val Dataframe_add_rank_count_xi_pi = Dataframe_add_rank_count_xi.withColumn("pi", $"xi" / $"count")

你应该有所需的结果

+---------------+-----------+-------------+-----+----+---+---+
|   time_stamp_0|sender_ip_1|receiver_ip_2|count|rank| xi| pi|
+---------------+-----------+-------------+-----+----+---+---+
|03:35:20.330969|   10.0.0.1|     10.0.0.4|    2|   1|  4|2.0|
|03:35:20.331041|   10.0.0.1|     10.0.0.4|    2|   2|  4|2.0|
|03:35:20.299037|   10.0.0.1|     10.0.0.2|    1|   1|  4|4.0|
|03:35:20.327290|   10.0.0.1|     10.0.0.3|    4|   1|  4|1.0|
|03:35:20.330845|   10.0.0.1|     10.0.0.3|    4|   2|  4|1.0|
|03:35:20.330892|   10.0.0.1|     10.0.0.3|    4|   3|  4|1.0|
|03:35:20.330918|   10.0.0.1|     10.0.0.3|    4|   4|  4|1.0|
+---------------+-----------+-------------+-----+----+---+---+

<强>被修改

由于编辑了问题,最终编辑的答案如下,(这些是定义窗口函数后的代码)

//Add rank() to df
val cnt = Frist_Dataframe.count
val Dataframe_add_rank = Frist_Dataframe.withColumn("count", lit(cnt))
//Add cout to df
val Dataframe_add_rank_count = Dataframe_add_rank.withColumn("rank", rank() over columns1and2_rank).distinct()// Dataframe.show()
//Add x(i) to df
val Dataframe_add_rank_count_xi = Dataframe_add_rank_count.withColumn("xi", count($"receiver_ip_2") over columns1and2_count)// Dataframe.show()
//Add p(i)=maxrank(x(i))/count
val Dataframe_add_rank_count_xi_pi = Dataframe_add_rank_count_xi.withColumn("pi", $"xi" / $"count")// Dataframe.show()

val std_dev=Dataframe_add_rank_count.agg(stddev_pop($"rank"))
val stdDevValue = std_dev.head.getDouble(0)
std_dev.show()
//Add attack status

val final_add_count_rank_xi_attack = Dataframe_add_rank_count_xi_pi.withColumn("attack", when($"rank" < stdDevValue , 0).otherwise(1))
final_add_count_rank_xi_attack.show()

我希望你这次能找到合适的桌子。