How to split a dataframe based on column values with identifiers, keeping the same order

Time: 2019-06-20 07:47:52

Tags: apache-spark hadoop apache-spark-sql

I have been looking for a solution on how to split a dataframe based on a column whose values use #@@# as the identifier (separator).

When the combination is a pair, i.e. srcip#@@#destip or destip#@@#srcip, I get the expected result, but I have another requirement for buckets where a combination of three values is separated by the #@@# identifier:

+--------------------------+---------------------------------------+
| attribute_name           | attribute_value                       |
+--------------------------+---------------------------------------+
| Dest_IP#@@#Src_IP#@@#URL | 40.100.28.210#@@#10.26.195.182#@@#abc |
| Src_IP#@@#Dest_IP#@@#URL | 10.26.195.182#@@#40.100.28.210#@@#xyz |
| URL#@@#Src_IP#@@#Dest_IP | def#@@#40.100.28.210#@@#10.26.195.182 |
+--------------------------+---------------------------------------+


// The code below handles the combination of two elements.

val identifierDf = whitelistingDf.filter(whitelistingDf("attribute_name").contains("#@@#"))

        if (Try(identifierDf.head()).isSuccess) {

          // Split attribute_name and attribute_value into separate columns.
          val Df1 = identifierDf.select(
            expr("(split(attribute_name, '#@@#'))[0]").cast("string").as("attribute_name1"),
            expr("(split(attribute_name, '#@@#'))[1]").cast("string").as("attribute_name2"),
            expr("(split(attribute_value, '#@@#'))[0]").cast("string").as("attribute_value1"),
            expr("(split(attribute_value, '#@@#'))[1]").cast("string").as("attribute_value2"))

          val elements = Df1.select("attribute_name1").map(r => r(0).asInstanceOf[String]).collect()

          if (elements.contains("Src_IP")) {
            mainDF = foo(Df1, "src_ip", "Src_IP")
          }
          if (elements.contains("Dest_IP")) {
            mainDF = foo(Df1, "dst_ip", "Dest_IP")
          }
        }

Below is the expected result I want, with the same column order in every row:

+---------------+---------------+--------------------+--------------------+----------------+-------------------+
|attribute_name1|attribute_name2|     attribute_name3|    attribute_value1|attribute_value2|   attribute_value3|
+---------------+---------------+--------------------+--------------------+----------------+-------------------+
|         Src_IP|        Dest_IP|              URL   |       10.26.195.182|   40.100.28.210|   abc             |
|         Src_IP|        Dest_IP|              URL   |       10.26.195.182|   40.100.28.210|   xyz             |
|         Src_IP|        Dest_IP|              URL   |       10.26.195.182|   40.100.28.210|   def             |
+---------------+---------------+--------------------+--------------------+----------------+-------------------+
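The core transformation, independent of Spark, is: split both columns on #@@#, pair each name with its value, and emit the values in a fixed canonical order. A minimal plain-Scala sketch of that logic (the canonical order Src_IP, Dest_IP, URL is taken from the expected output above; the object and method names are illustrative):

```scala
object SplitDemo {
  val Sep = "#@@#"
  val CanonicalOrder = Seq("Src_IP", "Dest_IP", "URL")

  // Reorder one row's values into the canonical column order,
  // regardless of how the attribute names were ordered in the input.
  def reorder(attributeName: String, attributeValue: String): Seq[String] = {
    val names  = attributeName.split(Sep)
    val values = attributeValue.split(Sep)
    val byName = names.zip(values).toMap  // e.g. Map("Dest_IP" -> "40.100.28.210", ...)
    CanonicalOrder.map(byName)            // values in Src_IP, Dest_IP, URL order
  }

  def main(args: Array[String]): Unit = {
    val row = reorder("Dest_IP#@@#Src_IP#@@#URL",
                      "40.100.28.210#@@#10.26.195.182#@@#abc")
    println(row.mkString(","))  // 10.26.195.182,40.100.28.210,abc
  }
}
```

In Spark, the same idea maps to one `withColumn` per canonical slot, with `split(...).getItem(i)` selecting the value at the position where that attribute name appears.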

Thanks to @philan I was able to do this with the following logic:

          var count = 0
          for (name <- colname) {

            println(s"$count :: $name")

            if (count < 3) {
              name match {

                case "Src_IP" => identifierDf2 = identifierDf2
                  .withColumn("attribute_name1", split(col("attribute_name"), "\\#@@#").getItem(count))
                  .withColumn("attribute_value1", split(col("attribute_value"), "\\#@@#").getItem(count))

                case "Dest_IP" => identifierDf2 = identifierDf2
                  .withColumn("attribute_name2", split(col("attribute_name"), "\\#@@#").getItem(count))
                  .withColumn("attribute_value2", split(col("attribute_value"), "\\#@@#").getItem(count))

                case "URL" => identifierDf2 = identifierDf2
                  .withColumn("attribute_name3", split(col("attribute_name"), "\\#@@#").getItem(count))
                  .withColumn("attribute_value3", split(col("attribute_value"), "\\#@@#").getItem(count))

                case _ => // ignore any other attribute name instead of throwing a MatchError
              }
            }
            count = count + 1
          }
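The mutable `count` above can be avoided with `zipWithIndex`. A plain-Scala sketch of the same name-to-position mapping (`Slot`, `slots`, and the suffix numbering 1..3 are illustrative names, not part of the original code):

```scala
object IndexDemo {
  // Map each recognised attribute name to the output column suffix it belongs in.
  val Slot = Map("Src_IP" -> 1, "Dest_IP" -> 2, "URL" -> 3)

  // For a header like "Dest_IP#@@#Src_IP#@@#URL", return pairs of
  // (output column suffix, position of that attribute in the row),
  // silently skipping unrecognised names.
  def slots(attributeName: String): Seq[(Int, Int)] =
    attributeName.split("#@@#").toSeq.zipWithIndex.collect {
      case (name, idx) if Slot.contains(name) => (Slot(name), idx)
    }
}
```

Each resulting pair `(slot, idx)` would then drive one `withColumn(s"attribute_name$slot", split(col("attribute_name"), "\\#@@#").getItem(idx))` call, replacing both the counter and the explicit `match`.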

0 Answers:

There are no answers.