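I have been looking for a way to split a DataFrame's column values on the identifier #@@#.
When the combination is a pair, i.e. srcip#@@#destip or destip#@@#srcip, I get the expected result. However, I have another requirement for a bucket in which three values are combined, separated by the #@@# identifier.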
+--------------------------+---------------------------------------+
| attribute_name | attribute_value |
+--------------------------+---------------------------------------+
| Dest_IP#@@#Src_IP#@@#URL | 40.100.28.210#@@#10.26.195.182#@@#abc |
| Src_IP#@@#Dest_IP#@@#URL | 10.26.195.182#@@#40.100.28.210#@@#xyz |
| URL#@@#Src_IP#@@#Dest_IP | def#@@#40.100.28.210#@@#10.26.195.182 |
+--------------------------+---------------------------------------+
// The code below handles a combination of two elements.
import scala.util.Try
import org.apache.spark.sql.functions.expr
// .map on a DataFrame also needs the session's implicits (e.g. import spark.implicits._) in scope.

val identifierDf = whitelistingDf.filter(whitelistingDf("attribute_name").contains("#@@#"))
if (Try(identifierDf.head()).isSuccess) {
  // Split attribute_name and attribute_value into positional columns.
  val Df1 = identifierDf.select(
    expr("(split(attribute_name, '#@@#'))[0]").cast("string").as("attribute_name1"),
    expr("(split(attribute_name, '#@@#'))[1]").cast("string").as("attribute_name2"),
    expr("(split(attribute_value, '#@@#'))[0]").cast("string").as("attribute_value1"),
    expr("(split(attribute_value, '#@@#'))[1]").cast("string").as("attribute_value2"))
  // Collect the first element name of every row to decide which branch applies.
  val elements = Df1.select("attribute_name1").map(r => r(0).asInstanceOf[String]).collect()
  if (elements.contains("Src_IP")) {
    mainDF = foo(Df1, "src_ip", "Src_IP")
  }
  if (elements.contains("Dest_IP")) {
    mainDF = foo(Df1, "dst_ip", "Dest_IP")
  }
}
Below is the expected result I want, with the columns in the same order.
+---------------+---------------+--------------------+--------------------+----------------+-------------------+
|attribute_name1|attribute_name2| attribute_name3| attribute_value1|attribute_value2| attribute_value3|
+---------------+---------------+--------------------+--------------------+----------------+-------------------+
| Src_IP| Dest_IP| URL | 10.26.195.182| 40.100.28.210| abc |
| Src_IP| Dest_IP| URL | 10.26.195.182| 40.100.28.210| xyz |
| Src_IP| Dest_IP| URL | 10.26.195.182| 40.100.28.210| def |
+---------------+---------------+--------------------+--------------------+----------------+-------------------+
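The reordering that the expected table asks for can be sketched on a single row in plain Scala, without Spark. This is only an illustration under assumptions: the object name ReorderSketch, the function reorder, and the fixed target order Src_IP, Dest_IP, URL are hypothetical names chosen here, and each row is assumed to carry as many values as names. Names and values are paired by position and then read back in the target order.

```scala
import java.util.regex.Pattern

object ReorderSketch {
  // Delimiter from the question; quoted so it is matched literally.
  val Delimiter: String = "#@@#"
  // Hypothetical fixed output order, taken from the expected table.
  val TargetOrder: Seq[String] = Seq("Src_IP", "Dest_IP", "URL")

  /** Pairs names with values by position, then looks them up in TargetOrder.
    * A name missing from the row yields None. */
  def reorder(names: String, values: String): Seq[(String, Option[String])] = {
    val pattern = Pattern.quote(Delimiter)
    val byName: Map[String, String] =
      names.split(pattern).zip(values.split(pattern)).toMap
    TargetOrder.map(name => name -> byName.get(name))
  }

  def main(args: Array[String]): Unit = {
    val row = reorder(
      "Dest_IP#@@#Src_IP#@@#URL",
      "40.100.28.210#@@#10.26.195.182#@@#abc")
    row.foreach { case (name, value) => println(s"$name -> ${value.getOrElse("")}") }
  }
}
```

Because the pairing is positional, each value is attributed to whichever name occupies the same position in attribute_name for that row.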
Thanks to @philan, I was able to do this using the following logic:
// Needs: import org.apache.spark.sql.functions.{col, split}
// colname is assumed to hold the element names in the order they appear in attribute_name.
var count = 0
for (name <- colname) {
  println(s"$count :: $name")
  if (count < 3) {
    // Place the element found at position `count` into the slot reserved for its name.
    name match {
      case "Src_IP" =>
        identifierDf2 = identifierDf2
          .withColumn("attribute_name1", split(col("attribute_name"), "\\#@@#").getItem(count))
          .withColumn("attribute_value1", split(col("attribute_value"), "\\#@@#").getItem(count))
      case "Dest_IP" =>
        identifierDf2 = identifierDf2
          .withColumn("attribute_name2", split(col("attribute_name"), "\\#@@#").getItem(count))
          .withColumn("attribute_value2", split(col("attribute_value"), "\\#@@#").getItem(count))
      case "URL" =>
        identifierDf2 = identifierDf2
          .withColumn("attribute_name3", split(col("attribute_name"), "\\#@@#").getItem(count))
          .withColumn("attribute_value3", split(col("attribute_value"), "\\#@@#").getItem(count))
      case _ => // ignore any other element name
    }
  }
  // Advance to the next position on every iteration.
  count = count + 1
}
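The loop above applies one position per name to the whole DataFrame, so it relies on every row listing its elements in the same order. When the order varies from row to row, as in the sample table, a row-wise variant may be worth considering. The following is a minimal sketch, not the approach from the answer: it assumes Spark 2.4 or later (for map_from_arrays), that each row has as many values as names, and that no name repeats within a row. SplitSketch, reorder and targetOrder are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, map_from_arrays, split}

object SplitSketch {
  // Hypothetical fixed output order, taken from the expected table.
  val targetOrder: Seq[String] = Seq("Src_IP", "Dest_IP", "URL")

  /** Builds a per-row name -> value map, then reads it back in targetOrder. */
  def reorder(df: DataFrame): DataFrame = {
    // Pair the split names with the split values of the same row.
    val kv = map_from_arrays(
      split(col("attribute_name"), "#@@#"),
      split(col("attribute_value"), "#@@#"))
    val indexed = targetOrder.zipWithIndex
    // The name columns are constants; a value is null when the row lacks that name.
    val nameCols = indexed.map { case (n, i) => lit(n).as(s"attribute_name${i + 1}") }
    val valueCols = indexed.map { case (n, i) => kv.getItem(n).as(s"attribute_value${i + 1}") }
    df.select((nameCols ++ valueCols): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      ("Dest_IP#@@#Src_IP#@@#URL", "40.100.28.210#@@#10.26.195.182#@@#abc"),
      ("Src_IP#@@#Dest_IP#@@#URL", "10.26.195.182#@@#40.100.28.210#@@#xyz"),
      ("URL#@@#Src_IP#@@#Dest_IP", "def#@@#40.100.28.210#@@#10.26.195.182")
    ).toDF("attribute_name", "attribute_value")

    reorder(df).show(false)
    spark.stop()
  }
}
```

The map lookup keeps each value attached to the name at the same position in its own row, so no collect() or per-name branching is needed. For the third sample row this attributes 40.100.28.210 to Src_IP, since that is the value in the Src_IP position of the input.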