Replace a String with a List containing that String

Asked: 2017-11-03 11:33:38

Tags: scala apache-spark dataframe

I have a DataFrame with the following structure:

(List[String], String)

An example of two rows could be ([a,b,c], d) and ([d,e], a). I want to transform these rows into ([a,b,c], [d,e]) and ([d,e], [a,b,c]).

The DataFrame's column names are "src" and "dst".

How can I approach this problem?

What I have tried:

val result = df.map(f => {
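  // NOTE: df is captured inside the closure passed to map, so Spark must
  // serialize the DataFrame out to the executors — nested use of df like
  // this is the likely cause of the IllegalStateException shown below.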
  if(df.exists(x => x._1.contains(f._2))) {
    (f._1, df.filter(x => x._1.contains(f._2)).head._1)
  } else {
    (f._1, List(f._2))
  }
}).toDF("src", "dst")

However, this solution gives the following error:

java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2740)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

There must be a more efficient way to do this?

1 answer:

Answer 0 (score: 0)

As far as I understand from the question and the comments above, the following can be your solution.

Given the input dataframe as

+---------+---+
|src      |dst|
+---------+---+
|[a, b, c]|d  |
|[d, e]   |a  |
+---------+---+
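
For reference, a minimal sketch for building this sample dataframe (assuming a SparkSession named spark is already in scope) could be:

import spark.implicits._

// Hypothetical sample data matching the table above
val df = Seq(
  (List("a", "b", "c"), "d"),
  (List("d", "e"), "a")
).toDF("src", "dst")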

You can use a self join with a udf function as

import org.apache.spark.sql.functions._
import scala.collection.mutable

// udf used as the join condition: true when the src array contains the dst value
val joinExpr = udf((col1: mutable.WrappedArray[String], col2: String) => col1.contains(col2))

df.as("t1").join(df.as("t2"), joinExpr($"t1.src", $"t2.dst"))
  .select($"t1.src".as("src"), $"t2.src".as("dst"))
  .show(false)

which should give you the final output as

+---------+---------+
|src      |dst      |
+---------+---------+
|[a, b, c]|[d, e]   |
|[d, e]   |[a, b, c]|
+---------+---------+
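
Note that this inner join drops any row whose dst value appears in no src array, whereas the else branch in the question's code wraps such a dst in a single-element list. A sketch of a left join that preserves that behaviour, assuming the same df and joinExpr as above:

df.as("t1").join(df.as("t2"), joinExpr($"t1.src", $"t2.dst"), "left")
  .select($"t1.src".as("src"), coalesce($"t2.src", array($"t1.dst")).as("dst"))
  .show(false)

Also be aware that a udf in the join condition rules out an equi-join, so Spark has to evaluate it as a nested-loop style join, which can be expensive on large dataframes.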

Hope the answer is helpful.