Question

I am looking for a standard or preferred way to do this: I have a number of text files, each in the following format:

file1:

word1 tag1 0.5
word1 tag2 0.4
word2 tag2 0.6

file2:

word1 tag3 0.7
word3 tag1 0.9
word2 tag2 0.3

As you can see, a word may have multiple "tags". In that case I only need to keep the one with the highest score within each file, so that

word1 tag2 0.4

is removed.

Expected result:

word1 tag1 0.5
word1 tag3 0.7
word2 tag2 0.6
word3 tag1 0.9
word2 tag2 0.3 //keep this because it is from file2

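I know I can read each file as a separate RDD, then sort and merge/join them to produce my result, but is there a better way? For example, by feeding in all the input files at once using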

ctx.textFile(String.join(",", myfiles)); // myfiles = [file1, file2]

Thanks,

Answer 1

You need to do two things: add the file name to the data, then find the relevant maximum.
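Before involving Spark, the two steps can be sketched on plain Scala collections. This is only an illustration of the logic, using the sample data from the question; the `Entry` case class and its field names are made up for this example.

```scala
// One record per input line, tagged with the file it came from.
// `Entry` and its field names are hypothetical, chosen for this sketch.
final case class Entry(file: String, word: String, tag: String, score: Double)

// Sample data copied from the question.
val entries = Seq(
  Entry("file1", "word1", "tag1", 0.5),
  Entry("file1", "word1", "tag2", 0.4),
  Entry("file1", "word2", "tag2", 0.6),
  Entry("file2", "word1", "tag3", 0.7),
  Entry("file2", "word3", "tag1", 0.9),
  Entry("file2", "word2", "tag2", 0.3)
)

// Group by (file, word) and keep the highest-scoring entry of each group.
val best: Seq[Entry] = entries
  .groupBy(e => (e.file, e.word))
  .values
  .map(_.maxBy(_.score))
  .toSeq
  .sortBy(e => (e.file, e.word))

best.foreach(e => println(s"${e.word} ${e.tag} ${e.score}"))
```

Grouping on the file name as well as the word is what keeps `word2 tag2 0.3` from file2 alongside `word2 tag2 0.6` from file1.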

import org.apache.spark.sql.functions.input_file_name
val df = spark.read.option("sep", " ").csv("/file*").withColumn("filename", input_file_name())

This results in:

scala> df.show()
+-----+----+---+--------------------+
|  _c0| _c1|_c2|            filename|
+-----+----+---+--------------------+
|word1|tag1|0.5|hdfs://hmaster:90...|
|word1|tag2|0.4|hdfs://hmaster:90...|
|word2|tag2|0.6|hdfs://hmaster:90...|
|word1|tag3|0.7|hdfs://hmaster:90...|
|word3|tag1|0.9|hdfs://hmaster:90...|
|word2|tag2|0.3|hdfs://hmaster:90...|
+-----+----+---+--------------------+
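Read this way, every column is a string, including the score. If typed and named columns are preferred, one option is to pass an explicit schema to the reader. The following is a minimal sketch under a few assumptions: it runs against a local session, it writes the question's sample data into a temporary directory so that it is runnable on its own, and the column names `word`, `tag` and `score` are invented for the example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("typed-read").getOrCreate()

// Write the question's sample data to a temporary directory (assumption: local, Unix-like file system).
val dir = Files.createTempDirectory("words")
Files.write(dir.resolve("file1"),
  "word1 tag1 0.5\nword1 tag2 0.4\nword2 tag2 0.6\n".getBytes(StandardCharsets.UTF_8))
Files.write(dir.resolve("file2"),
  "word1 tag3 0.7\nword3 tag1 0.9\nword2 tag2 0.3\n".getBytes(StandardCharsets.UTF_8))

// Hypothetical column names; the score is parsed as a double instead of a string.
val schema = new StructType()
  .add("word", StringType)
  .add("tag", StringType)
  .add("score", DoubleType)

val typedDF = spark.read
  .option("sep", " ")
  .schema(schema)
  .csv(s"file://$dir/file*")
  .withColumn("filename", input_file_name())

typedDF.printSchema()
```

With a numeric `score` column, later comparisons are done on numbers rather than on the text of each value.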

Now we want to keep only the relevant rows. We can do that with argmax (see https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3170497669323442/2840265927289860/latest.html):

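import org.apache.spark.sql.functions._
// _c2 is read as a string, so cast it to double so that the max compares numbers
val targetDF = df.groupBy($"_c0", $"filename").agg(max(struct($"_c2".cast("double").as("_c2"), $"_c1")) as "tmp").select($"_c0", $"tmp.*")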

Result:

scala> targetDF.show()
+-----+---+----+
|  _c0|_c2| _c1|
+-----+---+----+
|word2|0.3|tag2|
|word1|0.7|tag3|
|word2|0.6|tag2|
|word1|0.5|tag1|
|word3|0.9|tag1|
+-----+---+----+
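Taking the max of a struct acts as an argmax because structs are compared field by field, in order, so putting the score first makes it decide the winner and the tag simply travels along with it. Plain Scala tuples are ordered the same way, which gives a small way to see the idea without Spark (the values are the two `word1` rows of file1 from the question):

```scala
// (score, tag) pairs for word1 in file1, taken from the question.
val candidates = Seq((0.5, "tag1"), (0.4, "tag2"))

// Tuples are ordered by their first element, then by the second,
// so the maximum is the pair with the highest score.
val (bestScore, bestTag) = candidates.max

println(s"$bestTag $bestScore")
```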

You should of course rename the columns.
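One way to do the renaming is with `withColumnRenamed`. The sketch below uses a small stand-in DataFrame with the same column layout as the result above, so that it runs on its own; the new names `word`, `tag` and `score` are only suggestions.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("rename-columns").getOrCreate()
import spark.implicits._

// Stand-in for the result above: same column order (_c0, _c2, _c1).
val sample = Seq(("word1", 0.5, "tag1"), ("word1", 0.7, "tag3"))
  .toDF("_c0", "_c2", "_c1")

// Hypothetical target names; the final select restores the word/tag/score order of the input files.
val renamed = sample
  .withColumnRenamed("_c0", "word")
  .withColumnRenamed("_c2", "score")
  .withColumnRenamed("_c1", "tag")
  .select("word", "tag", "score")

renamed.show()
```

If the source file of each row matters, the `filename` grouping column can be kept in the final `select` of the aggregation as well.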
