Question

I am using Apache Spark 2 to tokenize some text.

Dataset<Row> regexTokenized = regexTokenizer.transform(data);

This returns a "words" column that holds an array of strings in each row.

Dataset<Row> words = regexTokenized.select("words");

The sample data looks like this.

+--------------------+
|               words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|

Now I want to get all the unique words. I tried a few combinations of filter, flatMap, map, and reduce, but I could not figure it out because I am new to Spark.

Answer 1

Based on @Haroun Mohammedi's answer, I was able to solve this in Java.

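import static org.apache.spark.sql.functions.explode;
// Turn each array in "words" into one row per word, then drop duplicate rows.
Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();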
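For readers new to Spark, it may help to see what explode followed by distinct amounts to. The following is a minimal plain-Java sketch with no Spark involved: it flattens a list of word lists and removes duplicates. The class name `UniqueWordsSketch`, the method `uniqueWords`, and the sample rows are hypothetical and only illustrate the idea; they are not the data from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class UniqueWordsSketch {

    // Flatten every row's word list into a single stream (the "explode" step),
    // then drop duplicates (the "distinct" step). A Java stream keeps encounter
    // order here; Spark's distinct() makes no ordering guarantee.
    static List<String> uniqueWords(List<List<String>> rows) {
        return rows.stream()
                .flatMap(List::stream)
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical rows, standing in for the tokenized "words" column.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("the", "grand", "cafe"),
                Arrays.asList("wow", "the", "cafe"));
        System.out.println(uniqueWords(rows)); // prints [the, grand, cafe, wow]
    }
}
```

In Spark the same two steps run distributed across partitions, which is why the order of the resulting rows should not be relied on.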

Answer 2

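I come from Scala, but I believe there is a similar way to do this in Java.

I think that in this case you have to use the explode method to transform your data into a Dataset of words.

This code should give you the desired result: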

import org.apache.spark.sql.functions.{col, explode}
// explode expects a Column rather than a String, so wrap the column name with col
val dsWords = regexTokenized.select(explode(col("words")))
val dsUniqueWords = dsWords.distinct()

For more information about the explode method, see the official documentation.

Hope this helps.

Get unique words from a Spark Dataset in Java