Apache Spark ML Pipeline: filtering empty rows in the dataset

Date: 2018-11-19 08:46:10

Tags: scala apache-spark apache-spark-sql apache-spark-mllib apache-spark-ml

In my Spark ML pipeline (Spark 2.3.0) I use a RegexTokenizer like this:

val regexTokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setMinTokenLength(3)

It transforms the DataFrame into one with an array-of-words column, for example:

text      | words
-------------------------
a the     | [the]
a of to   | []
big small | [big, small]

How can I filter out the rows whose words array is empty? Should I create a custom transformer and pass it into the pipeline?

2 Answers:

Answer 0 (score: 1)

You can use SQLTransformer:

import org.apache.spark.ml.feature.SQLTransformer

val emptyRemover = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE size(words) > 0"
)

which can be applied directly:

import spark.implicits._  // required for toDF on a local Seq

val df = Seq(
  ("a the", Seq("the")), ("a of the", Seq[String]()),
  ("big small", Seq("big", "small"))
).toDF("text", "words")

emptyRemover.transform(df).show
+---------+------------+
|     text|       words|
+---------+------------+
|    a the|       [the]|
|big small|[big, small]|
+---------+------------+

or used inside a Pipeline, as sketched below.
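For example, a minimal sketch of the Pipeline variant, reusing the regexTokenizer from the question (the rawDf name is made up here; the tokenizer produces the words column itself, so the input only needs text):

import org.apache.spark.ml.Pipeline

// Raw input with only the text column; "words" is produced by the tokenizer stage.
val rawDf = Seq("a the", "a of to", "big small").toDF("text")

val pipeline = new Pipeline()
  .setStages(Array(regexTokenizer, emptyRemover))

// Tokenizes, then drops the rows whose words array came out empty.
pipeline.fit(rawDf).transform(rawDf).show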

That said, I would think twice before using it in a Spark ML workflow. Tools that are typically used downstream, such as CountVectorizer, handle empty input just fine:

import org.apache.spark.ml.feature.CountVectorizer

val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")

// Fitting and transforming yields the output below; empty arrays become empty vectors.
vectorizer.fit(df).transform(df).show
+---------+------------+-------------------+                 
|     text|       words|           features|
+---------+------------+-------------------+
|    a the|       [the]|      (3,[2],[1.0])|
| a of the|          []|          (3,[],[])|
|big small|[big, small]|(3,[0,1],[1.0,1.0])|
+---------+------------+-------------------+

Moreover, the absence of certain words can often provide useful information.

Answer 1 (score: -1)

import org.apache.spark.sql.functions.size
import spark.implicits._  // for the $"..." column syntax

df
  .select($"text", $"words")
  .where(size($"words") > 0)

Note that this is a plain DataFrame transformation rather than a PipelineStage, so it cannot be placed inside a Pipeline directly.
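If the filtering really has to live inside the pipeline, as the question suggests, the same logic can be wrapped in a custom transformer. A minimal sketch with the column name hard-coded (the class name EmptyWordsFilter is made up for illustration):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, size}
import org.apache.spark.sql.types.StructType

class EmptyWordsFilter(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("emptyWordsFilter"))

  // Keep only rows whose "words" array is non-empty; columns are unchanged.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF.where(size(col("words")) > 0)

  // Filtering rows does not alter the schema.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): EmptyWordsFilter = defaultCopy(extra)
}

An instance can then be passed to setStages alongside the tokenizer, just like the SQLTransformer above.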