I need to create a custom feature transformer in Spark (Scala). For example, I have the following DataFrame:
+------------------------+
|email_list              |
+------------------------+
|testmail1115@gmail.com  |
|mavenmaven@mlail.com    |
|dnd.7899334622@gmail.com|
+------------------------+
If I use a transformer that converts an input array of strings into an array of n-grams, I get the following:
+------------------------+--------------------+
|email_list              |ngrams              |
+------------------------+--------------------+
|testmail1115@gmail.com  |[t e, e s, s t, t...|
|mavenmaven@mlail.com    |[m a, a v, v e, e...|
|dnd.7899334622@gmail.com|[d n, n d, d...     |
+------------------------+--------------------+
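To make the expected output concrete, here is a minimal plain-Scala sketch (no Spark needed) of the character bigrams shown in the table, assuming the tokens are the individual characters before the `@`. The names `NgramSketch` and `charBigrams` are hypothetical and only used for illustration; they are not part of Spark.

```scala
object NgramSketch {
  // Hypothetical helper: character bigrams of the part of an email before '@'.
  // Mirrors the "t e, e s, s t, ..." shape shown in the table.
  def charBigrams(email: String): List[String] = {
    val localPart = email.takeWhile(_ != '@')   // text before the '@'
    localPart.toList
      .map(_.toString)                          // one token per character
      .sliding(2)                               // consecutive pairs of tokens
      .filter(_.size == 2)                      // drop a short window when there are < 2 tokens
      .map(_.mkString(" "))                     // join each pair with a space
      .toList
  }
}
```

For example, `NgramSketch.charBigrams("testmail1115@gmail.com")` starts with `t e`, `e s`, `s t`.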
How can I show the distinct n-grams, rather than the schema or the whole array, using the code below:
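import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, split}
// Keep the part before "@", then split it into single-character tokens
val emailD1F = emailDF.withColumn("email_split", split(col("email_list"), "@").getItem(0)).withColumn("email_split", split(col("email_split"), ""))
// The input column must be the array column created above ("col1" does not exist in this DataFrame)
val ngram = new NGram().setN(2).setInputCol("email_split").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(emailD1F)
ngramDataFrame.show()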
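One possible way to list the distinct n-grams is to explode the array column into one row per n-gram and then call `distinct()`. The following is only a sketch under some assumptions: the input DataFrame has a string column named `email_list`, and the names `DistinctNgramsSketch`, `distinctNgrams`, and the output column `ngram` are hypothetical, chosen for illustration.

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, split}

object DistinctNgramsSketch {
  // Hypothetical helper: returns one row per distinct character bigram
  // found in the part of `email_list` before the "@".
  def distinctNgrams(emailDF: DataFrame): DataFrame = {
    // Keep the text before "@" and split it into single-character tokens
    val tokens = emailDF.withColumn(
      "email_split",
      split(split(col("email_list"), "@").getItem(0), "")
    )

    val ngram = new NGram()
      .setN(2)
      .setInputCol("email_split")
      .setOutputCol("ngrams")

    ngram.transform(tokens)
      .select(explode(col("ngrams")).as("ngram")) // one row per n-gram
      .distinct()                                 // drop repeated n-grams
  }
}
```

Calling `DistinctNgramsSketch.distinctNgrams(emailDF).show(false)` would then print the individual n-grams without truncation, instead of one array per email.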