How to create a group of ngrams in Spark?

Asked: 2018-01-26 12:07:21

Tags: scala apache-spark

I am using Scala to extract ngrams from a Spark 2.2 DataFrame column, like so (trigrams in this example):

val ngram = new NGram().setN(3).setInputCol("incol").setOutputCol("outcol")

How can I create an output column that contains everything from 1-grams through 5-grams? So it might be something like:

val ngram = new NGram().setN(1:5).setInputCol("incol").setOutputCol("outcol")

but that does not work. I could loop over N and create a new DataFrame for each value of N, but that seems inefficient. Can anyone point me in the right direction, as my Scala is rather clumsy?
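(For reference, the per-N loop described above could be sketched roughly as follows; df is assumed to be the input DataFrame, and column names like outcol_1 are illustrative, not from the question:)

import org.apache.spark.ml.feature.NGram

// Naive approach: apply one NGram transformer per value of N,
// adding an extra output column on each pass.
val withAllGrams = (1 to 5).foldLeft(df) { (acc, i) =>
  new NGram().setN(i)
    .setInputCol("incol")
    .setOutputCol(s"outcol_$i")
    .transform(acc)
}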

1 Answer:

Answer 0 (score: 4)

If you want to combine these into vectors, you can rewrite the Python answer by zero323:



import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline

def buildNgrams(inputCol: String = "tokens", 
                 outputCol: String = "features", n: Int = 3) = {

  // One NGram transformer per value of n, each writing to its own column.
  val ngrams = (1 to n).map(i =>
      new NGram().setN(i)
        .setInputCol(inputCol).setOutputCol(s"${i}_grams")
  )

  // One CountVectorizer per n-gram column, turning grams into count vectors.
  val vectorizers = (1 to n).map(i =>
     new CountVectorizer()
      .setInputCol(s"${i}_grams")
      .setOutputCol(s"${i}_counts")
  )

  // Concatenate all the count vectors into a single feature vector.
  val assembler = new VectorAssembler()
    .setInputCols(vectorizers.map(_.getOutputCol).toArray)
    .setOutputCol(outputCol)

  // Chain everything: the n-gram stages, then the vectorizers, then the assembler.
  new Pipeline().setStages((ngrams ++ vectorizers :+ assembler).toArray)

}

// Sample input; toDF requires spark.implicits._ outside the shell.
val df = Seq((1, Seq("a", "b", "c", "d"))).toDF("id", "tokens")

Result:

buildNgrams().fit(df).transform(df).show(1, false)
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |id |tokens      |1_grams     |2_grams        |3_grams       |1_counts                       |2_counts                 |3_counts           |features                             |
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |1  |[a, b, c, d]|[a, b, c, d]|[a b, b c, c d]|[a b c, b c d]|(4,[0,1,2,3],[1.0,1.0,1.0,1.0])|(3,[0,1,2],[1.0,1.0,1.0])|(2,[0,1],[1.0,1.0])|[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]|
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
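The answer also notes that this is simpler with a udf; that code block did not survive in this copy, so the following is only a minimal sketch of the idea (ngramsUdf and the output column name are illustrative):

import org.apache.spark.sql.functions.{col, lit, udf}

// Sketch: emit every 1- to n-gram as a space-joined string in a single
// pass over the token array, without building intermediate columns.
val ngramsUdf = udf((tokens: Seq[String], n: Int) =>
  (1 to n).flatMap(i =>
    tokens.sliding(i).filter(_.size == i).map(_.mkString(" "))
  )
)

df.withColumn("ngrams", ngramsUdf(col("tokens"), lit(5))).show(false)

Unlike the pipeline above, this produces a single array of n-gram strings rather than assembled count vectors.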