Finding bigrams in Spark with Java (8)

Date: 2016-05-18 11:38:02

Tags: java apache-spark java-8 apache-spark-mllib

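I have tokenized my sentences into an RDD of words, and now I need bigrams. For example: This is my test => (This is), (is my), (my test)
I searched for a way to do this and found the .sliding operator, but that option is not available in my Eclipse (perhaps it only exists in a newer version of Spark). How can I get the bigrams without .sliding?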

Adding code to start with:

public static void biGram (JavaRDD<String> in)
{
    JavaRDD<String> sentence = in.map(s -> s.toLowerCase());
    //get bigram from sentence w/o sliding - CODE HERE
}

2 answers:

Answer 0: (score: 1)

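sliding is indeed the way to go for n-grams. The thing is, sliding works on ordinary Scala collections and iterators, so you can simply split each sentence and slide over the resulting array. I am adding Scala code.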

val sentences:RDD[String] = in.map(s => s.toLowerCase())
val biGrams:RDD[Iterator[Array[String]]] = sentences.map(s => s.split(" ").sliding(2))     
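Since the question asks for Java, the same split-and-slide idea can be sketched without .sliding by using a plain index loop. This is only an illustration, not part of the original answer: the class name BigramSketch and the helpers ngrams and bigrams are made up for the example, and it uses nothing but the JDK so it can be tried outside Spark. One difference from Scala's sliding is that a sentence with fewer than n words yields no n-grams here, whereas sliding would return one shorter window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
public class BigramSketch {

    // Returns every run of n consecutive words, each joined by a single space.
    // The sentence is lower-cased and split on whitespace first.
    public static List<String> ngrams(String sentence, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim().toLowerCase();
        if (trimmed.isEmpty() || n < 1) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        // Slide a window of size n over the word array.
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    // Bigrams are simply n-grams with n = 2.
    public static List<String> bigrams(String sentence) {
        return ngrams(sentence, 2);
    }

    public static void main(String[] args) {
        // prints [this is, is my, my test]
        System.out.println(bigrams("This is my test"));
    }
}
```

Inside the biGram method from the question, something like in.flatMap(s -> BigramSketch.bigrams(s)) should then produce a JavaRDD<String> of bigrams on Spark 1.x, where the flatMap function returns an Iterable; on Spark 2.x the function has to return an Iterator, so .iterator() would need to be appended to the lambda body.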

Answer 1: (score: 1)

You can use the n-gram transformer (NGram) that ships with Spark.

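import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public static void biGram (JavaRDD<String> in)
{
    //Converting each lower-cased string into a Row (map over the input RDD "in")
    JavaRDD<Row> sentence = in.map(s -> RowFactory.create(s.toLowerCase()));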

    StructType schema = new StructType(new StructField[] {
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())  
    });

    //Creating dataframe (sqlContext is assumed to be an existing SQLContext in scope)
    DataFrame dataFrame = sqlContext.createDataFrame(sentence, schema);

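    //Tokenizing sentence into words, splitting on whitespace
    //Note: setMinTokenLength(4) drops tokens shorter than 4 characters (e.g. "is", "my");
    //omit that call (the default is 1) to keep every word in the bigrams
    RegexTokenizer rt = new RegexTokenizer().setInputCol("sentence").setOutputCol("split")
            .setMinTokenLength(4)
            .setPattern("\\s+");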
    DataFrame rtDF = rt.transform(dataFrame);

    //Creating bigrams
    NGram bigram = new NGram().setInputCol(rt.getOutputCol()).setOutputCol("bigram").setN(2);  //Here setN(2) means bigram
    DataFrame bigramDF = bigram.transform(rtDF);


    System.out.println("Result :: "+bigramDF.select("bigram").collectAsList());
}