Generating N-gram strings from a JavaRDD<String>

Time: 2017-10-15 04:15:50

Tags: java apache-spark rdd

I currently have a list of words in a JavaRDD<String> words. How can I convert it into another JavaRDD<String> NGram that contains the N-grams formed from that list of words?

Here is the code I have so far:

    List<String> word = words.collect();
    List<String> list = new ArrayList<String>();
    for (int i = 0; i < (word.size() - n + 1); i++) {
        String nGram = word.get(i);
        for (int j = 1; j < n; j++) {
            nGram = nGram + " " + word.get(i + j);
        }
        // Add n-gram to a list
        list.add(nGram);
    }
    JavaRDD<String> NGram = sc.parallelize(list);

However, I know that collecting an RDD into a List is slow. Is there any way to convert the RDD words directly into the RDD NGram, without going through collect()?
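One possible direction, offered only as a sketch and not as a confirmed solution: the sliding-window logic can work on an Iterator instead of an indexed List, which is the shape of input a per-partition function receives. The class NGramSketch and its method ngrams below are hypothetical names introduced for illustration; they are not part of Spark or of the original post. The sketch is plain Java and only covers the words it is handed, so N-grams that span two partitions would be missed unless the boundaries are handled separately.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper (not from the original post): builds n-grams from a
// stream of words with a sliding window, without indexing into a List.
final class NGramSketch {

    static List<String> ngrams(Iterator<String> words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        // Holds at most the last n words seen so far, oldest first.
        Deque<String> window = new ArrayDeque<>(n);
        while (words.hasNext()) {
            window.addLast(words.next());
            if (window.size() > n) {
                window.removeFirst(); // drop the oldest word
            }
            if (window.size() == n) {
                // Join the current window with single spaces, as in the question.
                result.add(String.join(" ", window));
            }
        }
        return result;
    }
}
```

Assuming a Spark 2.x Java API, where the function passed to mapPartitions returns an Iterator, this helper could be plugged in roughly as words.mapPartitions(it -> NGramSketch.ngrams(it, n).iterator()). That call is an assumption about the setup in the question and has not been verified against it.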

0 answers:

No answers