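When I try to run the PrefixSpan algorithm in Spark MLlib, I get this error:
The method run(JavaRDD<Sequence>) in the type PrefixSpan is not applicable for the arguments (JavaRDD<List<String>>)
The code I saw on the website is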
JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList(
    Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)),
    Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)),
    Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)),
    Arrays.asList(Arrays.asList(6))), 2);
PrefixSpan prefixSpan = new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5);
PrefixSpanModel<Integer> model = prefixSpan.run(sequences);
for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) {
    System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq());
}
My code is
List<List<String>> sequences = createLists(featuresForAlgo);
JavaRDD<List<String>> rdd = sc.parallelize(sequences);
PrefixSpan prefixSpan = new PrefixSpan()
    .setMinSupport(0.5)
    .setMaxPatternLength(5);
PrefixSpanModel<String> model = prefixSpan.run(rdd);
for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) {
    System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq());
}
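The call prefixSpan.run(rdd) gives the error. Any idea why I am getting it? As far as I know, a List is a sequence.
Thanks
Answer 0 (score: 0)
The error is a bit misleading, but if you look at the source code of the PrefixSpan class, you will see that the run method parameter is documented as
@param data ordered sequences of itemsets stored as Java Iterable of Iterables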
So the prefixSpan.run method needs a JavaRDD<List<List<String>>>: each sequence is a list of itemsets, and each itemset is itself a list of items. Your code passes a JavaRDD<List<String>>, which is one level of nesting short.
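To illustrate the nesting that run expects, here is a minimal sketch in plain Java. It assumes that every string in the original inner lists is meant to be its own single-item itemset, which may or may not match what createLists produces. The class SequenceShape and the helper toItemsetSequences are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SequenceShape {

    // Hypothetical helper: wraps every item of each sequence in its own
    // single-item itemset, turning List<List<String>> into
    // List<List<List<String>>> (sequence -> itemsets -> items).
    public static List<List<List<String>>> toItemsetSequences(List<List<String>> flat) {
        List<List<List<String>>> result = new ArrayList<>();
        for (List<String> sequence : flat) {
            List<List<String>> itemsets = new ArrayList<>();
            for (String item : sequence) {
                itemsets.add(Collections.singletonList(item));
            }
            result.add(itemsets);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in data; in the question this would come from createLists(...).
        List<List<String>> flat = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));

        List<List<List<String>>> nested = toItemsetSequences(flat);
        System.out.println(nested); // prints [[[a], [b]], [[c]]]
    }
}
```

With Spark on the classpath, the nested list could then be passed along as in the question's code: JavaRDD<List<List<String>>> rdd = sc.parallelize(nested); followed by prefixSpan.run(rdd). The loop variable would also need to be declared as PrefixSpan.FreqSequence<String> rather than FreqSequence<Integer> to match PrefixSpanModel<String>.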