I have a file containing multiple URLs. I want to read each URL and do some processing on it. Since the processing is independent for each URL, I would like to run it in parallel on Spark.
SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> textFile = context.textFile("urlFile");
/* Now, for each line of this textFile, I need to call the following: */
ExtractTrainingData ed = new ExtractTrainingData();
List<Elements> list = ed.getElementList(inputUrl);
ed.processElementList(inputUrl, list);
Can anyone suggest how I should do this?
Answer 0 (score: 1)
If each URL is on a separate line, then you can perform a foreach:
SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> textFile = context.textFile("urlFile");
textFile.foreach(new VoidFunction<String>() {
    public void call(String line) {
        // this code is executed in parallel for each line in the file
        ExtractTrainingData ed = new ExtractTrainingData();
        List<Elements> list = ed.getElementList(line);
        ed.processElementList(line, list);
    }
});
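For comparison, here is the same foreach written with Java 8 lambda syntax, a minimal sketch: getElementList and processElementList are the asker's own methods, and ExtractTrainingData must be serializable, as noted in the edit at the end.
textFile.foreach(line -> {
    // same body as above; VoidFunction<String> is a functional interface,
    // so a lambda can replace the anonymous class
    ExtractTrainingData ed = new ExtractTrainingData();
    List<Elements> list = ed.getElementList(line);
    ed.processElementList(line, list);
});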
If the resulting lists should also be parallelized, then:
SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> textFile = context.textFile("urlFile");
textFile.map(new Function<String, List<Elements>>() {
    public List<Elements> call(String line) {
        // this code is executed in parallel for each line in the file
        ExtractTrainingData ed = new ExtractTrainingData();
        List<Elements> list = ed.getElementList(line);
        return list;
    }
}).flatMap(list -> list.iterator())
.foreach((Elements element) -> {
    // here put the code that is in processElementList
});
I have used lambda syntax in places; you can use anonymous functions instead. Edit: make sure that ExtractTrainingData is Serializable, otherwise Spark will fail when shipping it to the executors.
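A minimal sketch of that requirement, assuming Elements comes from jsoup (the question does not say), with placeholder method bodies:
import java.io.Serializable;
import java.util.Collections;
import java.util.List;
import org.jsoup.select.Elements;

public class ExtractTrainingData implements Serializable {
    public List<Elements> getElementList(String url) {
        // fetch and parse the URL; the real implementation is not shown in the question
        return Collections.emptyList();
    }
    public void processElementList(String url, List<Elements> list) {
        // process the parsed elements; the real implementation is not shown in the question
    }
}
Implementing Serializable lets Spark serialize the closures that reference ExtractTrainingData and send them to the executor JVMs.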