Hi, I have code written with Hadoop and now I am trying to migrate it to Spark. The mappers and reducers are fairly complex, so I tried to reuse the Mapper and Reducer classes of the already existing Hadoop code in my Spark program. Can somebody tell me how I can achieve this?
EDIT
So far, I have been able to reuse the mapper class of the standard Hadoop word count example in Spark, implemented as below.
wordcount.java
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import java.io.*;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public final class wordcount extends Configured implements Serializable {

  public static int main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setMaster("spark://IMPETUS-I0203:7077").setAppName("wordcount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf); // created Spark context

    JavaRDD<String> rec = ctx.textFile("hdfs://localhost:54310/input/words.txt"); // record reader

    // creating a pair RDD whose key = some arbitrary number, value = a record
    JavaPairRDD<LongWritable, Text> lines = rec.mapToPair(s -> new Tuple2<LongWritable, Text>(new LongWritable(s.length()), new Text(s)));

    // transforming the 'lines' RDD into one that returns, for example, ('word', 1) tuples
    JavaPairRDD<Text, IntWritable> ones = lines.flatMapToPair(it -> {
      NotSerializableException notSerializable = new NotSerializableException();

      JobConf conf = new JobConf(new Configuration(), wordcount.class);
      conf.setJobName("WordCount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      Path inp = new Path("hdfs://localhost:54310/input/darcy.txt");
      Path out = new Path("hdfs://localhost:54310/output/41");
      FileInputFormat.addInputPath(conf, inp);
      FileOutputFormat.setOutputPath(conf, out);

      WordCountMapper mapper = new WordCountMapper();
      mapper.configure(conf);
      outputcollector<Text, IntWritable> output = new outputcollector<Text, IntWritable>();
      mapper.map(it._1, it._2, output, Reporter.NULL);
      return output.getList();
    });

    ones.saveAsTextFile("hdfs://localhost:54310/output/41");
    return 0;
  }
}
WordCountMapper.java
import java.io.*;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import java.io.Serializable;
public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>, Serializable {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    NotSerializableException notSerializable = new NotSerializableException();
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word = new Text();
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
outputcollector.java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapred.*;
import scala.Tuple2;
public class outputcollector<K extends Object, V extends Object> implements OutputCollector<K, V> {

  private List<Tuple2<K, V>> writer = new ArrayList<Tuple2<K, V>>();

  @Override
  public void collect(K key, V value) {
    try {
      writer.add(new Tuple2<K, V>(key, value));
    } catch (Exception e) {
      System.out.println(e + "\n\n****output collector error\n\n");
    }
  }

  public List<Tuple2<K, V>> getList() {
    return writer;
  }
}
This code works and I can submit this Spark job successfully. However, compared to a pure Spark program it is highly inefficient: it takes roughly 50 times longer than the simple Spark word count example. The input file is 1 GB and sits on HDFS, and I am running Spark in standalone mode.
I cannot figure out why this code is so sluggish. Here I use WordCountMapper.java simply to collect (word, 1) pairs, and that also works in memory. So I don't see why my code has to be so much slower than the standard Spark word count example.
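For reference, the pure Spark word count I am comparing against is basically the standard JavaWordCount example. Below is a minimal sketch of it (assuming Spark 1.x with Java 8 lambdas, where flatMap returns an Iterable; the class name and the output path are just placeholders):

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;

public final class PlainWordCount {
  public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("PlainWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    JavaRDD<String> lines = ctx.textFile("hdfs://localhost:54310/input/words.txt");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")))  // split each line into words
        .mapToPair(word -> new Tuple2<>(word, 1))          // emit (word, 1)
        .reduceByKey((a, b) -> a + b);                      // sum the counts per word

    counts.saveAsTextFile("hdfs://localhost:54310/output/plain"); // placeholder output path
    ctx.stop();
  }
}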
So, can anyone suggest a better way of reusing WordCountMapper.java (the Hadoop mapper) in Spark, or explain why it is so slow? Or anything else that helps me achieve my end goal (mentioned at the top of my question)?
Answer 0 (score: 0)
The basic way to convert MapReduce code to Spark is:
rdd.mapPartitions { partition =>
  setup()                        // map-side setup, run once per partition
  partition.map { item =>
    val output = process(item)   // the old map() logic
    if (!partition.hasNext) {
      // some cleanup code here, run after the last record of the partition
    }
    output
  }
}.groupByKey()
 .mapPartitions( /* similarly for the reduce code */ )
 .saveAsHadoopFile( /* params */ )   // to save on HDFS
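Applied to the code in your question, a rough Java sketch of this pattern could look like the snippet below (untested). It would replace the flatMapToPair block inside your main(), reuses your lines RDD, WordCountMapper and outputcollector classes plus the same imports as your wordcount.java, and assumes Spark 1.x, where the flat-map style functions return an Iterable. The point is that the JobConf, the mapper and the collector are created once per partition instead of once per record:

JavaPairRDD<Text, IntWritable> ones = lines.mapPartitionsToPair(records -> {
  // per-partition setup: build the JobConf and the mapper once, not per record
  JobConf conf = new JobConf(new Configuration(), wordcount.class);
  conf.setJobName("WordCount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  WordCountMapper mapper = new WordCountMapper();
  mapper.configure(conf);
  outputcollector<Text, IntWritable> output = new outputcollector<Text, IntWritable>();

  // feed every record of this partition through the old map() logic
  while (records.hasNext()) {
    Tuple2<LongWritable, Text> record = records.next();
    mapper.map(record._1, record._2, output, Reporter.NULL);
  }

  // per-partition cleanup
  mapper.close();

  return output.getList(); // a List of (word, 1) tuples, i.e. an Iterable
});
ones.saveAsTextFile("hdfs://localhost:54310/output/41");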
The link below points to two articles on Cloudera. Not everything is covered there, but if you go through them carefully you will get the gist of how to convert certain parts of a Hadoop job to Spark, for example how to do setup and cleanup.
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Note: I have tried converting MapReduce code to Spark myself and it resulted in a slower application. Maybe that was my own inefficiency in using Scala, or maybe Spark is just not well suited for batch jobs. So be aware of this as well.