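I have a large number of strings in some text files and need to transform each string with this algorithm: convert the string to lower case and remove all spaces.
Can you give an example of a Hadoop MapReduce function that implements this algorithm?
Thanks.
Answer 0 (score: 0)
I tried the code below and got the output on a single line.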
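import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class toUpper {
    // Mapper: removes the whitespace from each input line and changes its case.
    public static class textMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text outvalue = new Text();

        @Override
        public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            StringBuilder br = new StringBuilder();
            // StringTokenizer splits on whitespace, so appending the tokens drops all spaces.
            StringTokenizer st = new StringTokenizer(values.toString());
            while (st.hasMoreTokens()) {
                // Use toLowerCase() here to get the lower-case output the question asks for.
                br.append(st.nextToken().toUpperCase());
            }
            outvalue.set(br.toString());
            context.write(NullWritable.get(), outvalue);
        }
    }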
    // Reducer: every record shares the NullWritable key, so a single reduce call
    // concatenates all mapper outputs into one output line.
    public static class textReduce extends Reducer<NullWritable, Text, NullWritable, Text> {
        private final Text outvale = new Text();

        @Override
        public void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            StringBuilder br = new StringBuilder();
            for (Text st : values) {
                br.append(st.toString());
            }
            outvale.set(br.toString());
            context.write(NullWritable.get(), outvale);
        }
    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Job.getInstance is the non-deprecated way to create a Job.
        Job job = Job.getInstance(conf, "touipprr");
        job.setJarByClass(toUpper.class);
        job.setMapperClass(textMapper.class);
        job.setReducerClass(textReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 when the job succeeds and 1 when it fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
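The per-line transformation from the question does not itself depend on Hadoop, so it can be tried out separately. The following is a minimal sketch in plain Java, assuming a hypothetical helper class `LineNormalizer` (not part of Hadoop or of the answer above) that lower-cases a line and removes every whitespace character:

```java
import java.util.Locale;

// Hypothetical helper for illustration only; it is not a Hadoop class.
final class LineNormalizer {
    private LineNormalizer() {}

    /** Removes every whitespace character from the line and lower-cases the rest. */
    static String normalize(String line) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (!Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        // Locale.ROOT keeps the result independent of the JVM's default locale.
        return out.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints "hellohadoopworld"
        System.out.println(normalize("  Hello \t Hadoop World "));
    }
}
```

A mapper could call such a helper on `values.toString()` and write the result directly. Whether a reducer is needed then depends only on whether the lines should be merged into a single output line.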
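Answer 1 (score: 0)
Back when I was experimenting with MapReduce, I had a similar thought: there must be some practice or technique for modifying every word in a record and doing all the cleanup.
If we recap the MapReduce algorithm as a whole, we have a map function that splits incoming records into tokens with the help of delimiters (you may know more about this than I do). Now, let's walk through the problem statement you gave.
Here is what I would have tried when I was new to MapReduce: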
> I will probably write a map() method which will split the lines for me
> I will possibly run out of options and write a reduce function
and somehow will be able to achieve my objective
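There is nothing wrong with that approach, but there is a better technique that helps you decide whether you need a reduce function at all. That leaves you free to concentrate on the actual objective and on optimizing the code.
For situations like the one in your problem statement, one class came to my rescue: ChainMapper
Now, how does ChainMapper work? It runs several mappers in sequence inside a single map task, with the output of each mapper becoming the input of the next. The classes involved are as follows.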
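As a rough illustration of the chaining idea (each record emitted by one stage is handed to the next stage, and a final step sums the counts), here is a plain-Java sketch. It uses no Hadoop classes; `ChainSketch` and its methods are hypothetical names that only mirror the roles of SplitMapper, LowerCaseMapper, and the reducer shown further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustration of the data flow only; none of these names are Hadoop APIs.
final class ChainSketch {
    private ChainSketch() {}

    // Stage 1 (role of SplitMapper): one line in, one record per token out.
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Stage 2 (role of LowerCaseMapper): lower-cases each record from stage 1.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Reduce step: adds 1 per occurrence of a token, like a word-count reducer.
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=2, map=2, reduce=1}
        System.out.println(count(lowerCase(split("Hadoop hadoop MAP map Reduce"))));
    }
}
```

In a real job the stages run record by record inside the map task rather than on whole lists; the lists here are only a convenience for showing the order of the steps.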
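SplitMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// First mapper in the chain: splits each line into tokens and emits (token, 1).
public class SplitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer xs = new StringTokenizer(value.toString());
        IntWritable dummyValue = new IntWritable(1);
        while (xs.hasMoreTokens()) {
            String content = xs.nextToken();
            context.write(new Text(content), dummyValue);
        }
    }
}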
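LowerCaseMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Second mapper in the chain: lower-cases each token emitted by SplitMapper.
public class LowerCaseMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    public void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        String val = key.toString().toLowerCase();
        Text newKey = new Text(val);
        // write() is an instance method, so it is called on the context parameter.
        context.write(newKey, value);
    }
}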
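Since I am performing a word count here, I need a reducer.
ChainMapReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted for each lower-cased token.
public class ChainMapReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}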
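To implement the ChainMapper concept successfully, you have to pay attention to every detail of the driver class.
DriverClass.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DriverClass extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Use the configuration that ToolRunner handed over via setConf().
        Configuration cf = getConf();
        Job j = Job.getInstance(cf);
        // configuration for the first mapper
        Configuration splitMapConfig = new Configuration(false);
        ChainMapper.addMapper(j, SplitMapper.class, LongWritable.class, Text.class, Text.class, IntWritable.class, splitMapConfig);
        // configuration for the second mapper
        Configuration lowerCaseConfig = new Configuration(false);
        ChainMapper.addMapper(j, LowerCaseMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, lowerCaseConfig);
        j.setJarByClass(DriverClass.class);
        // The same summing class serves as both combiner and reducer.
        j.setCombinerClass(ChainMapReducer.class);
        j.setReducerClass(ChainMapReducer.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        Path outputPath = new Path(args[1]);
        FileInputFormat.addInputPath(j, new Path(args[0]));
        FileOutputFormat.setOutputPath(j, outputPath);
        // Delete any previous output so the job does not fail on an existing directory.
        outputPath.getFileSystem(cf).delete(outputPath, true);
        return j.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new DriverClass(), args);
        System.exit(res);
    }
}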
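The driver class is easy to follow; the one call to look at closely is ChainMapper.addMapper(<job-object>, <Mapper-class>, <input key and value types>, <output key and value types>, <configuration-for-that-mapper>)
I hope the solution serves your purpose. Please let me know if you run into any problems while implementing it.
Thanks!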