This is my first time playing with Hadoop; I wrote a MapReduce job that takes a log file and boils it down. However, the output is strange.
It looks like this:
2f09 3133 3134 3838 0a2f 0009 3137 0a2f
0000 1908 efbf bd7b efbf bd44 11ef bfbd
efbf bd2a efbf bdef bfbd 301b 79ef bfbd
5bef bfbd d290 efbf bdef bfbd 5349 efbf
bd5c efbf bd24 32ef bfbd 7bef bfbd 58ef
bfbd efbf bd16 efbf bdef bfbd 20ef bfbd
52ef bfbd 1fd7 ac1b efbf bd21 672b df86
3603 031a 54ef bfbd efbf bd09 310a 2f00
002b efbf bd53 53ef bfbd 2bef bfbd efbf
bd63 6125 efbf bdef bfbd 3c17 024e 4eef
bfbd efbf bd1d 7e72 efbf bd18 efbf bd4b
2332 efbf bdef bfbd 04ef bfbd 1d19 efbf
bd67 5a33 3270 7bef bfbd 75ef bfbd 6def
bfbd 0931 0a2f 0000 46ef bfbd ddb5 efbf
bd4d 62ef bfbd 7751 2048 efbf bdef bfbd
14ef bfbd efbf bdef bfbd 5463 efbf bdef
bfbd 5f12 efbf bdef bfbd 77ef bfbd 5fef
bfbd efbf bdef bfbd 32ef bfbd dd88 efbf
bdd8 b309 310a 2f00 0072 ccbd 0931 0a2f
0000 7457 efbf bdef bfbd 1632 efbf bdef
bfbd 21ef bfbd efbf bdef bfbd 563d 66ef
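(If it helps, the dump above is just the raw bytes of the job's output file shown as hex; a throwaway snippet along these lines produces the same kind of view. The file name is only a placeholder for my actual output file.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexPeek {
    public static void main(String[] args) throws IOException {
        // Dump the output as space-separated 16-bit hex words, 16 bytes per line.
        // "part-r-00000" is a placeholder for the real output file path.
        byte[] bytes = Files.readAllBytes(Paths.get("part-r-00000"));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            sb.append(String.format("%02x", bytes[i] & 0xff));
            if (i % 2 == 1) sb.append(' ');    // group bytes into pairs
            if (i % 16 == 15) sb.append('\n'); // 16 bytes per line
        }
        System.out.println(sb);
    }
}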
When I first tried it on a much smaller file, the output came out in a perfectly readable format, so I'm not entirely sure what the problem is... Did the encoding get changed at some point while the file was being MapReduced?
I honestly have no idea, so I'd appreciate help figuring out what went wrong, and what I can do to fix it or to keep it from happening again.
Thanks!
EDIT: code added
Thankfully it's a nice short one, since it's my first... I've stripped out the imports and such to try to keep it shorter.
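(For reference, the imports I took out are just the standard Hadoop and commons-logging ones, roughly:)

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;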
public class WCSLogParse {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private static final Log LOG = LogFactory.getLog(WCSLogParse.class);

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            line = splitToUrl(line);
            LOG.info("Line is " + line);
            if (line.contains(".")) {
                //do nothing
                LOG.info("Skipping line");
            } else {
                int lastSlash = line.lastIndexOf("/");
                line = line.substring(lastSlash);
                LOG.info("Command is " + line);
                context.write(new Text(line), one);
            }
        }

        private String splitToUrl(String line) {
            int subBegin = line.indexOf('/');
            int subEnd = line.indexOf(',', subBegin);
            if (subBegin == -1 || subEnd == -1) {
                return ".";
            }
            String url = line.substring(subBegin, subEnd);
            //handles if it is from a CSV field
            if (url.endsWith("\"")) {
                url = url.substring(0, (url.length() - 1));
            }
            return url;
        }

        private String getUrl(String line) {
            String[] cols = line.split(",");
            return cols[7];
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println("the args are " + args.toString());

        Job job = new Job(conf, "WCSLogParse");
        job.setJarByClass(WCSLogParse.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
I'm launching the job in Eclipse with the following arguments:
"/Volumes/Secondary Documents/logs/" "/Volumes/Secondary Documents/logs/output"