I am trying to write the following code in Hadoop MapReduce. I have a log file that contains IP addresses and the URLs opened by the corresponding IP. It looks like this:
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
Now, I need to organize the results of this file in such a way that it lists the distinct IP addresses and URLs, followed by the number of times that particular URL was opened by that IP.
For example, if 192.168.72.224 opened www.yahoo.com 15 times over the whole log file, then the output must contain:
192.168.72.224 www.yahoo.com 15
This should be done for every IP in the file, and the final output should look like this:
192.168.72.224 www.yahoo.com 15
www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
www.gmail.com 19
....
...
..
.
The code I have tried is:
public class WordCountMapper extends MapReduceBase implements Mapper&lt;LongWritable, Text, Text, IntWritable&gt;
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector&lt;Text, IntWritable&gt; output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
I know this code is seriously flawed; please suggest how I should proceed.
Thanks.
Answer 0 (score: 1)
I would propose this design:
Implementing it requires you to write a custom Writable to handle the (IP, URL) pair.
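To sketch what such a custom pair Writable could look like: the class below (the name IpUrlPair is made up) has the `write`/`readFields` shape that `org.apache.hadoop.io.WritableComparable` requires, but uses only `java.io` so the sketch compiles and round-trips without a Hadoop dependency. In a real job you would declare `implements WritableComparable<IpUrlPair>` instead of `Comparable`.

```java
import java.io.*;

// Hypothetical composite key for an (IP, URL) pair. In a real Hadoop job
// this would implement org.apache.hadoop.io.WritableComparable<IpUrlPair>;
// the write/readFields methods below already match that interface's shape.
class IpUrlPair implements Comparable<IpUrlPair> {
    private String ip = "";
    private String url = "";

    IpUrlPair() {}                          // Hadoop needs a no-arg constructor

    IpUrlPair(String ip, String url) {
        this.ip = ip;
        this.url = url;
    }

    // Serialize both fields, mirroring Writable.write(DataOutput).
    public void write(DataOutput out) throws IOException {
        out.writeUTF(ip);
        out.writeUTF(url);
    }

    // Deserialize in the same field order, mirroring Writable.readFields(DataInput).
    public void readFields(DataInput in) throws IOException {
        ip = in.readUTF();
        url = in.readUTF();
    }

    // Sort by IP first, then URL, so all URLs of one IP arrive together.
    public int compareTo(IpUrlPair other) {
        int c = ip.compareTo(other.ip);
        return c != 0 ? c : url.compareTo(other.url);
    }

    @Override
    public String toString() {
        return ip + "\t" + url;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip one pair through the same byte-level protocol Hadoop uses.
        IpUrlPair original = new IpUrlPair("192.168.72.224", "www.yahoo.com");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        IpUrlPair copy = new IpUrlPair();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy);
    }
}
```

With such a key, the mapper emits (IpUrlPair, 1) and a plain sum reducer produces the per-(IP, URL) counts directly.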
Personally, I would do this with Spark, unless you are very concerned about performance. With PySpark it is as simple as this:
rdd = sc.textFile('/sparkdemo/log.txt')
counts = rdd.map(lambda line: line.split()) \
            .map(lambda parts: ((parts[0], parts[1]), 1)) \
            .reduceByKey(lambda x, y: x + y)
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))) \
               .groupByKey() \
               .collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print('  website: %s count: %d' % (url, cnt))
The output for your example would be:
IP: 192.168.72.224
  website: www.facebook.com count: 2
  website: www.m4maths.com count: 2
  website: www.google.com count: 5
  website: www.gmail.com count: 4
  website: www.indiabix.com count: 8
  website: www.yahoo.com count: 3
IP: 192.168.72.177
  website: www.yahoo.com count: 14
  website: www.google.com count: 3
  website: www.facebook.com count: 3
  website: www.m4maths.com count: 3
  website: www.indiabix.com count: 1
IP: 192.168.198.92
  website: www.facebook.com count: 4
  website: www.m4maths.com count: 3
  website: www.yahoo.com count: 3
  website: www.askubuntu.com count: 2
  website: www.indiabix.com count: 1
  website: www.google.com count: 5
  website: www.gmail.com count: 1
Answer 1 (score: 1)
I have written the same logic in Java:
public class UrlHitMapper extends Mapper&lt;Object, Text, Text, Text&gt; {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        // nextToken() is called twice, so check that two tokens actually
        // exist; checking hasMoreTokens() once is not enough and would
        // throw NoSuchElementException on a malformed line.
        if (st.countTokens() >= 2)
            context.write(new Text(st.nextToken()), new Text(st.nextToken()));
    }
}
public class UrlHitReducer extends Reducer&lt;Text, Text, Text, Text&gt; {

    public void reduce(Text key, Iterable&lt;Text&gt; values, Context context)
            throws IOException, InterruptedException {
        HashMap&lt;String, Integer&gt; urlCount = new HashMap&lt;&gt;();

        // Tally how many times each URL appears for this IP.
        for (Text value : values) {
            String url = value.toString();
            urlCount.put(url, urlCount.getOrDefault(url, 0) + 1);
        }

        for (Entry&lt;String, Integer&gt; e : urlCount.entrySet())
            context.write(key, new Text(e.getKey() + " " + e.getValue()));
    }
}
public class UrlHitCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new UrlHitCount(), args);
    }

    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJobName("url-hit-count");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(UrlHitMapper.class);
        job.setReducerClass(UrlHitReducer.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("input/urls"));
        FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));

        job.setJarByClass(UrlHitCount.class);   // was WordCount.class, which is not part of this job

        // Wait for completion and report success/failure, instead of
        // submitting asynchronously and always returning 1.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
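If you want to sanity-check the job's counting logic without a cluster, the same (IP, URL) tally can be written in plain Java. This is a minimal, Hadoop-free sketch; the class name UrlHitCountLocal is made up:

```java
import java.util.*;
import java.util.stream.*;

// Local sanity check: counts occurrences of each (IP, URL) pair,
// which is the same result the mapper/reducer pair above computes
// at cluster scale.
class UrlHitCountLocal {

    static Map<String, Long> countPairs(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> line.split("\\s+"))
                .filter(parts -> parts.length >= 2)          // skip malformed lines
                .collect(Collectors.groupingBy(
                        parts -> parts[0] + " " + parts[1],  // composite "ip url" key
                        TreeMap::new,                        // sorted output, for readability
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "192.168.72.224 www.m4maths.com",
                "192.168.72.177 www.yahoo.com",
                "192.168.72.177 www.yahoo.com");
        countPairs(sample).forEach((pair, n) -> System.out.println(pair + " " + n));
        // prints:
        // 192.168.72.177 www.yahoo.com 2
        // 192.168.72.224 www.m4maths.com 1
    }
}
```

Running this over the full log file should match the reducer's output line for line, which makes regressions in the MapReduce code easy to spot.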