Hadoop program (Java) to read a comma-separated input file

Date: 2016-03-01 14:11:30

Tags: java csv hadoop

My input file looks like this:

000001928162247ffaf63185cd8b2a244c78e7c6,2009324,abcat0101001,Sharp,"2011-09-05 12:25:37.42","2011-09-05 12:25:01.187"
0001be1731ee7d1c519bc7e87110c9eb880cb396,1649294,abcat0715001,"Gunnar eyewear","2011-09-23 17:13:36.175","2011-09-23 17:12:18.389"
0001bfa0c494c01f9f8c141c476c11bb4625a746,17240521,cat02015,refrigerator,"2011-10-19 23:43:51.71","2011-10-19 23:43:06.485"
0001fb09f03fea4d04e2267ed3194c806839d997,1271997,abcat0513004,Razer,"2011-09-07 09:20:07.11","2011-09-07 09:19:03.279"
0002965b083b6e508f7740c47c8f39e1072b4219,3562379,pcmcat209400050001,"I phone 4","2011-10-27 14:10:31.92","2011-10-27 14:09:33.327"
0002bb28a9ca07f5515b01996fd5d7ca84742e41,3230638,pcmcat177200050009,"hd antenna","2011-10-20 00:03:49.966","2011-10-20 00:02:01.458"
0002bd9c3d654698bb514194c4f4171ad6992266,9947181,pcmcat253300050012,printer,"2011-10-06 19:51:40.984","2011-10-06 19:47:13.803"
0002fee45e1c32eb94e82fc6c15c4db14e796248,3519969,pcmcat247400050000,vaio,"2011-10-19 23:31:51.015","2011-10-19 23:31:12.213"
00042033d355973baf9454b021a15c6b5b48f4a3,2677297,pcmcat212600050008,"desk top","2011-08-29 12:03:38.265","2011-08-29 12:03:12.348"
000433e0ef411c2cb8ee1727002d6ba15fe9426b,8959317,cat02015,"how i met your mother","2011-09-17 19:44:40.129","2011-09-17 19:43:37.564"

Each line contains the following fields:

user_id,product_id,category,query,click_time,query_time

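I want to read this file in Hadoop and extract user_id and category (the 1st and 3rd fields). I have a basic Hadoop program, shown below, that I used for word count. In this task the items on each line are comma-separated, and I have to store them in an ArrayList.

Here is my starter program: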

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class popularcats extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // Job.getInstance replaces the deprecated new Job(conf) constructor.
    Job job = Job.getInstance(getConf());
    job.setJarByClass(getClass());
    // Word-count pieces: TokenCounterMapper tokenizes on whitespace,
    // so it does not split these lines on commas.
    job.setMapperClass(TokenCounterMapper.class);
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // args[0] is the input path, args[1] is the output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new popularcats(), args);
    System.exit(exitCode);
  }

}

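I assume Hadoop must have some class for reading CSV files. I found this class, CSVLineRecordReader, at the following address: https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java

So how can I process this file and extract the required fields?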
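For illustration, the per-line work described in the question (split on commas, keep the fields in an ArrayList, take the 1st and 3rd) could be sketched in plain Java as follows. This is only a minimal sketch under assumptions: the class name CsvFieldExtractor and the methods parseLine and userAndCategory are hypothetical, not part of Hadoop or of the linked CSVInputFormat project, and it assumes each record fits on a single line, with double quotes used only to wrap a field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Hadoop class: parses one CSV line at a time.
public class CsvFieldExtractor {

    // Splits one line into fields; commas inside double quotes stay in the field.
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not data
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }

    // Returns {user_id, category} (fields 1 and 3), or null for a short line.
    public static String[] userAndCategory(String line) {
        List<String> fields = parseLine(line);
        if (fields.size() < 3) {
            return null;
        }
        return new String[] { fields.get(0), fields.get(2) };
    }

    public static void main(String[] args) {
        String line = "0001be1731ee7d1c519bc7e87110c9eb880cb396,1649294,abcat0715001,"
                + "\"Gunnar eyewear\",\"2011-09-23 17:13:36.175\",\"2011-09-23 17:12:18.389\"";
        String[] pair = userAndCategory(line);
        System.out.println(pair[0] + "\t" + pair[1]);
    }
}
```

In a job like the one above, such a helper would presumably be called from the map() method of a custom Mapper<LongWritable, Text, Text, IntWritable> on value.toString(), with that mapper set in place of TokenCounterMapper via job.setMapperClass. With the default text input format each map() call receives one line, so a dedicated CSV record reader may only be needed if quoted fields can contain line breaks.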

0 answers:

No answers yet.