Question

My data is stored in a CSV file. I want to read that CSV file from HDFS.

Can anyone help me with this?

I'm new to Hadoop. Thanks in advance.

Answer 1

The classes you need for this are FileSystem, FSDataInputStream, and Path. The client should look something like this:

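    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadClient {
        public static void main(String[] args) throws IOException {
            // Load the cluster settings so FileSystem.get() resolves to HDFS.
            Configuration conf = new Configuration();
            conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
            conf.addResource(new Path("/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
            FileSystem fs = FileSystem.get(conf);
            // Open the file and make sure the stream is closed afterwards.
            try (FSDataInputStream inputStream = fs.open(new Path("/path/to/input/file"))) {
                // readChar() consumes two bytes and returns them as a single char.
                System.out.println(inputStream.readChar());
            }
        }
    }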

FSDataInputStream has several read methods. Choose the one that suits your needs.
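Since a CSV file is text, reading it line by line is usually more convenient than calling readChar(). FSDataInputStream extends java.io.DataInputStream, so it can be wrapped in a BufferedReader like any other InputStream. Here is a minimal sketch, assuming the file is UTF-8 encoded. The class and method names (CsvLineReader, readLines) are made up for illustration and are not part of Hadoop; the helper takes a plain InputStream so it can be tried without a cluster.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Hadoop.
final class CsvLineReader {
    // Reads every line from the stream, assuming UTF-8 text.
    // The caller still owns the stream and is responsible for closing it.
    static List<String> readLines(InputStream in) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines().collect(Collectors.toList());
    }
}
```

With the stream returned by fs.open(...), this would be called as CsvLineReader.readLines(inputStream). Collecting all lines into a list is only reasonable for small files; for large ones, iterate over reader.lines() and process each line as it arrives.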

If you are using MapReduce (MR), it is even easier:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Your_Wish is a placeholder: substitute the output key/value types you need.
    public static class YourMapper extends
            Mapper<LongWritable, Text, Your_Wish, Your_Wish> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The framework does the reading for you.
            String line = value.toString(); // one line of your CSV file
            // Do your processing here.
            // ...
            context.write(Your_Wish, Your_Wish);
        }
    }

Answer 2

If you want to use MapReduce, you can use TextInputFormat to read the file line by line and parse each line in the mapper's map function.
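Parsing a line inside map() could look roughly like the sketch below. It is an illustration under stated assumptions, not a full CSV implementation: it handles double-quoted fields and doubled quotes inside them, but not fields that span several lines (which TextInputFormat would split anyway). The class name CsvLineParser is hypothetical and not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
final class CsvLineParser {
    // Splits one CSV line into fields, honoring double-quoted fields.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote is an escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);   // commas are literal inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // field separator
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field on the line
        return fields;
    }
}
```

In a mapper this would be called as CsvLineParser.parse(value.toString()). If the data never contains quoted fields, line.split(",", -1) is a simpler alternative.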

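The other option is to develop (or find) a CSV input format that reads the data from the file.

There is an old tutorial here: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html. It is dated, but the logic is the same in newer versions.

If you are reading the data from the file in a single process, it is the same as reading a file from any other file system. There is a good example here: https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-read-a-file-from-hdfs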

HTH

How to read a CSV file from HDFS?