Hadoop MapReduce output with a header

Time: 2018-07-02 20:18:01

Tags: java csv hadoop mapreduce

How can I output the header only once from my map/reduce job, as a CSV for a Hive import, instead of typing the column names manually?

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;

public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        try {
            // Parse the incoming record as an XML document
            InputStream is = new ByteArrayInputStream(value.toString().getBytes());
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(is);
            // ....

            doc.getDocumentElement().normalize();

            // ....... (iteration over the parsed elements elided; eElement and
            // externalLinks are defined in the elided code)
            //context.write(new Text("el_from \t Title \t External Link"), NullWritable.get());
            // ....
            String title = eElement.getElementsByTagName("title").item(0).getTextContent();
            String text = eElement.getElementsByTagName("text").item(0).getTextContent();
            String id = eElement.getElementsByTagName("id").item(0).getTextContent();
            for (int j = 0; j < externalLinks.length; j++) {
                // Extract the (optional scheme plus) host part of each external link
                Pattern prl = Pattern.compile("(http:\\/\\/www\\.|https:\\/\\/www\\.|http:\\/\\/|https:\\/\\/)?[a-z0-9]+([\\-\\.]{1}[a-z0-9]+)*\\.[a-z]{2,5}(:[0-9]{1,5})?");
                Matcher ml = prl.matcher(externalLinks[j]);
                if (ml.find()) {
                    context.write(new Text(id + "," + title + "," + ml.group(0)), NullWritable.get());
                }
            }
        } catch (Exception e) {
            // LogWriter.getInstance().WriteLog(e.getMessage());
        }
    }
}
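For context, the commented-out context.write of the header above runs once per input record, not once per file. Even moving it into the mapper's setup() (a hypothetical variant, not in the original post) only gets you one header per map task, and therefore one per output file, rather than a single header for the whole job:

    // Hypothetical variant: emit the header once per mapper.
    // This still produces one header line per map task / output file,
    // not a single header for the entire job output.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        context.write(new Text("id,title,link"), NullWritable.get());
    }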

The result I get looks like this:

3,agricoltura,http://www.treccani.it

3,agricoltura,http://www.wwf.it/client/render.aspx

The result I want has a header line, like this:

id,title,link

3,agricoltura,http://www.treccani.it

3,agricoltura,http://www.wwf.it/client/render.aspx

1 answer:

Answer 0 (score: 0)

You should build a Hive table over the text files; the "header" is then defined by the Hive schema rather than stored as just another row inside the table. More importantly, MapReduce gives you no guarantee that your header would end up as the first line of the file anyway.

CREATE EXTERNAL TABLE x ( 
  id INT, title STRING, link STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://mapred/outputDir';

From there, you can write a Hive query to output a separate CSV file whenever you need one.
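One common way to get a header row onto such an export (my own suggestion, not spelled out in the original answer) is to run the query through the hive CLI with hive.cli.print.header enabled and convert its tab-separated output to commas:

    # Assumes the hive CLI is available; the header row comes from hive.cli.print.header.
    hive -e "SET hive.cli.print.header=true; SELECT id, title, link FROM x" \
        | tr '\t' ',' > links.csv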


I believe Spark can also read the XML, parse it, and write the CSV with a header, which may suit your use case better.
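For example, with the spark-xml package this might look like the sketch below. Assumptions on my part: the com.databricks:spark-xml package is on the classpath, the rows of your dump are <page> elements (rowTag), and the external-link extraction from the text field is left out.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class XmlToCsv {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("xml-to-csv").getOrCreate();

            // Read one DataFrame row per <page> element (rowTag is an assumption)
            Dataset<Row> pages = spark.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", "page")
                .load("hdfs:///path/to/dump.xml");

            // The link extraction is omitted; this just shows that the CSV
            // writer emits the header row for you.
            pages.select("id", "title")
                .write()
                .option("header", "true")
                .csv("hdfs:///path/to/outputWithHeader");
        }
    }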