How can I make my map/reduce job output a header only once, as a CSV for Hive import, instead of typing the column names in by hand?
public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
try {
InputStream is = new ByteArrayInputStream(value.toString().getBytes());
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
//....
doc.getDocumentElement().normalize();
// .......
//context.write(new Text("el_from \t Title \t External Link"), NullWritable.get());
// ....
String title = eElement.getElementsByTagName("title").item(0).getTextContent();
text = eElement.getElementsByTagName("text").item(0).getTextContent();
String id = eElement.getElementsByTagName("id").item(0).getTextContent();
for(int j = 0; j < externalLinks.length; j++)
{
Pattern prl = Pattern.compile("(http:\\/\\/www\\.|https:\\/\\/www\\.|http:\\/\\/|https:\\/\\/)?[a-z0-9]+([\\-\\.]{1}[a-z0-9]+)*\\.[a-z]{2,5}(:[0-9]{1,5})?");
Matcher ml = prl.matcher(externalLinks[j]);
if(ml.find()) {
MatchResult mlr = ml.toMatchResult();
context.write(new Text(id+","+title + ","+ mlr.group(0)), NullWritable.get());
}
}
}
}
} catch (Exception e) {
// LogWriter.getInstance().WriteLog(e.getMessage());
}
}
}
The result I get looks like this:
3,agricoltura,http://www.treccani.it
3,agricoltura,http://www.wwf.it/client/render.aspx
The result I want has a header row, like this:
id,title,link
3,agricoltura,http://www.treccani.it
3,agricoltura,http://www.wwf.it/client/render.aspx
Answer 0 (score: 0)
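You should build a Hive table over the text files. That way the "header" is defined in the Hive schema instead of showing up as just another arbitrary row in the Hive table. More importantly, MapReduce does not guarantee that your header would be the first line of the file.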
CREATE EXTERNAL TABLE x (
id INT, title STRING, link STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://mapred/outputDir';
From that table, you can write a Hive query to output to a separate CSV file whenever you need one.
I believe Spark can read XML, parse it, and write CSV with a header, which may suit your use case better.
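If you really do need a single CSV file with a header row, one option is to add the header after the job has finished rather than from inside the mapper. Below is a minimal sketch of that idea in plain Java. It assumes the job's output directory has already been copied to the local filesystem and that the output files use the usual part-* naming. The class HeaderMerge and the method mergeWithHeader are hypothetical names made up for this illustration; they are not part of Hadoop or Hive.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not a Hadoop or Hive API.
final class HeaderMerge {

    // Writes the header once, then appends the lines of every part-* file
    // found in dir, in file-name order. Other files (e.g. _SUCCESS) are skipped.
    static void mergeWithHeader(Path dir, Path out, String header) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(dir)) {
            parts = files
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            writer.write(header);
            writer.newLine();
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java HeaderMerge <local copy of output dir> <merged csv file>
        mergeWithHeader(Paths.get(args[0]), Paths.get(args[1]), "id,title,link");
    }
}
```

Note that a file produced this way should not be placed back under the Hive table's LOCATION as-is, since the header line would then be read as a data row.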