I am trying to ingest text data from a local directory into HDFS, and I need to convert the text to valid JSON before ingesting it. For this I am using the JavaScript Evaluator processor.
In the JavaScript Evaluator I am unable to read any records.
Here is my sample code:
for (var i = 0; i < records.length; i++) {
  try {
    output.write(records[i]);
  } catch (e) {
    error.write(records[i], e);
  }
}
Is there a better alternative to the JavaScript Evaluator?
Here is my sample input data:
{
1046=
1047=
1048=5324800
1049=20180508194648
1095=2297093400,
1111=up_default
1118=01414011002101251
1139=1
}
{
1140=1
1176=mdlhggsn01_1.mpt.com;3734773893;2472;58907
1183=4
1211=07486390
1214=0
1227=51200
1228=111
1229=0
1250=614400,
}
Update
Following @metadaddy's answer, I tried using Groovy instead of JavaScript. For the same data shown in @metadaddy's answer, I got the following exception.
Answer 0 (score: 1)
Your JavaScript needs to read through the input, building up output records.
Using the Text data format, the Directory origin will create a record with a /text field for each line of input.
This JavaScript will build the record structure you need:
for (var i = 0; i < records.length; i++) {
  try {
    // Start of new input record
    if (records[i].value.text.trim() === '{') {
      // Use starting input record as output record
      // Save in state so it persists across batches
      state.outRecord = records[i];
      // Clean out the value
      state.outRecord.value = {};
      // Move to next line
      i++;
      // Read values to end of input record
      while (i < records.length && records[i].value.text.trim() !== '}') {
        // Split the input line on '='
        var kv = records[i].value.text.trim().split('=');
        // Check that there is something after the '='
        if (kv.length > 1 && kv[1].length > 0) {
          state.outRecord.value[kv[0]] = kv[1];
        } else if (kv[0].length > 0) {
          state.outRecord.value[kv[0]] = NULL_STRING;
        }
        // Move to next line of input
        i++;
      }
      // Did we hit the '}' before the end of the batch?
      if (i < records.length) {
        // Write record to processor output
        output.write(state.outRecord);
        log.debug('Wrote a record with {} fields',
            Object.keys(state.outRecord.value).length);
        state.outRecord = null;
      }
    }
  } catch (e) {
    // Send record to error
    log.error('Error in script: {}', e);
    error.write(records[i], e);
  }
}
Here is a preview of the sample input data being transformed:
Now, to write the entire record to HDFS as JSON, simply set the data format in the Hadoop FS destination to JSON.
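Outside of Data Collector, the same parsing idea can be exercised as a plain function, which makes it easy to unit test. This is a standalone sketch, not the Data Collector API: the `parseBlocks` name is illustrative, the input is an array of text lines standing in for the records' /text fields, and it splits on the first '=' only (so values containing '=' survive) and uses JavaScript `null` in place of Data Collector's NULL_STRING constant.

```javascript
// Standalone sketch: convert brace-delimited key=value blocks
// into an array of plain objects. Runnable with Node.js.
function parseBlocks(lines) {
  const out = [];
  let current = null;            // object being built, or null between blocks
  for (const raw of lines) {
    const line = raw.trim();
    if (line === '{') {
      current = {};              // start of a new record
    } else if (line === '}') {
      if (current !== null) {
        out.push(current);       // record complete
        current = null;
      }
    } else if (current !== null) {
      const idx = line.indexOf('=');
      if (idx > 0) {
        const key = line.slice(0, idx);
        const value = line.slice(idx + 1);
        // Empty value after '=' becomes null, matching the desired JSON
        current[key] = value.length > 0 ? value : null;
      }
    }
  }
  return out;
}
```

For example, `parseBlocks(['{', '1046=', '1048=5324800', '}'])` yields one object with "1048" mapped to "5324800" and "1046" mapped to null, which can then be serialized with `JSON.stringify`.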
Answer 1 (score: 1)
Groovy scripts execute much faster than JavaScript in StreamSets Data Collector, so here is the same solution in Groovy.
Using the Text data format, the Directory origin will create a record with a /text field for each line of input.
This script will build the record structure you need:
for (i = 0; i < records.size(); i++) {
  try {
    // Start of new input record
    if (records[i].value['text'].trim() == "{") {
      // Use starting input record as output record
      // Save in state so it persists across batches
      state['outRecord'] = records[i]
      // Clean out the value
      state['outRecord'].value = [:]
      // Move to next line
      i++
      // Read values to end of input record
      while (i < records.size() && records[i].value['text'].trim() != "}") {
        // Split the input line on '='
        def kv = records[i].value['text'].trim().split('=')
        // Check that there is something after the '='
        if (kv.length == 2) {
          state['outRecord'].value[kv[0]] = kv[1]
        } else if (kv[0].length() > 0) {
          state['outRecord'].value[kv[0]] = NULL_STRING
        }
        // Move to next line of input
        i++
      }
      // Did we hit the '}' before the end of the batch?
      if (i < records.size()) {
        // Write record to processor output
        output.write(state['outRecord'])
        log.debug('Wrote a record with {} fields',
            state['outRecord'].value.size())
        state['outRecord'] = null
      }
    }
  } catch (e) {
    // Write a record to the error pipeline
    log.error(e.toString(), e)
    error.write(records[i], e.toString())
  }
}
Running on the input data:
{
1=959450992837
2=95973085229
3=1525785953
4=29
7=2
8=
9=
16=abd
20=def
21=ghi;jkl
22=a@b.com
23=1525785953
40=95973085229
41=959450992837
42=0
43=0
44=0
45=0
74=1
96=1
98=4
99=3
}
Gives the output:
{
"1": "959450992837",
"2": "95973085229",
"3": "1525785953",
"4": "29",
"7": "2",
"8": null,
"9": null,
"16": "abd",
"20": "def",
"21": "ghi;jkl",
"22": "a@b.com",
"23": "1525785953",
"40": "95973085229",
"41": "959450992837",
"42": "0",
"43": "0",
"44": "0",
"45": "0",
"74": "1",
"96": "1",
"98": "4",
"99": "3"
}