hdfs-sink: how to get rid of the timestamp that Flume adds to each event in HDFS files

Date: 2017-09-13 04:43:22

Tags: flume-ng

I have several files, each line containing JSON:

[root@ip-172-29-1-12 vp_flume]# more vp_170801.txt.finished | awk '{printf("%s\n", substr($0,0,20))}'
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp

My Flume configuration:

[root@ip-172-29-1-12 flume]# cat flume_test.conf 
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink

agent.sources.seqGenSrc.type = spooldir
agent.sources.seqGenSrc.spoolDir = /moveitdata/dong/vp_flume
agent.sources.seqGenSrc.deserializer.maxLineLength = 10000000
agent.sources.seqGenSrc.fileSuffix = .finished
agent.sources.seqGenSrc.deletePolicy = never

agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.channel = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

agent.sinks.loggerSink.type = hdfs
agent.sinks.loggerSink.hdfs.path = /home/dong/vp_flume

agent.sinks.loggerSink.hdfs.writeFormat = Text
agent.sinks.loggerSink.hdfs.rollInterval = 0
agent.sinks.loggerSink.hdfs.rollSize = 1000000000
agent.sinks.loggerSink.hdfs.rollCount = 0

The files in HDFS look like this:

[root@ip-172-29-1-12 flume]# hadoop fs -text /home/dong/vp_flume/* | awk '{printf("%s\n", substr($0,0,20))}' | more
1505276698665   {"stat
1505276698665   {"stat
1505276698666   {"stat
1505276698666   {"stat
1505276698666   {"stat
1505276698667   {"stat
1505276698667   {"stat
1505276698667   {"stat
1505276698668   {"stat
1505276698668   {"stat
1505276698668   {"stat
1505276698668   {"stat
1505276698669   {"stat
1505276698669   {"stat
1505276698669   {"stat
1505276698669   {"stat
1505276698670   {"stat
1505276698670   {"stat
1505276698670   {"stat
1505276698670   {"stat

Question: I don't want the timestamp that Flume prepends to each event. How can I get rid of it with the proper Flume configuration?

1 Answer:

Answer 0 (score: 1):

You haven't explicitly set the hdfs.fileType property in your agent configuration file, so Flume falls back to its default, SequenceFile. SequenceFile supports two write formats: Text and Writable. You have set hdfs.writeFormat = Text, which means Flume uses HDFSTextSerializer to serialize your events. If you look at its source (line 53), you will see that it adds a timestamp as the default key.
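
For reference, the key-generation logic in HDFSTextSerializer looks roughly like this (a paraphrase of the linked source, not a verbatim copy):

// Roughly what HDFSTextSerializer does when building the SequenceFile key:
// use the event's "timestamp" header if present, otherwise the current time.
private Object getKey(Event e) {
    String timestamp = e.getHeaders().get("timestamp");
    long eventStamp = (timestamp == null)
            ? System.currentTimeMillis()
            : Long.valueOf(timestamp);
    return new LongWritable(eventStamp);
}

That LongWritable key is the number you see at the start of every line, because hadoop fs -text decodes a SequenceFile as key<TAB>value.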

Switching to hdfs.writeFormat = Writable won't help either, because HDFSWritableSerializer does the same thing; you can check its source here (line 52).

A SequenceFile always requires a key. So unless you have a strong reason to use SequenceFile, I recommend setting hdfs.fileType = DataStream in your agent configuration.
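
A minimal sketch of the change, based on your own sink section (only the last line is new):

agent.sinks.loggerSink.type = hdfs
agent.sinks.loggerSink.hdfs.path = /home/dong/vp_flume
agent.sinks.loggerSink.hdfs.writeFormat = Text
agent.sinks.loggerSink.hdfs.rollInterval = 0
agent.sinks.loggerSink.hdfs.rollSize = 1000000000
agent.sinks.loggerSink.hdfs.rollCount = 0
agent.sinks.loggerSink.hdfs.fileType = DataStream

With DataStream, the event body is written to HDFS as-is, with no SequenceFile wrapper and therefore no key, so each line should contain only your original JSON. You can confirm with something like hadoop fs -cat /home/dong/vp_flume/* | head (cat rather than text, since the new files are plain text).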