Flume creates an empty line at the end of the output file in HDFS

Date: 2016-01-13 15:48:11

Tags: hadoop flume flume-ng bigdata

I am currently using Flume version 1.5.2.

Flume creates an empty line at the end of every output file in HDFS, so the row count, file size, and checksum do not match between the source and destination files.

I tried overriding the default values of the rollSize, batchSize, and appendNewline parameters, but it still does not work.

Flume also changes the EOL from CRLF (source file) to LF (output file), which likewise makes the file sizes differ.
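Both symptoms have one plausible cause: the spooldir source's LINE deserializer splits events on line terminators (CRLF or LF alike) and discards them, and the HDFS sink's text serializer then writes each event followed by a single LF. The sketch below is an illustration of that behavior in Python, not Flume's actual code:

```python
def flume_round_trip(source_bytes: bytes) -> bytes:
    """Illustrate LINE deserializer + text serializer (appendNewline=true)."""
    # LINE deserializer: split on CR, LF, or CRLF and drop the terminators.
    events = source_bytes.splitlines()
    # Text serializer with appendNewline=true: write each event plus LF.
    return b"".join(event + b"\n" for event in events)

source = b"row1\r\nrow2\r\n"     # CRLF-terminated source, 12 bytes
output = flume_round_trip(source)
print(output)                    # b'row1\nrow2\n' -- LF only
print(len(source), len(output))  # 12 10 -- sizes and checksums differ
```

The output's trailing LF is what many editors and `wc -l` report as an extra empty last line, and the CRLF-to-LF conversion accounts for the remaining size difference.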

Below are the relevant Flume agent configuration parameters I am using:

 agent1.sources = c1
 agent1.sinks = c1s1
 agent1.channels = ch1

 agent1.sources.c1.type = spooldir
 agent1.sources.c1.spoolDir = /home/biadmin/flume-test/sourcedata1
 agent1.sources.c1.bufferMaxLineLength = 80000
 agent1.sources.c1.channels = ch1
 agent1.sources.c1.fileHeader = true 
 agent1.sources.c1.fileHeaderKey = file
 #agent1.sources.c1.basenameHeader = true
 #agent1.sources.c1.fileHeaderKey = basenameHeaderKey
 #agent1.sources.c1.filePrefix = %{basename}
 agent1.sources.c1.inputCharset = UTF-8
 agent1.sources.c1.decodeErrorPolicy = IGNORE
 agent1.sources.c1.deserializer= LINE
 agent1.sources.c1.deserializer.maxLineLength =  50000
 agent1.sources.c1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.sources.c1.interceptors = a b
agent1.sources.c1.interceptors.a.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent1.sources.c1.interceptors.b.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent1.sources.c1.interceptors.b.preserveExisting = false
agent1.sources.c1.interceptors.b.hostHeader = host

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 1000
agent1.channels.ch1.batchSize = 1000
agent1.channels.ch1.maxFileSize = 2073741824
agent1.channels.ch1.keep-alive = 5
agent1.sinks.c1s1.type = hdfs
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent1.sinks.c1s1.hdfs.fileType = DataStream
agent1.sinks.c1s1.hdfs.filePrefix = %{file}
agent1.sinks.c1s1.hdfs.fileSuffix = .csv
agent1.sinks.c1s1.hdfs.writeFormat = Text
agent1.sinks.c1s1.hdfs.maxOpenFiles = 10
agent1.sinks.c1s1.hdfs.rollSize = 67000000
agent1.sinks.c1s1.hdfs.rollCount = 0
#agent1.sinks.c1s1.hdfs.rollInterval = 0
agent1.sinks.c1s1.hdfs.batchSize = 1000
agent1.sinks.c1s1.channel = ch1
#agent1.sinks.c1s1.hdfs.codeC = snappyCodec
agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false

hdfs.serializer.appendNewline did not fix the problem. Can anyone take a look and advise?
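One way to confirm the mismatch (and, later, a fix) is to compare newline counts and checksums of the source file against what Flume wrote. The sketch below simulates the symptom locally with hypothetical file names; against a real cluster, the HDFS side would be fetched with `hdfs dfs -cat`:

```shell
# Hypothetical CRLF source file with no trailing newline
printf 'a,1\r\nb,2' > source.csv
# What the HDFS sink writes: LF endings plus a trailing newline per event
printf 'a,1\nb,2\n' > hdfs_copy.csv
wc -l source.csv hdfs_copy.csv     # newline counts differ: 1 vs 2
md5sum source.csv hdfs_copy.csv    # checksums differ
# On a real cluster, compare against:
#   hdfs dfs -cat /user/biadmin/flume/<date>/<file>.csv | md5sum
```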

2 answers:

Answer 0 (score: 0)

Replace the following line in your Flume agent:

agent1.sinks.c1s1.serializer.appendNewline = false

with the line below, and let me know how it goes:

agent1.sinks.c1s1.hdfs.serializer.appendNewline = false

Answer 1 (score: 0)

Replace

agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false

with

agent1.sinks.c1s1.serializer = text
agent1.sinks.c1s1.serializer.appendNewline = false

The difference is that the serializer settings are not set under the hdfs prefix, but directly on the sink name.

The Flume documentation should include some examples of this; I ran into the same problem because I did not notice that the serializer settings live at a different property level.

More information about the HDFS sink: https://flume.apache.org/FlumeUserGuide.html#hdfs-sink
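Putting the suggested fix together, the sink section of the agent would look like the sketch below. Only the serializer keys move off the hdfs prefix; the path and other values are copied unchanged from the question:

agent1.sinks.c1s1.type = hdfs
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent1.sinks.c1s1.hdfs.fileType = DataStream
agent1.sinks.c1s1.hdfs.writeFormat = Text
# serializer is a sink-level key, not an hdfs.* key
agent1.sinks.c1s1.serializer = text
agent1.sinks.c1s1.serializer.appendNewline = false

Note that this suppresses the newline Flume appends after each event; whether the CRLF-to-LF conversion done by the LINE deserializer also needs addressing depends on whether byte-identical output is required.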