I set up an agent in Flume with a spooling directory source, a memory channel, and a Hive sink, trying to ingest data into a Hive table. I looked through other posts for troubleshooting ideas, but I still cannot resolve the problem: the data is never loaded into Hive.
Here is my configuration:
# flume_hivesink.conf: A single-node Flume configuration
# spooling directory source, memory channel, hive sink
# Name the components on this agent
a1.sources = s1
a1.sinks = k1
a1.channels = c1
# Describe / configure the source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/apps/apache-flume-1.6.0-bin/home/input
# Describe the sink
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://mini01:9083
a1.sinks.k1.hive.database = flume_hive
a1.sinks.k1.hive.table = student
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames =name,age,class
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
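For completeness, the Hive sink and memory channel also have a few tuning properties I have not set; the values below are illustrative and taken from the Flume 1.6 user guide defaults, not settings I have verified on this cluster:

```properties
# Optional Hive sink tuning (example values, defaults per Flume 1.6 docs)
a1.sinks.k1.hive.txnsPerBatchAsk = 100   # Hive transactions requested per batch
a1.sinks.k1.batchSize = 15000            # max events written per Hive transaction
a1.sinks.k1.heartBeatInterval = 240      # seconds between heartbeats keeping txns alive

# Memory channel sizing (defaults are small; raise for bursty input)
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
```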
The spooling directory is at /opt/apps/apache-flume-1.6.0-bin/home/input. The idea is to have Flume watch that directory for new files.
Contents of the data file (/opt/apps/apache-flume-1.6.0-bin/home/input/abc.txt):
jz 18 junior
ty 23 senior
sz 100 junior
The schema of the student table in Hive is:
hive (flume_hive)> show create table student;
OK
createtab_stmt
CREATE TABLE `student`(
`name` string,
`age` int,
`class` string)
CLUSTERED BY (
age)
INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/flume_hive.db/student'
TBLPROPERTIES (
'orc.compress'='NONE',
'transient_lastDdlTime'='1547726002')
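One thing I noticed while researching: the Flume Hive sink writes through the Hive Streaming API, which from what I have read requires the target table to be bucketed, stored as ORC, and marked transactional. My table is bucketed ORC, but the TBLPROPERTIES above do not show 'transactional'='true'. A sketch of a DDL meeting the streaming requirements (untested on my cluster) would be:

```sql
-- Hypothetical table definition meeting Hive Streaming requirements:
-- bucketed, stored as ORC, and flagged transactional
CREATE TABLE student (
  name  STRING,
  age   INT,
  class STRING)
CLUSTERED BY (age) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```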
The settings in my Hive configuration file (hive-site.xml) are:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://mini01:9083</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://mini03:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
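From what I have read, Hive Streaming also needs ACID support enabled on the Hive side, and none of those settings appear in my hive-site.xml above. The fragment below is a sketch based on the Hive wiki's transaction configuration, not something I have confirmed works here:

```xml
<!-- Hypothetical ACID settings that Hive Streaming reportedly requires -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```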
Here are the steps I perform in Hive and Flume:
1. I start the metastore service in Hive in the background with: bin/hive --service metastore
2. Next, I start the flume-ng process with:
flume-ng agent -c ./conf -f ./conf/flume_hivesink.conf -n a1 -Dflume.root.logger=INFO,console
Here is part of the output I get on the Flume side:
2019-01-17 23:08:25,089 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:114)] Channel c1 connected to [s1, k1]
2019-01-17 23:08:25,102 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:Spool Directory source s1: { spoolDir: /opt/apps/apache-flume-1.6.0-bin/home/input } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@2034c2b2 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
2019-01-17 23:08:25,134 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel c1
2019-01-17 23:08:25,340 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2019-01-17 23:08:25,341 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: c1 started
2019-01-17 23:08:25,342 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2019-01-17 23:08:25,346 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source s1
2019-01-17 23:08:25,348 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.source.SpoolDirectorySource.start(SpoolDirectorySource.java:78)] SpoolDirectorySource source starting with directory: /opt/apps/apache-flume-1.6.0-bin/home/input
2019-01-17 23:08:25,368 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
2019-01-17 23:08:25,368 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: k1 started
2019-01-17 23:08:25,370 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.hive.HiveSink.start(HiveSink.java:502)] k1: Hive Sink k1 started
2019-01-17 23:08:25,444 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: s1: Successfully registered new MBean.
2019-01-17 23:08:25,444 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: s1 started
The flume-ng process then keeps running, and no error messages appear. I checked the Hive table student again, and it contains no new data, as shown below:
hive (flume_hive)> select * from student;
OK
student.name student.age student.class
Time taken: 1.602 seconds
I also checked the spooling directory, and abc.txt has been renamed as shown below:
[hadoop@mini01 input]$ pwd
/opt/apps/apache-flume-1.6.0-bin/home/input
[hadoop@mini01 input]$ ls
abc.txt.COMPLETED
Finally, I tried switching to other sources (e.g. exec, http, etc.), but the data still never made it into the student table in Hive. Is there something I am missing or doing wrong that prevents the data from being loaded into the student table in Hive?