How can I change Apache Flume's configuration file from Java code?

Asked: 2018-04-03 20:37:24

Tags: hadoop flume sentiment-analysis flume-twitter

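I am currently working on a big data project that performs sentiment analysis on Twitter's trending topics. I followed Cloudera's tutorial to learn how to get tweets into Hadoop through Flume.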

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

flume.conf:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

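Now, to extend this to my application, I need the keywords section of the Flume configuration file to contain the trending topics. I have written Java code that fetches the trending topics, but I don't know how to connect that code to the Flume configuration file, or how to generate a new file with the live trending topics in the keywords section. I have searched the web a lot. As I am a beginner in this area, any information, or at least some alternative approach, would be a great help.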

1 Answer:

Answer 0 (score: 0)

A very interesting question!

I agree with @cricket_007's comment: the configuration can be edited without restarting the Flume agent.
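To make the "edit the configuration" route concrete, here is a minimal sketch (untested against a live agent) of how Java code could rewrite only the keywords line of flume.conf. The class name FlumeKeywordUpdater and its methods are hypothetical, made up for illustration; only the property name comes from the configuration above. As far as I understand, an agent started from a properties file polls that file for changes unless it is launched with --no-reload-conf, but whether the Twitter source then picks up the new keywords cleanly is an assumption you would need to verify.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Flume): rewrites the keywords property in flume.conf.
final class FlumeKeywordUpdater {
    // Property name taken from the flume.conf shown in the question.
    static final String KEYWORDS_PROPERTY = "TwitterAgent.sources.Twitter.keywords";

    // True if the line assigns the keywords property using "=" as the separator.
    static boolean isKeywordsLine(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith(KEYWORDS_PROPERTY)) {
            return false;
        }
        return trimmed.substring(KEYWORDS_PROPERTY.length()).trim().startsWith("=");
    }

    // Returns a copy of the lines with the keywords property replaced,
    // or appended at the end if the file did not contain it.
    static List<String> withKeywords(List<String> lines, String keywords) {
        List<String> result = new ArrayList<>();
        boolean replaced = false;
        for (String line : lines) {
            if (isKeywordsLine(line)) {
                if (!replaced) {
                    result.add(KEYWORDS_PROPERTY + " = " + keywords);
                    replaced = true;
                }
                // Any duplicate keywords lines are dropped.
            } else {
                result.add(line);
            }
        }
        if (!replaced) {
            result.add(KEYWORDS_PROPERTY + " = " + keywords);
        }
        return result;
    }

    // Reads the file, swaps the keywords line, and writes the file back in place.
    static void updateFile(Path conf, String keywords) throws IOException {
        List<String> lines = Files.readAllLines(conf, StandardCharsets.UTF_8);
        Files.write(conf, withKeywords(lines, keywords), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: FlumeKeywordUpdater <path-to-flume.conf> <comma-separated keywords>");
            return;
        }
        updateFile(Paths.get(args[0]), args[1]);
    }
}
```

The file is rewritten in place, so the agent could in principle read it mid-write; writing to a temporary file and moving it over the original would be a safer variant.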

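Since I haven't seen your Java code for fetching the trending-topic keywords, I can't say much. However, based on the information you have provided, I can think of an alternative (or should I say a workaround), although I haven't tried it myself.

You could modify the TwitterSource.java class like this: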

public void configure(Context context) {
  consumerKey = context.getString(TwitterSourceConstants.CONSUMER_KEY_KEY);
  consumerSecret = context.getString(TwitterSourceConstants.CONSUMER_SECRET_KEY);
  accessToken = context.getString(TwitterSourceConstants.ACCESS_TOKEN_KEY);
  accessTokenSecret = context.getString(TwitterSourceConstants.ACCESS_TOKEN_SECRET_KEY);

  //MODIFY THE FOLLOWING PORTION
  String keywordString = context.getString(TwitterSourceConstants.KEYWORDS_KEY, "");
  if (keywordString.trim().length() == 0) {
    keywords = new String[0];
  } else {
    keywords = keywordString.split(",");
    for (int i = 0; i < keywords.length; i++) {
      keywords[i] = keywords[i].trim();
    }
  }
  //UNTIL THIS POINT

  ConfigurationBuilder cb = new ConfigurationBuilder();
  cb.setOAuthConsumerKey(consumerKey);
  cb.setOAuthConsumerSecret(consumerSecret);
  cb.setOAuthAccessToken(accessToken);
  cb.setOAuthAccessTokenSecret(accessTokenSecret);
  cb.setJSONStoreEnabled(true);
  cb.setIncludeEntitiesEnabled(true);

  twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
}

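In the code above I have marked with comments the portion where the keywordString variable is initialized. There, instead of extracting the value from the context (which is populated from flume.conf), you can call your own Java code (I assume it is a method that returns a comma-separated string of keywords); just remove the context.getString() part.

Apart from that, just remove the following statement from flume.conf: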
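As a rough sketch of what that replacement could look like, the class below stands in for your trending-topics code. Everything in it is hypothetical: TrendingKeywords, fetchTrendingTopics() and the hard-coded topics are placeholders for whatever your own code does, and parseKeywords() simply mirrors the split-and-trim logic from the marked portion of configure().

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: produces the keyword array that configure() would otherwise
// build from the flume.conf value.
final class TrendingKeywords {

    // Placeholder for the asker's own code; a real version would query Twitter.
    static List<String> fetchTrendingTopics() {
        return Arrays.asList("#BigData", " Hadoop ", "", "Apache Flume");
    }

    // Joins topics into the comma-separated form the original code expects,
    // trimming whitespace and skipping empty entries.
    static String toKeywordString(List<String> topics) {
        return topics.stream()
                .map(String::trim)
                .filter(topic -> !topic.isEmpty())
                .collect(Collectors.joining(","));
    }

    // Same split-and-trim logic as the marked portion of configure().
    static String[] parseKeywords(String keywordString) {
        if (keywordString == null || keywordString.trim().isEmpty()) {
            return new String[0];
        }
        String[] keywords = keywordString.split(",");
        for (int i = 0; i < keywords.length; i++) {
            keywords[i] = keywords[i].trim();
        }
        return keywords;
    }

    public static void main(String[] args) {
        String[] keywords = parseKeywords(toKeywordString(fetchTrendingTopics()));
        System.out.println(Arrays.toString(keywords)); // prints [#BigData, Hadoop, Apache Flume]
    }
}
```

Inside configure(), the marked portion would then shrink to something like keywords = TrendingKeywords.parseKeywords(TrendingKeywords.toKeywordString(TrendingKeywords.fetchTrendingTopics())). Note that a topic containing a comma would be split into two keywords, and that configure() only runs when the source is (re)configured, so the keywords are fixed until then.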

TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

I hope this helps.