I have implemented the daily computation. Here is some pseudocode ("newUser" means a first-time active user):
// Get today log from hbase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active user
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history user from hdfs
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new user from active user and historyUser
val newUser = activeUser.subtractByKey(historyUser)
// Get new history user
val newHistoryUser = historyUser.union(newUser)
// Save today history user
saveToHdfs(newHistoryUser, path + todayDate)
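Filled out a little, the daily batch above could look roughly like the following compilable sketch. The `LogLine` case class and the choice of representative record per key are my assumptions, since the real log schema and `distinctStrategyMethod` are not shown:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical record type; the real log schema is not given in the question.
case class LogLine(uid: String, appId: String, payload: String)

def computeDaily(
    log: RDD[LogLine],
    historyUser: RDD[((String, String), LogLine)])
  : (RDD[((String, String), LogLine)], RDD[((String, String), LogLine)]) = {
  // One record per (uid, appId); keep an arbitrary representative line
  // (stands in for the question's distinctStrategyMethod).
  val activeUser = log.map(line => ((line.uid, line.appId), line))
                      .reduceByKey((a, _) => a)
  // Users active today but absent from the accumulated history are new.
  val newUser = activeUser.subtractByKey(historyUser)
  // Tomorrow's history is today's history plus today's new users.
  val newHistoryUser = historyUser.union(newUser)
  (newUser, newHistoryUser)
}
```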
Computing "activeUser" is easy to port to Spark Streaming. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
  // Truncate the current timestamp to today's midnight in UTC+8, in seconds
  val time = System.currentTimeMillis()
  val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
  ((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})
// Over a 24-hour window sliding every minute, keep the earliest record per key
val activeUser = transformedLog
  .groupByKeyAndWindow(Seconds(86400), Seconds(60))
  .mapValues(records => records.minBy(_._2))
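As a side note (my suggestion, not part of the question): since only the earliest record per key is kept, `reduceByKeyAndWindow` can express the same thing without buffering every record in the window the way `groupByKeyAndWindow` does:

```scala
// Keep only the earliest record per key inside the window, reducing pairwise
// instead of materializing the whole group.
val activeUser = transformedLog.reduceByKeyAndWindow(
  (a: (String, Int, Int), b: (String, Int, Int)) => if (a._2 <= b._2) a else b,
  Seconds(86400),
  Seconds(60)
)
```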
But "newUser" and "historyUser" confuse me.
I think my problem can be summarized as "how to compute the new elements from a stream". As in my pseudocode above, "newUser" is part of "activeUser". I have to maintain a set of "historyUser" to know which part is "newUser".
I can think of one approach, but I suspect it may not work correctly:
Load the history users as an RDD. For each batch of the "activeUser" DStream, find the elements that do not exist in "historyUser". The problem here is: when should I update the "historyUser" RDD so that I get the correct "newUser" for each window?
更新&#34; historyUser&#34; RDD意味着添加&#34; newUser&#34;它。就像我在上面的伪代码中做的那样。 &#34; historyUser&#34;在该代码中每天更新一次。 另一个问题是如何从DStream更新RDD操作。我认为更新&#34; historyUser&#34;当窗口滑动是正确的。但我还没有找到合适的API来做到这一点
So what is the best practice for solving this problem?
Answer 0 (score: 0)
`updateStateByKey` will help here, because it allows you to set an initial state (your historical users) and then update it on every interval of your main stream. I put some code together to explain the concept:
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))
case class UserStatusState(isNew: Boolean, values: UserData)
// this will prepare the RDD of already known historical users,
// keyed by uid, to pass into updateStateByKey as initial state
// (assumes UserData carries a uid field usable as the key)
val initialStateRDD = historyUsers.map(user => (user.uid, UserStatusState(false, user)))
// stateful stream; assumes sdkLogDs has been keyed by uid beforehand,
// e.g. sdkLogDs.map(log => (log.uid, UserData(...)))
val trackUsers = sdkLogDs.updateStateByKey(updateState,
  new HashPartitioner(sdkLogDs.context.sparkContext.defaultParallelism),
  true, initialStateRDD)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)
def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
// Group all values for specific user as needed
val groupedUserData: UserData = newValues.reduce(...)
// prevState is defined only for users previously seen in the stream
// or loaded as initial state from historyUsers RDD
// For new users it is None
val isNewUser = prevState.isEmpty
// as you return state here for the user - prevState won't be None on next iterations
Some(UserStatusState(isNewUser, groupedUserData))
}