I'd like to know whether it's possible to store a set of strings in RAM with Apache Spark. Specifically, I want to query and update these strings based on the new input data that Apache Spark is processing. Also, if possible, can a node notify all the other nodes about which strings it stores? If you need more information about my project, feel free to ask.
Answer 0: (Score: 1)
Yes, what you want is Spark Streaming's stateful function mapWithState. It lets you update state cached in memory across streaming batches. Note that you will need to enable checkpointing if you have not already done so.
Example usage in Scala:
def stateUpdateFunction(userId: UserId,
                        newData: UserAction,
                        stateData: State[UserSession]): UserModel = {
  val currentSession = stateData.get()  // Get current session data
  val updatedSession = ...              // Compute updated session using newData
  stateData.update(updatedSession)      // Update session data
  val userModel = ...                   // Compute model using updatedSession
  userModel                             // Send model downstream
}
// Stream of user actions, keyed by the user ID
val userActions = ... // stream of key-value tuples of (UserId, UserAction)
// Stream of data to commit
val userModels = userActions.mapWithState(StateSpec.function(stateUpdateFunction))
Example usage in Java:
// Function that updates the cumulative count for each word
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
    new Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>>() {
      @Override
      public Tuple2<String, Integer> call(String word, Optional<Integer> one,
                                          State<Integer> state) {
        int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
        Tuple2<String, Integer> output = new Tuple2<>(word, sum);
        state.update(sum);
        return output;
      }
    };

// DStream of cumulative counts that get updated in every batch
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
    wordsDstream.mapWithState(StateSpec.function(mappingFunc).initialState(initialRDD));
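To see what the mapping function does across batches without setting up a Spark cluster, the per-key state update can be simulated with a plain map. This is only a sketch of the logic, not Spark's actual API: the `update` helper, the class name, and the batch data below are all hypothetical stand-ins for what `mapWithState` maintains per key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapWithStateSim {
    // Mirrors the Java mappingFunc above: new count = incoming value + stored state.
    // The shared map plays the role of Spark's per-key State<Integer>.
    static Map.Entry<String, Integer> update(Map<String, Integer> state, String word, int one) {
        int sum = one + state.getOrDefault(word, 0);
        state.put(word, sum);        // corresponds to state.update(sum)
        return Map.entry(word, sum); // the value emitted downstream for this record
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();

        // Two hypothetical micro-batches of incoming words
        List<String> batch1 = List.of("spark", "streaming", "spark");
        List<String> batch2 = List.of("spark", "state");

        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : batch1) update(state, w, 1);
        for (String w : batch2) out.add(update(state, w, 1));

        // Counts from batch2 carry over the state accumulated during batch1
        System.out.println(out);
    }
}
```

In real Spark Streaming this state lives partitioned across executors and is restored from checkpoints after failure, which is why checkpointing must be enabled; the in-memory map here only illustrates the update semantics.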