Can I store data in RAM using Apache Spark?

Asked: 2016-11-24 13:35:03

Tags: apache-spark spark-streaming

I would like to know whether it is possible to store a collection of strings, for example, in RAM with Apache Spark. In fact, I want to query and update these strings based on the new input data that Apache Spark is processing. Also, if possible, can a node notify all other nodes about which strings it stores? If you need more information about my project, feel free to ask.


1 Answer:

Answer (score: 1)

Yes, what you need is the stateful streaming operation mapWithState. It lets you update state cached in memory across streaming batches.

Note that you will need to enable checkpointing if you have not done so already.
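As a minimal setup sketch (the class name, app name, batch interval, and checkpoint path below are placeholder assumptions, not from the original answer), checkpointing is enabled on the streaming context before any mapWithState call:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointSetup {
    public static void main(String[] args) {
        // mapWithState requires a fault-tolerant checkpoint directory where
        // Spark periodically snapshots the keyed state.
        SparkConf conf = new SparkConf().setAppName("StatefulApp").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        jssc.checkpoint("/tmp/spark-checkpoint");  // placeholder path; use HDFS in production
        // ... define input DStreams and the mapWithState pipeline here, then jssc.start()
    }
}
```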

Example usage in Scala:

def stateUpdateFunction(userId: UserId,
                        newData: UserAction,
                        stateData: State[UserSession]): UserModel = {
  val currentSession = stateData.get()  // Get current session data
  val updatedSession = ...              // Compute updated session using newData
  stateData.update(updatedSession)      // Update session data
  val userModel = ...                   // Compute model using updatedSession
  userModel                             // Send model downstream
}

// Stream of user actions, keyed by the user ID
val userActions = ...  // stream of key-value tuples of (UserId, UserAction)
// Stream of data to commit
val userModels = userActions.mapWithState(StateSpec.function(stateUpdateFunction))

https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html

Example usage in Java:

// Update the cumulative count function
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
    new Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>>() {
      @Override
      public Tuple2<String, Integer> call(String word, Optional<Integer> one,
          State<Integer> state) {
        int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
        Tuple2<String, Integer> output = new Tuple2<>(word, sum);
        state.update(sum);
        return output;
      }
    };

// DStream of cumulative counts that get updated in every batch
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
    wordsDstream.mapWithState(StateSpec.function(mappingFunc).initialState(initialRDD));
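Since the snippet above needs a running Spark context, here is a self-contained plain-Java sketch of the same update semantics (no Spark dependency; the class and method names are my own, chosen for illustration): each incoming (word, 1) record reads the running total for that key from state, adds the new value, writes the total back, and emits it, exactly as mappingFunc does.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class StatefulCountDemo {

    // Mirrors mappingFunc.call: combine the new value with the stored state.
    static int updateState(Map<String, Integer> state, String word, int one) {
        int sum = one + state.getOrDefault(word, 0); // state.exists() ? state.get() : 0
        state.put(word, sum);                        // state.update(sum)
        return sum;                                  // the emitted (word, sum) value
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>(); // survives across batches
        // Two micro-batches of words, as a DStream would deliver them.
        String[][] batches = {{"spark", "spark", "streaming"}, {"spark", "state"}};
        for (String[] batch : batches) {
            Map<String, Integer> emitted = new LinkedHashMap<>();
            for (String word : batch) {
                emitted.put(word, updateState(state, word, 1));
            }
            System.out.println(emitted); // cumulative counts after this batch
        }
    }
}
```

Note how "spark" keeps accumulating across batches because the state map outlives each batch; in real Spark Streaming that persistence is what the checkpoint directory protects against node failure.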

https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaStatefulNetworkWordCount.java (line 90)