Converting map to mapPartitions in PySpark

Asked: 2017-02-08 11:54:15

Tags: python tensorflow pyspark

I am trying to load a TensorFlow model from disk and use it to predict values.

Code

import tensorflow as tf

def get_value(row):
    print("**********************************************")
    graph = tf.Graph()
    rowkey = row[0]
    checkpoint_file = "/home/sahil/Desktop/Relation_Extraction/data/1485336002/checkpoints/model-300"
    print("Loading model................................")
    with graph.as_default():
        # allow_soft_placement and log_device_placement are assumed to be
        # defined elsewhere in the script (e.g. as tf.flags/FLAGS values)
        session_conf = tf.ConfigProto(
            allow_soft_placement=allow_soft_placement,
            log_device_placement=log_device_placement)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            # Load the saved meta graph and restore variables
            saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
            saver.restore(sess, checkpoint_file)
            input_x = graph.get_operation_by_name("X_train").outputs[0]
            dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
            predictions = graph.get_operation_by_name("output/predictions").outputs[0]
            batch_predictions = sess.run(predictions, {input_x: [row[1]], dropout_keep_prob: 1.0})
            print(batch_predictions)
            return (rowkey, batch_predictions)

I have an RDD consisting of (rowkey, input_vector) tuples. I want to use the loaded model to predict the score/class for each input.

Code calling get_value()

result = data_rdd.map(lambda iter: get_value(iter))
result.foreach(print)

The problem is that the model is loaded on every map call, i.e. once for every tuple, and that takes a lot of time.

I am thinking of loading the model with mapPartitions and then using map to call the get_value function. I don't know how to convert the code to mapPartitions so that the TensorFlow model is loaded only once per partition, which would reduce the running time.

Thanks in advance.

2 Answers:

Answer 0 (score: 2)

I am not sure I have understood your question correctly, but we can optimize your code here.

import cPickle

import tensorflow as tf

graph = tf.Graph()

checkpoint_file = "/home/sahil/Desktop/Relation_Extraction/data/1485336002/checkpoints/model-300"

with graph.as_default():
    session_conf = tf.ConfigProto(
        allow_soft_placement=allow_soft_placement,
        log_device_placement=log_device_placement)
    sess = tf.Session(config=session_conf)

    # Load the saved meta graph and restore variables. This must happen inside
    # graph.as_default() so the restored operations end up in `graph`.
    saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
    saver.restore(sess, checkpoint_file)

    # Get the placeholders and the output tensor from the graph by name
    input_x = graph.get_operation_by_name("X_train").outputs[0]
    dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
    predictions = graph.get_operation_by_name("output/predictions").outputs[0]

# Serialize the session once on the driver so it can be shipped to the executors
session_pickle = cPickle.dumps(sess)


def get_value(key, vector, session_pickle):
    # Deserialize the session on the executor
    sess = cPickle.loads(session_pickle)
    rowkey = key
    batch_predictions = sess.run(predictions, {input_x: [vector], dropout_keep_prob: 1.0})
    print(batch_predictions)
    return (rowkey, batch_predictions)


result = data_rdd.map(lambda (key, row): get_value(key=key, vector=row, session_pickle=session_pickle))
result.foreach(print)

This way you serialize your TensorFlow session once instead of reloading the model for every record. I haven't tested your code here, though; run it and leave a comment.
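
As a hedged follow-up sketch: the pickled bytes could also be shipped explicitly as a Spark broadcast variable instead of being captured in the lambda's closure. This assumes the session really does pickle (untested, as noted above) and that sc, the SparkContext, is in scope.

# Hedged sketch: ship the pickled session to each executor via a broadcast variable.
# Assumes sc is the SparkContext and that cPickle.dumps(sess) above succeeded.
session_bc = sc.broadcast(session_pickle)

result = data_rdd.map(lambda (key, row): get_value(key=key, vector=row, session_pickle=session_bc.value))
result.foreach(print)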

Answer 1 (score: 1)

I think the following code is a huge improvement, as it uses mapPartitions.

Code

import tensorflow as tf

def predict(rows):
    graph = tf.Graph()
    checkpoint_file = "/home/sahil/Desktop/Relation_Extraction/data/1485336002/checkpoints/model-300"
    print("Loading model................................")
    with graph.as_default():
        session_conf = tf.ConfigProto(
            allow_soft_placement=allow_soft_placement,
            log_device_placement=log_device_placement)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            # Load the saved meta graph and restore variables
            saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
            saver.restore(sess, checkpoint_file)
        print("**********************************************")
        # Get the placeholders from the graph by name
        input_x = graph.get_operation_by_name("X_train").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
        # Tensors we want to evaluate
        predictions = graph.get_operation_by_name("output/predictions").outputs[0]

        # Generate batches for one epoch
        for row in rows:
            X_test = [row[1]]
            batch_predictions = sess.run(predictions, {input_x: X_test, dropout_keep_prob: 1.0})
            yield (row[0], batch_predictions)


result = data_rdd.mapPartitions(lambda iter: predict(iter))
result.foreach(print)
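
One usage note, as an untested sketch: predict() is invoked once per partition, so the number of partitions of data_rdd determines how many times the checkpoint is loaded. Repartitioning before the call is one way to tune that trade-off; the partition count of 8 below is purely illustrative.

data_rdd = data_rdd.repartition(8)  # illustrative: 8 partitions -> the model is loaded 8 times
result = data_rdd.mapPartitions(predict)
result.foreach(print)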