Is it possible to run Python TensorFlow code on a TPU without using the Estimator API?

Date: 2019-05-15 17:04:51

Tags: tensorflow reinforcement-learning tensorflow-estimator tpu

I have spent weeks trying to write Python-level TensorFlow code that can communicate directly with a TPU. Without the Estimator API, how can I build something that actually runs on a TPU?

Resources I have tried:

Approaches I have tried:

  • Initialized a TPUClusterResolver and passed it to tf.Session(); the session just hangs and never gets to session.run() (a sketch of this attempt follows this list)

  • Also tried sess.run(tpu.initialize_system()), which got stuck as well

  • Tried looking into the TPUEstimator API
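As a minimal sketch, this is roughly the session-based path I have been attempting, assuming TensorFlow 1.x with tf.contrib available and a Cloud TPU reachable under the placeholder name 'my-tpu'; the toy computation just stands in for a real model step:

import tensorflow as tf

# Resolve the TPU worker's gRPC address ('my-tpu' is a placeholder name).
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')

def computation(x):
    # Toy computation standing in for the real model step.
    return x * 2.0

inputs = tf.placeholder(tf.float32, shape=[8])
tpu_computation = tf.contrib.tpu.rewrite(computation, [inputs])

# The session must target the TPU worker's address; a plain tf.Session()
# without that target is one way to end up hanging on session.run().
with tf.Session(resolver.get_master()) as sess:
    sess.run(tf.contrib.tpu.initialize_system())
    sess.run(tf.global_variables_initializer())
    print(sess.run(tpu_computation, feed_dict={inputs: [1.0] * 8}))
    sess.run(tf.contrib.tpu.shutdown_system())

For reference, below is the training loop I am trying to move onto the TPU (it assumes module-level imports of os, numpy as np, pandas as pd, tensorflow as tf, and a logger named log):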

def train_model(self, env, episodes=100,
                load_model=False,   # load model from checkpoint if available
                model_dir='/tmp/pgmodel/', log_freq=10):

        # initialize variables and load model
        init_op = tf.global_variables_initializer()
        self._sess.run(init_op)
        if load_model:
            ckpt = tf.train.get_checkpoint_state(model_dir)
            print(tf.train.latest_checkpoint(model_dir))
            if ckpt and ckpt.model_checkpoint_path:
                savr = tf.train.import_meta_graph(ckpt.model_checkpoint_path+'.meta')
                out = savr.restore(self._sess, ckpt.model_checkpoint_path)
                print("Model restored from ",ckpt.model_checkpoint_path)
            else:
                print('No checkpoint found at: ',model_dir)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        episode = 0
        observation = env.reset()
        xs,rs,ys = [],[],[]    # environment info
        running_reward = 0    
        reward_sum = 0
        # training loop
        day = 0
        simrors = np.zeros(episodes)
        mktrors = np.zeros(episodes)
        alldf = None
        victory = False
        while episode < episodes and not victory:
            # stochastically sample a policy from the network
            x = observation
            feed = {self._tf_x: np.reshape(x, (1,-1))}
            aprob = self._sess.run(self._tf_aprob,feed)
            aprob = aprob[0,:] # we live in a batched world :/

            action = np.random.choice(self._num_actions, p=aprob)
            label = np.zeros_like(aprob) ; label[action] = 1 # make a training 'label'

            # step the environment and get new measurements
            observation, reward, done, info = env.step(action)
            # print(observation, reward, done, info)
            reward_sum += reward

            # record game history
            xs.append(x)
            ys.append(label)
            rs.append(reward)
            day += 1
            if done:
                running_reward = running_reward * 0.99 + reward_sum * 0.01
                epx = np.vstack(xs)
                epr = np.vstack(rs)
                epy = np.vstack(ys)
                xs,rs,ys = [],[],[] # reset game history
                df = env.env.sim.to_df()
                #pdb.set_trace()
                simrors[episode]=df.bod_nav.values[-1]-1 # compound returns
                mktrors[episode]=df.mkt_nav.values[-1]-1

                alldf = df if alldf is None else pd.concat([alldf,df], axis=0)

                feed = {self._tf_x: epx, self._tf_epr: epr, self._tf_y: epy}
                _ = self._sess.run(self._train_op,feed) # parameter update

                if episode % log_freq == 0:
                    log.info('year #%6d, mean reward: %8.4f, sim ret: %8.4f, mkt ret: %8.4f, net: %8.4f', episode,
                             running_reward, simrors[episode],mktrors[episode], simrors[episode]-mktrors[episode])
                    save_path = self._saver.save(self._sess, model_dir+'model.ckpt',
                                                 global_step=episode+1)
                    if episode > 100:
                        vict = pd.DataFrame( { 'sim': simrors[episode-100:episode],
                                               'mkt': mktrors[episode-100:episode] } )
                        vict['net'] = vict.sim - vict.mkt
                        if vict.net.mean() > 0.0:
                            victory = True
                            log.info('Congratulations, Warren Buffett!  You won the trading game.')
                    #print("Model saved in file: {}".format(save_path))

                episode += 1
                observation = env.reset()
                reward_sum = 0
                day = 0

        return alldf, pd.DataFrame({'simror':simrors,'mktror':mktrors})

Problems with an Estimator API implementation:

  • I have a policy-gradient-based reinforcement learning code base built around a single neural network
  • There are two session.run() calls during execution: one runs at every step of every episode, and the other runs once at the end of each episode
  • tf.train.SessionRunHook does not fit this structure (a rough sketch of the alternative I am considering is below)
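If the low-level session path works, one idea I am considering is to keep the per-step action sampling on the CPU and wrap only the per-episode parameter update in tf.contrib.tpu.rewrite. This is a rough, untested sketch: _build_loss is a hypothetical helper standing in for however the policy-gradient loss behind self._train_op is actually built, and the optimizer and learning rate are arbitrary placeholders:

# Sketch only: would live inside the graph-building method of the same class.
def _tpu_train_step(x, y, epr):
    # _build_loss is hypothetical; it stands in for the policy-gradient
    # loss currently attached to self._train_op.
    loss = self._build_loss(x, y, epr)
    optimizer = tf.contrib.tpu.CrossShardOptimizer(
        tf.train.RMSPropOptimizer(learning_rate=1e-3))
    train_op = optimizer.minimize(loss)
    with tf.control_dependencies([train_op]):
        return tf.identity(loss)

# Replaces the plain self._train_op; train_model would feed it the same
# epx/epy/epr batches at the end of each episode. Note that XLA/TPU needs
# static shapes, so the variable-length episode batch would have to be
# padded to a fixed size.
self._train_op = tf.contrib.tpu.rewrite(
    _tpu_train_step, [self._tf_x, self._tf_y, self._tf_epr])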

0 Answers:

There are no answers to this question yet.