I want to create an agent with TensorFlow. I have 9 action categories: roll left, roll right, brake, and so on. The output of the TensorFlow pipeline is an array of shape [9], and based on it I simulate the corresponding WSAD key presses. Sometimes, though, I want to pick a random action, not uniformly at random but weighted by the dense softmax output. The function I want is exactly numpy.random.multinomial, but tensorflow.random.multinomial only returns the index of the chosen action, not a tensor with the same shape as the input. I tried saving the actions and feeding them back to the agent later during the teaching phase, but the example I am basing this on would then require feeding actions during the play phase, which I do not need. I know this could be done with tensorflow.cond and tensorflow.equal, but the pipeline would look like a mess and I am not sure about the performance. In other words, is there a TensorFlow function that behaves like numpy.random.multinomial, or is there a reason it does not exist and my agent's architecture is wrong?
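For reference, a minimal sketch of the difference I mean (TensorFlow 1.x, 1.13 or later assumed; the probabilities here are made up):

import numpy as np
import tensorflow as tf

probs = [0.7, 0.1, 0.1, 0.1]   # example softmax output, 4 actions for brevity

# numpy: one draw gives back a one-hot style vector of the same length as the input
print(np.random.multinomial(1, probs))              # e.g. [1 0 0 0]

# tensorflow: only the sampled index comes back, not a vector
sample = tf.random.multinomial(tf.log([probs]), 1)  # expects log-probabilities / logits
with tf.Session() as sess:
    print(sess.run(sample))                         # e.g. [[0]]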
The agent itself:
class agentY():
    def __init__(self, lr, s_size, a_size, h_size):
        self.state_in = tf.placeholder(shape=[None] + list(s_size), dtype=tf.float32)
        conv1 = tf.layers.conv2d(self.state_in, 32, 4, strides=(4, 4))
        max_pool1 = tf.layers.max_pooling2d(conv1, 32, 4)
        flatten = tf.layers.flatten(max_pool1)
        hidden = tf.layers.dense(flatten, 4096, activation=tf.nn.tanh)
        hidden_action = tf.layers.dense(hidden, 2048, activation=tf.nn.elu)
        self.action = tf.layers.dense(hidden_action, 9, activation=tf.nn.softmax)

        self.action_in = tf.placeholder(shape=[None, 9], dtype=tf.float32, name='acin')
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.action_in,
                                                                   logits=self.action)
        optimizer = tf.train.AdamOptimizer(lr)
        grads_and_vars = optimizer.compute_gradients(cross_entropy)
        self.gradients = [grad for grad, variable in grads_and_vars]
        self.gradient_placeholders = []
        grads_and_vars_feed = []
        for grad, variable in grads_and_vars:
            gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
            self.gradient_placeholders.append(gradient_placeholder)
            grads_and_vars_feed.append((gradient_placeholder, variable))
        self.training_op = optimizer.apply_gradients(grads_and_vars_feed)
Play phase:
state = get_state()
action = sess.run([myAgent.action], feed_dict={myAgent.state_in: [state]})
action = numpy.random.multinomial(1, action[0][0])
if do_action:
    releaseKeys()
    update_pressed_keys(categoriesToKeys(action))
reward = reward + delta_time
current_rewards.append(reward)
current_gradients.append(myAgent.gradients)
Teaching phase:
def teach_agent(agent, all_rewards, all_gradients, sess):
    # discount and normalize first, then iterate over the processed rewards
    all_rewards = discount_and_normalize_rewards(all_rewards, 0.99)
    feed_dict = {}
    for var_index, gradient_placeholder in enumerate(agent.gradient_placeholders):
        mean_gradients = np.mean([reward * all_gradients[game_index][step][var_index]
                                  for game_index, rewards in enumerate(all_rewards)
                                  for step, reward in enumerate(rewards)], axis=0)
        feed_dict[gradient_placeholder] = mean_gradients
    sess.run(agent.training_op, feed_dict=feed_dict)
The teaching phase has not been tested yet. The code is based on the book Hands-On Machine Learning with Scikit-Learn and TensorFlow.
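discount_and_normalize_rewards is not shown here; assuming it follows the book's policy-gradient example, it looks roughly like this:

import numpy as np

def discount_rewards(rewards, discount_rate):
    # accumulate backwards so each step gets credit for the rewards that follow it
    discounted = np.zeros(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    # normalize across all games so rewards are comparable between episodes
    all_discounted = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(rewards - mean) / std for rewards in all_discounted]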
Answer 0 (score: 0):
I managed to run tensorflow.nn.softmax_cross_entropy_with_logits_v2() with labels and logits of shape [None, 1], where that tensor held the action index (the category). I then got a crash on the GPU side and realized I had been doing it wrong the whole time, forgetting one of the most important tools: one-hot encoding. Instead of the index that multinomial computes, I should be using its one-hot encoded result. Example below:
import tensorflow as tf
import numpy as np

p = tf.placeholder(shape=[None, 4], dtype=tf.float32)
t = tf.nn.softmax(p)
t1 = tf.random.categorical(tf.log(t), 1)
t2 = tf.one_hot(t1, 4,
                on_value=1.0, off_value=0.0,
                axis=-1)

with tf.Session() as sess:
    inArray = [[0.8, 0.5, 0.1, 0.2]]
    index, outArray = sess.run([t1, t2], feed_dict={p: inArray})
    print("Index:", index)
    print("Array:", outArray)
This is of course a rookie mistake; I am new to ML and TensorFlow has been hard for me to wrap my head around. The agent now looks like this:
class agentY():
    def __init__(self, lr, s_size, a_size, h_size):
        self.state_in = tf.placeholder(shape=[None] + list(s_size), dtype=tf.float32)
        conv1 = tf.layers.conv2d(self.state_in, 32, 4, strides=(4, 4))
        max_pool1 = tf.layers.max_pooling2d(conv1, 32, 4)
        flatten = tf.layers.flatten(max_pool1)
        hidden = tf.layers.dense(flatten, 4096, activation=tf.nn.tanh)
        hidden_action = tf.layers.dense(hidden, 2048, activation=tf.nn.elu)
        self.action_logits = tf.layers.dense(hidden_action, 9, activation=tf.nn.softmax)

        # log() so that multinomial samples in proportion to the softmax output
        # (as in the standalone example above); [:, 0] drops the num_samples axis
        # so the one-hot label ends up with shape [None, 9]
        sampled_action = tf.multinomial(tf.log(self.action_logits), 1)[:, 0]
        self.action_out = tf.one_hot(sampled_action, 9,
                                     on_value=1.0, off_value=0.0, axis=-1)

        cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.action_out,
                                                                   logits=self.action_logits)
        optimizer = tf.train.AdamOptimizer(lr)
        grads_and_vars = optimizer.compute_gradients(cross_entropy)
        self.gradients = [grad for grad, variable in grads_and_vars]
        self.gradient_placeholders = []
        grads_and_vars_feed = []
        for grad, variable in grads_and_vars:
            gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
            self.gradient_placeholders.append(gradient_placeholder)
            grads_and_vars_feed.append((gradient_placeholder, variable))
        self.training_op = optimizer.apply_gradients(grads_and_vars_feed)

tf.reset_default_graph()
testAgent = agentY(0.1, (300, 400, 1), 9, 11)
The problem now is that I fetch the gradients for every single decision the agent makes. This needs a lot of RAM, so I definitely would not recommend it. Take a look below (a lighter-weight sketch follows the snippet):
while True:
    time0 = time.time()
    if collectData:
        state = get_state()
        action_out, gradients = sess.run([myAgent.action_out, myAgent.gradients],
                                         feed_dict={myAgent.state_in: [state]})
        if do_action:
            releaseKeys()
            update_pressed_keys(categoriesToKeys(action_out[0]))  # action_out has shape [1, 9]
        reward = reward + delta_time
        current_rewards.append(reward)
        current_gradients.append(gradients)
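One way I could avoid storing a full set of gradients per step (just a sketch under my own assumptions, not code from the book): store only the states and the sampled one-hot actions during play, and let the optimizer compute the gradients in a single pass at teaching time by feeding those actions and the discounted rewards back through placeholders. The class agentY2 and its placeholder names are made up for illustration:

import tensorflow as tf

class agentY2():
    def __init__(self, lr, s_size, a_size):
        self.state_in = tf.placeholder(shape=[None] + list(s_size), dtype=tf.float32)
        # one-hot actions that were actually taken during play
        self.action_in = tf.placeholder(shape=[None, a_size], dtype=tf.float32)
        # discounted, normalized reward for each stored step
        self.reward_in = tf.placeholder(shape=[None], dtype=tf.float32)

        hidden = tf.layers.dense(tf.layers.flatten(self.state_in), 256, activation=tf.nn.elu)
        logits = tf.layers.dense(hidden, a_size)
        self.action_probs = tf.nn.softmax(logits)
        # sampled one-hot action to fetch during play and store next to the state
        self.action_out = tf.one_hot(tf.multinomial(logits, 1)[:, 0], a_size)

        # REINFORCE-style loss: per-step cross-entropy weighted by its reward
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.action_in,
                                                                   logits=logits)
        loss = tf.reduce_mean(self.reward_in * cross_entropy)
        self.training_op = tf.train.AdamOptimizer(lr).minimize(loss)

During play only state and the fetched action vector get appended to lists; a single sess.run(agent.training_op, ...) over the stored batch then replaces the per-step gradient bookkeeping, so each step costs one state array plus an a_size-element action vector of memory instead of a copy of every gradient tensor.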
Later on I use these gradients in the teach_agent() function to feed the rewards back into the agent's network (teach_agent is posted above in the question). Before I go back to the book and try to work through its next example, a Q-learning agent, could someone (if possible) explain Q-learning, or other approaches to reinforcement learning, in a simple way?