EDIT (1/3/16): corresponding GitHub issue
I am using TensorFlow (Python interface) to implement a q-learning agent with function approximation, trained with stochastic gradient descent. At each iteration of the experiment, a step function in the agent is called that updates the parameters of the approximator based on the new reward and activation, and then chooses a new action to perform.
Here is the problem (with reinforcement learning jargon):
The agent computes its state-action value predictions to choose an action.
Then it gives control back to another program, which simulates a step in the environment.
Now the agent's step function is called for the next iteration. I want to use TensorFlow's Optimizer class to compute the gradients for me. However, this requires both the state-action value predictions I computed in the last step and their graph. So:
If I run the optimizer on the whole graph, it has to recompute the state-action value predictions.
But if I store the prediction (for the chosen action) as a variable and then feed it to the optimizer through a placeholder, it no longer has the graph it needs to compute the gradients.
I can't just run everything in the same sess.run() call, because I have to give up control and return the chosen action in order to get the next observation and reward (which go into the target of the loss function).
So, is there a way that I can (without reinforcement learning jargon):
Compute part of my graph, returning value1.
Return value1 to the calling program so it can compute value2.
In the next iteration, use value2 as part of my loss function for gradient descent, WITHOUT recomputing the part of the graph that computes value1. (A toy sketch of the issue follows this list.)
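To make the issue concrete, here is a toy sketch (not my real network, names are purely illustrative) of why the second option above breaks: once value1 comes back in through a placeholder, tf.gradients can no longer trace it back to the weights, while keeping the original graph means the training op recomputes it:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[1, 3], name="observation")
w = tf.Variable(tf.ones([3, 1]), name="weights")
value1 = tf.matmul(x, w)  # the part of the graph I want to compute once and reuse

# Feeding value1 back in through a placeholder (option 2 above):
value1_fed = tf.placeholder(tf.float32, shape=[1, 1], name="stored_prediction")
value2 = tf.placeholder(tf.float32, shape=[1, 1], name="target")  # computed by the caller
loss_broken = tf.square(value2 - value1_fed)
print(tf.gradients(loss_broken, [w]))  # [None] -- the path to w is gone

# Keeping value1 in the graph works, but then the training op recomputes it:
loss_ok = tf.square(value2 - value1)
print(tf.gradients(loss_ok, [w]))  # an actual gradient tensor
```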
Of course, I have considered the obvious solutions:
Just hardcode the gradients: this would be easy for the really simple approximator I'm using now, but would be very inconvenient if I were experimenting with different filters and activation functions in a big convolutional network. I'd really like to use the Optimizer class if at all possible.
Call the environment simulation from within the agent: this system does that, but it would make mine more complicated and strip away a lot of the modularity and structure. So I don't want to do this.
I've read through the API and the whitepaper several times, but can't seem to come up with a solution. I was trying to think of some way to feed the target into the graph to compute the gradients, but couldn't come up with a way to build that graph automatically.
If it turns out this just isn't possible in TensorFlow yet, do you think it would be very complicated to implement it as a new operator? (I haven't used C++ in a couple of years, so the TensorFlow source looks a little intimidating.) Or would I be better off switching to something like Torch, which has imperative differentiation via Autograd instead of symbolic differentiation?
Thanks for taking the time to help me with this. I've tried to keep it as concise as I could.
EDIT: After doing some further searching I came across this previously asked question. It's a little different from mine (they are trying to avoid updating an LSTM network twice every iteration in Torch), and it doesn't have any answers yet.
Here is some code, if it helps:
```python
'''
-Q-Learning agent for a grid-world environment.
-Receives input as raw rbg pixel representation of screen.
-Uses an artificial neural network function approximator with one hidden layer

2015 Jonathon Byrd
'''

import random
import sys
#import copy

from rlglue.agent.Agent import Agent
from rlglue.agent import AgentLoader as AgentLoader
from rlglue.types import Action
from rlglue.types import Observation

import tensorflow as tf
import numpy as np

world_size = (3,3)
total_spaces = world_size[0] * world_size[1]

class simple_agent(Agent):

    #Contants
    discount_factor = tf.constant(0.5, name="discount_factor")
    learning_rate = tf.constant(0.01, name="learning_rate")
    exploration_rate = tf.Variable(0.2, name="exploration_rate") # used to be a constant :P
    hidden_layer_size = 12

    #Network Parameters - weights and biases
    W = [tf.Variable(tf.truncated_normal([total_spaces * 3, hidden_layer_size], stddev=0.1), name="layer_1_weights"),
         tf.Variable(tf.truncated_normal([hidden_layer_size,4], stddev=0.1), name="layer_2_weights")]
    b = [tf.Variable(tf.zeros([hidden_layer_size]), name="layer_1_biases"),
         tf.Variable(tf.zeros([4]), name="layer_2_biases")]

    #Input placeholders - observation and reward
    screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="observation") #input pixel rgb values
    reward = tf.placeholder(tf.float32, shape=[], name="reward")

    #last step data
    last_obs = np.array([1, 2, 3], ndmin=4)
    last_act = -1

    #Last step placeholders
    last_screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="previous_observation")
    last_move = tf.placeholder(tf.int32, shape = [], name="previous_action")
    next_prediction = tf.placeholder(tf.float32, shape = [], name="next_prediction")

    step_count = 0

    def __init__(self):
        #Initialize computational graphs
        self.q_preds = self.Q(self.screen)
        self.last_q_preds = self.Q(self.last_screen)
        self.action = self.choose_action(self.q_preds)
        self.next_pred = self.max_q(self.q_preds)
        self.last_pred = self.act_to_pred(self.last_move, self.last_q_preds) # inefficient recomputation
        self.loss = self.error(self.last_pred, self.reward, self.next_prediction)
        self.train = self.learn(self.loss)
        #Summaries and Statistics
        tf.scalar_summary(['loss'], self.loss)
        tf.scalar_summary('reward', self.reward)
        #w_hist = tf.histogram_summary("weights", self.W[0])
        self.summary_op = tf.merge_all_summaries()
        self.sess = tf.Session()
        self.summary_writer = tf.train.SummaryWriter('tensorlogs', graph_def=self.sess.graph_def)

    def agent_init(self,taskSpec):
        print("agent_init called")
        self.sess.run(tf.initialize_all_variables())

    def agent_start(self,observation):
        #print("agent_start called, observation = {0}".format(observation.intArray))
        o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255)
        return self.control(o)

    def agent_step(self,reward, observation):
        #print("agent_step called, observation = {0}".format(observation.intArray))
        print("step, reward: {0}".format(reward))
        o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255)

        next_prediction = self.sess.run([self.next_pred], feed_dict={self.screen:o})[0]

        if self.step_count % 10 == 0:
            summary_str = self.sess.run([self.summary_op, self.train],
                feed_dict={self.reward:reward, self.last_screen:self.last_obs,
                           self.last_move:self.last_act, self.next_prediction:next_prediction})[0]
            self.summary_writer.add_summary(summary_str, global_step=self.step_count)
        else:
            self.sess.run([self.train],
                feed_dict={self.screen:o, self.reward:reward, self.last_screen:self.last_obs,
                           self.last_move:self.last_act, self.next_prediction:next_prediction})

        return self.control(o)

    def control(self, observation):
        results = self.sess.run([self.action], feed_dict={self.screen:observation})
        action = results[0]

        self.last_act = action
        self.last_obs = observation

        if (action==0):  # convert action integer to direction character
            action = 'u'
        elif (action==1):
            action = 'l'
        elif (action==2):
            action = 'r'
        elif (action==3):
            action = 'd'
        returnAction=Action()
        returnAction.charArray=[action]
        #print("return action returned {0}".format(action))
        self.step_count += 1
        return returnAction

    def Q(self, obs):  #calculates state-action value prediction with feed-forward neural net
        with tf.name_scope('network_inference') as scope:
            h1 = tf.nn.relu(tf.matmul(obs, self.W[0]) + self.b[0])
            q_preds = tf.matmul(h1, self.W[1]) + self.b[1] #linear activation
            return tf.reshape(q_preds, shape=[4])

    def choose_action(self, q_preds):  #chooses action epsilon-greedily
        with tf.name_scope('action_choice') as scope:
            exploration_roll = tf.random_uniform([])
            #greedy_action = tf.argmax(q_preds, 0)  # gets the action with the highest predicted Q-value
            #random_action = tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)

            #exploration rate updates
            #if self.step_count % 10000 == 0:
                #self.exploration_rate.assign(tf.div(self.exploration_rate, 2))

            return tf.select(tf.greater_equal(exploration_roll, self.exploration_rate),
                tf.argmax(q_preds, 0),   #greedy_action
                tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64))  #random_action

        '''
        Why does this return NoneType?:

        flag = tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 'g', 'r')
        if flag == 'g':  #greedy
            return tf.argmax(q_preds, 0) # gets the action with the highest predicted Q-value
        elif flag == 'r':  #random
            return tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)
        '''

    def error(self, last_pred, r, next_pred):
        with tf.name_scope('loss_function') as scope:
            y = tf.add(r, tf.mul(self.discount_factor, next_pred))  #target
            return tf.square(tf.sub(y, last_pred)) #squared difference error

    def learn(self, loss): #Update parameters using stochastic gradient descent
        #TODO: Either figure out how to avoid computing the q-prediction twice or just hardcode the gradients.
        with tf.name_scope('train') as scope:
            return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(loss,
                var_list=[self.W[0], self.W[1], self.b[0], self.b[1]])

    def max_q(self, q_preds):
        with tf.name_scope('greedy_estimate') as scope:
            return tf.reduce_max(q_preds)  #best predicted action from current state

    def act_to_pred(self, a, preds): #get the value prediction for action a
        with tf.name_scope('get_prediction') as scope:
            return tf.slice(preds, tf.reshape(a, shape=[1]), [1])

    def agent_end(self,reward):
        pass

    def agent_cleanup(self):
        self.sess.close()
        pass

    def agent_message(self,inMessage):
        if inMessage=="what is your name?":
            return "my name is simple_agent";
        else:
            return "I don't know how to respond to your message";

if __name__=="__main__":
    AgentLoader.loadAgent(simple_agent())
```
gdahl · 14
Right now, what you want to do is very difficult in TensorFlow (0.6). Your best bet is to bite the bullet and call run multiple times, at the cost of recomputing the activations. However, we are very much aware of this issue internally. A prototype "partial run" solution is in the works, but there is no timeline for its completion right now. Since a truly satisfactory answer might require modifying TensorFlow itself, you could also open a GitHub issue for this and see whether anyone else has anything to say about it there.
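Concretely, with the graph from the question, "calling run multiple times" just means separate run calls per agent step, accepting the recomputation (a sketch only, reusing the attribute names from the question's code):

```python
# Inside agent_step(reward, observation), after preprocessing the new screen o:

# Run 1: value1 = max_a Q(o, a), needed for the bootstrap target.
next_prediction = sess.run(agent.next_pred, feed_dict={agent.screen: o})

# Run 2: the train op recomputes Q(last_screen) from scratch inside its own
# subgraph; the bootstrap value is simply fed back in as a number.
sess.run(agent.train,
         feed_dict={agent.reward: reward,
                    agent.last_screen: agent.last_obs,
                    agent.last_move: agent.last_act,
                    agent.next_prediction: next_prediction})

# Run 3 (inside control): pick the action for o and hand it back to RL-Glue.
action = sess.run(agent.action, feed_dict={agent.screen: o})
```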
Edit: Experimental support for partial_run is now in: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/client/session.py#L317
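The docstring at that link sketches the intended usage roughly like this (experimental API, so subject to change): declare all fetches and feeds once, then evaluate the graph in stages without recomputing the earlier pieces:

```python
import tensorflow as tf

a = tf.placeholder(tf.float32, shape=[])
b = tf.placeholder(tf.float32, shape=[])
c = tf.placeholder(tf.float32, shape=[])
r1 = tf.add(a, b)   # "value1": computed in the first stage
r2 = tf.mul(r1, c)  # uses r1 later without recomputing it

sess = tf.Session()
# Declare up front every fetch and feed that the partial run will ever use.
h = sess.partial_run_setup([r1, r2], [a, b, c])

res1 = sess.partial_run(h, r1, feed_dict={a: 1, b: 2})  # returns 3.0
# ... hand res1 back to the caller, come back with the next value ...
res2 = sess.partial_run(h, r2, feed_dict={c: res1})     # r1 is not recomputed
```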