How can I use a Tensorflow Optimizer without recomputing activations in a reinforcement learning program that returns control after each iteration?


Edit (1/3/16): corresponding github issue

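I'm using Tensorflow (Python interface) to implement a q-learning agent whose function approximator is trained with stochastic gradient descent. At each iteration of the experiment, a step function in the agent is called. It updates the approximator's parameters based on the new reward and activation, and then chooses a new action to perform.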
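For reference, the update that the step function performs can be written out as follows. This is read off the `error` and `max_q` functions in the code further down; the symbols are assumptions for illustration: $s, a$ are the previous observation and action, $s'$ is the new observation, $r$ is the reward, and $\gamma$ is `discount_factor`.

```latex
y = r + \gamma \max_{a'} Q(s', a'), \qquad
L = \bigl(y - Q(s, a)\bigr)^2
```

The difficulty described below is that $Q(s, a)$ is computed in one iteration, while $r$ and $s'$ only become available in the next.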


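  • The agent computes its state-action value predictions in order to choose an action.

  • It then gives control back to another program, which simulates a step in the environment.

  • Now the agent's step function is called for the next iteration. I want to use Tensorflow's Optimizer class to compute the gradients for me. However, this requires both the state-action value predictions that I computed in the last step and their graph. So:

  • If I run the optimizer on the whole graph, it has to recompute the state-action value predictions.

  • But if I store the prediction (for the chosen action) as a variable and then feed it to the optimizer as a placeholder, it no longer has the graph it needs to compute the gradients.

  • I can't just run it all in the same sess.run() statement, because I have to give up control and return the chosen action in order to get the next observation and reward (which are used in the target of the loss function).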
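The second point above (a stored prediction loses its graph) can be illustrated with a toy scalar autodiff. This is a minimal sketch in plain Python, not TensorFlow; the `Node` class, `squared_error`, and `demo` are hypothetical names made up for the illustration.

```python
class Node:
    """Scalar value that remembers how it was computed (a tiny symbolic graph)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent_node, local_gradient)
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def backward(self, upstream=1.0):
        # Accumulate the gradient and push it along every path to the inputs.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


def squared_error(target, pred):
    diff = target - pred
    return diff * diff


def demo():
    # Case 1: the loss is built on the prediction node itself.
    w, x, target = Node(0.5), Node(2.0), Node(3.0)
    squared_error(target, w * x).backward()
    grad_with_graph = w.grad

    # Case 2: only the prediction's number is kept and fed back in,
    # like feeding a stored value through a placeholder.
    w2, x2, target2 = Node(0.5), Node(2.0), Node(3.0)
    stored = Node((w2 * x2).value)  # same value, but no link back to w2
    squared_error(target2, stored).backward()
    grad_detached = w2.grad

    return grad_with_graph, grad_detached
```

In the first case the gradient reaches `w`; in the second case the loss has the same value, but nothing connects it to `w2`, so its gradient stays at zero.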


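So, is there a way that I can:

  • Compute part of my graph, returning value1.

  • Return value1 to the calling program, which computes value2.

  • In the next iteration, use value2 as part of my loss function for gradient descent, without recomputing the part of the graph that computes value1.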
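The control flow being asked for is essentially that of a coroutine. The sketch below shows the pattern with a plain Python generator; it only illustrates the hand-off between agent and caller and says nothing about whether TensorFlow can suspend a graph evaluation this way. `agent_loop`, `compute_value1`, `update`, and `run_demo` are hypothetical names.

```python
def agent_loop(compute_value1, update):
    """Pause after producing value1, resume once the caller sends value2.

    compute_value1 and update are hypothetical callables standing in for
    the forward pass and the training step.
    """
    cached = None
    value2 = None
    while True:
        if cached is not None:
            # Resume: combine value2 with what was kept from before the pause.
            update(cached, value2)
        cached = compute_value1()
        # Hand value1 to the caller and wait for value2 to be sent back.
        value2 = yield cached


def run_demo(steps=3):
    log = []
    counter = iter(range(100))
    agent = agent_loop(lambda: next(counter),
                       lambda v1, v2: log.append((v1, v2)))
    value1 = next(agent)  # run up to the first yield
    for _ in range(steps):
        value2 = value1 * 10  # the "environment" derives value2 from value1
        value1 = agent.send(value2)
    return log
```

Each `(value1, value2)` pair in the log is handled without calling `compute_value1` a second time for the same step, which is the behaviour the three steps above describe.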


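Of course, I have considered the obvious solutions:

  • Just hardcode the gradients: this would be easy for the really simple approximators I'm using now, but would be really inconvenient if I were experimenting with different filters and activation functions in a big convolutional network. I'd really like to use the Optimizer class if possible.

  • Call the environment simulation from within the agent: This system does this, but it would make mine more complicated and remove a lot of the modularity and structure. So, I don't want to do this.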
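To make the first option concrete, here is a minimal sketch of hardcoded gradients for a linear approximator that caches its forward pass between steps. It is dependency-free Python rather than TensorFlow, it is greedy (no exploration), and the `LinearQ` class with its method names is an assumption for illustration; the default learning rate and discount mirror the constants in the code below.

```python
class LinearQ:
    """Linear Q-function Q(s, a) = w[a] . s + b[a] with hand-derived gradients."""

    def __init__(self, n_features, n_actions, learning_rate=0.01, discount=0.5):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.b = [0.0] * n_actions
        self.learning_rate = learning_rate
        self.discount = discount
        self.cache = None  # (state, action, prediction) from the previous step

    def predict(self, state):
        return [sum(w_i * s_i for w_i, s_i in zip(row, state)) + bias
                for row, bias in zip(self.w, self.b)]

    def act(self, state):
        """Pick the greedy action and cache what the next update will need."""
        q = self.predict(state)
        action = max(range(len(q)), key=q.__getitem__)
        self.cache = (list(state), action, q[action])
        return action

    def update(self, reward, next_q_max):
        """One SGD step on (target - Q(s, a))^2 using the cached forward pass."""
        state, action, pred = self.cache
        target = reward + self.discount * next_q_max
        # dL/dQ = -2 * (target - Q); dQ/dw[a] = state; dQ/db[a] = 1.
        dloss_dq = -2.0 * (target - pred)
        for i, s_i in enumerate(state):
            self.w[action][i] -= self.learning_rate * dloss_dq * s_i
        self.b[action] -= self.learning_rate * dloss_dq
        return (target - pred) ** 2
```

Because `update` is called before the next `act`, the weights do not change between caching a prediction and using it, so the cached value is never stale. The drawback is the one noted above: the two gradient lines have to be rederived by hand for every new architecture.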

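I've read through the API and whitepaper several times, but can't seem to come up with a solution. I tried to find some way to feed the target into a graph that calculates the gradients, but couldn't come up with a way to build that graph automatically.

If it turns out that this isn't possible in TensorFlow yet, do you think it would be very complicated to implement it as a new operator? (I haven't used C in a couple of years, so the TensorFlow source looks a little intimidating.) Or would I be better off switching to something like Torch, which has imperative differentiation (Autograd) instead of symbolic differentiation?

Thanks for taking the time to help me out with this. I tried to keep it as concise as I could.

Edit: After some further searching, I came across this previously asked question. It's a little different from mine (they are trying to avoid updating an LSTM network twice every iteration in Torch), and it doesn't have any answers yet.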


"""
-Q-Learning agent for a grid-world environment.
-Receives input as raw RGB pixel representation of screen.
-Uses an artificial neural network function approximator with one hidden layer

2015 Jonathon Byrd
"""

import random
import sys
#import copy
from rlglue.agent.Agent import Agent
from rlglue.agent import AgentLoader as AgentLoader
from rlglue.types import Action
from rlglue.types import Observation

import tensorflow as tf
import numpy as np

world_size = (3,3)
total_spaces = world_size[0] * world_size[1]

class simple_agent(Agent):

    discount_factor = tf.constant(0.5, name="discount_factor")
    learning_rate = tf.constant(0.01, name="learning_rate")
    exploration_rate = tf.Variable(0.2, name="exploration_rate")  # used to be a constant :P
    hidden_layer_size = 12

    #Network Parameters - weights and biases
    W = [tf.Variable(tf.truncated_normal([total_spaces * 3, hidden_layer_size], stddev=0.1), name="layer_1_weights"), 
    tf.Variable(tf.truncated_normal([hidden_layer_size,4], stddev=0.1), name="layer_2_weights")]
    b = [tf.Variable(tf.zeros([hidden_layer_size]), name="layer_1_biases"), tf.Variable(tf.zeros([4]), name="layer_2_biases")]

    #Input placeholders - observation and reward
    screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="observation") #input pixel rgb values
    reward = tf.placeholder(tf.float32, shape=[], name="reward")

    #last step data
    last_obs = np.array([1, 2, 3], ndmin=4)
    last_act = -1

    #Last step placeholders
    last_screen = tf.placeholder(tf.float32, shape=[1, total_spaces * 3], name="previous_observation")
    last_move = tf.placeholder(tf.int32, shape = [], name="previous_action")

    next_prediction = tf.placeholder(tf.float32, shape = [], name="next_prediction")

    step_count = 0

    def __init__(self):
        #Initialize computational graphs
        self.q_preds = self.Q(self.screen)
        self.last_q_preds = self.Q(self.last_screen)
        self.action = self.choose_action(self.q_preds)
        self.next_pred = self.max_q(self.q_preds)
        self.last_pred = self.act_to_pred(self.last_move, self.last_q_preds) # inefficient recomputation
        self.loss = self.error(self.last_pred, self.reward, self.next_prediction)
        self.train = self.learn(self.loss)
        #Summaries and Statistics
        tf.scalar_summary(['loss'], self.loss)
        tf.scalar_summary('reward', self.reward)
        #w_hist = tf.histogram_summary("weights", self.W[0])
        self.summary_op = tf.merge_all_summaries()
        self.sess = tf.Session()
        self.summary_writer = tf.train.SummaryWriter('tensorlogs', graph_def=self.sess.graph_def)

    def agent_init(self,taskSpec):
        print("agent_init called")

    def agent_start(self,observation):
        #print("agent_start called, observation = {0}".format(observation.intArray))
        o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255.0)  # float divisor avoids integer division under Python 2
        return self.control(o)

    def agent_step(self,reward, observation):
        #print("agent_step called, observation = {0}".format(observation.intArray))
        print("step, reward: {0}".format(reward))
        o = np.divide(np.reshape(np.asarray(observation.intArray), (1,total_spaces * 3)), 255.0)  # float divisor avoids integer division under Python 2

        next_prediction = self.sess.run([self.next_pred], feed_dict={self.screen:o})[0]

        if self.step_count % 10 == 0:
            # Every 10th step: train and also record summaries
            summary_str = self.sess.run([self.summary_op, self.train], 
                feed_dict={self.reward:reward, self.last_screen:self.last_obs, 
                self.last_move:self.last_act, self.next_prediction:next_prediction})[0]

            self.summary_writer.add_summary(summary_str, global_step=self.step_count)
        else:
            # Otherwise: train only
            self.sess.run([self.train], 
                feed_dict={self.screen:o, self.reward:reward, self.last_screen:self.last_obs, 
                self.last_move:self.last_act, self.next_prediction:next_prediction})

        return self.control(o)

    def control(self, observation):
        results = self.sess.run([self.action], feed_dict={self.screen:observation})
        action = results[0]

        self.last_act = action
        self.last_obs = observation

        if (action==0):  # convert action integer to direction character
            action = 'u'
        elif (action==1):
            action = 'l'
        elif (action==2):
            action = 'r'
        elif (action==3):
            action = 'd'
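        # Wrap the direction character in an RL-Glue Action
        # (assumes the environment reads the move from charArray)
        returnAction = Action()
        returnAction.charArray = [action]
        #print("return action returned {0}".format(action))
        self.step_count += 1
        return returnAction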

    def Q(self, obs):  #calculates state-action value prediction with feed-forward neural net
        with tf.name_scope('network_inference') as scope:
            h1 = tf.nn.relu(tf.matmul(obs, self.W[0]) + self.b[0])
            q_preds = tf.matmul(h1, self.W[1]) + self.b[1] #linear activation
            return tf.reshape(q_preds, shape=[4])

    def choose_action(self, q_preds):  #chooses action epsilon-greedily
        with tf.name_scope('action_choice') as scope:
            exploration_roll = tf.random_uniform([])
            #greedy_action = tf.argmax(q_preds, 0)  # gets the action with the highest predicted Q-value
            #random_action = tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)

            #exploration rate updates
            #if self.step_count % 10000 == 0:
                #self.exploration_rate.assign(tf.div(self.exploration_rate, 2))

            return tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 
                tf.argmax(q_preds, 0),   #greedy_action
                tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64))  #random_action

        # Why does this return NoneType?:
        #
        # flag = tf.select(tf.greater_equal(exploration_roll, self.exploration_rate), 'g', 'r')
        # if flag == 'g':  #greedy
        #     return tf.argmax(q_preds, 0) # gets the action with the highest predicted Q-value
        # elif flag == 'r':  #random
        #     return tf.cast(tf.floor(tf.random_uniform([], maxval=4.0)), tf.int64)

    def error(self, last_pred, r, next_pred):
        with tf.name_scope('loss_function') as scope:
            y = tf.add(r, tf.mul(self.discount_factor, next_pred)) #target
            return tf.square(tf.sub(y, last_pred)) #squared difference error

    def learn(self, loss): #Update parameters using stochastic gradient descent
        #TODO:  Either figure out how to avoid computing the q-prediction twice or just hardcode the gradients.
        with tf.name_scope('train') as scope:
            return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(loss, var_list=[self.W[0], self.W[1], self.b[0], self.b[1]])

    def max_q(self, q_preds):
        with tf.name_scope('greedy_estimate') as scope:
            return tf.reduce_max(q_preds)  #best predicted action from current state

    def act_to_pred(self, a, preds): #get the value prediction for action a
        with tf.name_scope('get_prediction') as scope:
            return tf.slice(preds, tf.reshape(a, shape=[1]), [1])

    def agent_end(self,reward):
        pass

    def agent_cleanup(self):
        pass

    def agent_message(self,inMessage):
        if inMessage=="what is your name?":
            return "my name is simple_agent"
        else:
            return "I don't know how to respond to your message"

if __name__=="__main__":
    AgentLoader.loadAgent(simple_agent())

1 Answer


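    Right now, what you want to do is very difficult in Tensorflow (0.6). Your best bet is to bite the bullet and call run multiple times, at the cost of recomputing the activations. However, we are very aware of this issue internally. A prototype "partial run" solution is in the works, but there is currently no timeline for its completion. Since a truly satisfactory answer might require modifying tensorflow itself, you could also open a github issue for this and see whether anyone else has anything to say about it.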

