Custom Gym environment for processing a step function with a DDPG agent

Date: 2019-07-08 08:32:32

Tags: reinforcement-learning openai-gym reward

I am new to reinforcement learning and I would like to use this technique to process audio signals. I built a basic step function that I wish to flatten, in order to get my hands on Gym from OpenAI and on reinforcement learning in general.

To do this, I use the GoalEnv provided by OpenAI, since I know what the goal is: a flat signal. Here is an image of the input and desired signals:

image taken from https://imgur.com/pgdlTWK
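
For reference, the two curves in the image can be reproduced from the signal definitions used further down in this post (a minimal plotting sketch; matplotlib is assumed to be available):

import numpy as np
import matplotlib.pyplot as plt

length = 20
sample_rate = 30                      # 30 Hz
in_signal_length = 20 * sample_rate   # 20 s signal
x = np.linspace(0, length, in_signal_length)

y = 3 * np.ones(in_signal_length)         # desired (flat) signal
in_signal = 0.5 * (np.sign(x - 5) + 9)    # input step signal

plt.plot(x, in_signal, label='input signal')
plt.plot(x, y, label='desired signal')
plt.xlabel('time [s]')
plt.ylabel('amplitude')
plt.legend()
plt.show()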

The step function calls _set_action, which performs achieved_signal = convolution(input_signal, low_pass_filter) - offset; the low-pass filter takes a cutoff frequency as input as well. The cutoff frequency and the offset are the parameters that act on the observation to obtain the output signal. The designed reward function returns the negative of the frame-wise L2-norm between the input signal and the desired signal, to penalize a large norm.

Here is the environment I created:

import numpy as np
import gym
from gym import spaces
from scipy import signal

def butter_lowpass(cutoff, nyq_freq, order=4):
    normal_cutoff = float(cutoff) / nyq_freq
    b, a = signal.butter(order, normal_cutoff, btype='lowpass')
    return b, a

def butter_lowpass_filter(data, cutoff_freq, nyq_freq, order=4):
    b, a = butter_lowpass(cutoff_freq, nyq_freq, order=order)
    y = signal.filtfilt(b, a, data)
    return y

class StepSignal(gym.GoalEnv):

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignal, self).__init__()

        self.initial_signal = input_signal
        self.signal = self.initial_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        self.distance_threshold = 10e-1

        max_offset = abs(max( max(self.desired_signal) , max(self.signal))
                 - min( min(self.desired_signal) , min(self.signal)) )

        self.action_space = spaces.Box(low=np.array([10e-4, -max_offset]),
                                       high=np.array([self.sample_rate/2 - 0.1, max_offset]),
                                       dtype=np.float16)

        obs = self._get_obs()
        self.observation_space = spaces.Dict(dict(
            desired_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            achieved_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            observation=spaces.Box(-np.inf, np.inf, shape=obs['observation'].shape, dtype='float32'),
        ))

    def step(self, action):
        range = self.action_space.high - self.action_space.low
        action = range / 2 * (action + 1)
        self._set_action(action)
        obs = self._get_obs()
        done = False

        info = {
                'is_success': self._is_success(obs['achieved_goal'], self.desired_signal),
               }
        reward = -self.compute_reward(obs['achieved_goal'],self.desired_signal)
        return obs, reward, done, info

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self._get_obs()


    def _set_action(self, actions):
        actions = np.clip(actions,a_max=self.action_space.high,a_min=self.action_space.low)
        cutoff = actions[0]
        offset = actions[1]
        print(cutoff, offset)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset

    def _get_obs(self):
        obs = self.signal
        achieved_goal = self.signal
        return {
        'observation': obs.copy(),
        'achieved_goal': achieved_goal.copy(),
        'desired_goal': self.desired_signal.copy(),
        }

    def compute_reward(self, goal_achieved, goal_desired):
        d = np.linalg.norm(goal_desired-goal_achieved)
        return d


    def _is_success(self, achieved_goal, desired_goal):
        d = self.compute_reward(achieved_goal, desired_goal)
        return (d < self.distance_threshold).astype(np.float32)
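
Before training, the environment can be sanity-checked by stepping it by hand. A minimal sketch, assuming StepSignal is imported directly (gym.make would require the environment to be registered first) and reusing the signal definitions from below:

import numpy as np

sample_rate = 30
x = np.linspace(0, 20, 20 * sample_rate)
in_signal = 0.5 * (np.sign(x - 5) + 9)   # input step signal
y = 3 * np.ones(20 * sample_rate)        # desired flat signal

env = StepSignal(input_signal=in_signal, sample_rate=sample_rate, desired_signal=y)
obs = env.reset()

# Actions are expected in [-1, 1] and rescaled inside step()
action = np.array([0.0, 0.0])
obs, reward, done, info = env.step(action)
print(reward, info['is_success'])   # reward should be negative (minus the L2-norm)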

The environment can then be instantiated into a variable and flattened through a FlattenDictWrapper, as advised at https://openai.com/blog/ingredients-for-robotics-research/ (end of the page).

length = 20
sample_rate = 30 # 30 Hz
in_signal_length = 20*sample_rate # 20sec signal
x = np.linspace(0, length, in_signal_length)

# Desired output
y = 3*np.ones(in_signal_length)
# Step signal
in_signal = 0.5*(np.sign(x-5)+9)

env = gym.make('stepsignal-v0', input_signal=in_signal, sample_rate=sample_rate, desired_signal=y)
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=['observation','desired_goal'])
env.reset()
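
To see the shapes the networks below will receive, one can inspect the wrapped environment (the numbers assume the 600-sample signals defined above, i.e. 20 s at 30 Hz):

print(env.observation_space.shape)           # (1200,): 600 'observation' + 600 'desired_goal' samples
print((1,) + env.observation_space.shape)    # (1, 1200): the input shape used for the actor/critic below
print(env.action_space.shape)                # (2,): cutoff frequency and offset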

The agent is a DDPG agent from keras-rl, since the actions can take any value in the continuous action_space described in the environment. I would like to know why the actor and critic nets need an input with an additional dimension, in input_shape=(1,) + env.observation_space.shape:

from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, Concatenate
from keras.optimizers import Adam
from rl.agents import DDPGAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy
from rl.random import OrnsteinUhlenbeckProcess

nb_actions = env.action_space.shape[0]

# Building Actor agent (Policy-net)
actor = Sequential()
actor.add(Flatten(input_shape=(1,) + env.observation_space.shape, name='flatten'))
actor.add(Dense(128))
actor.add(Activation('relu'))
actor.add(Dense(64))
actor.add(Activation('relu'))
actor.add(Dense(nb_actions))
actor.add(Activation('linear'))
actor.summary()

# Building Critic net (Q-net)
action_input = Input(shape=(nb_actions,), name='action_input')
observation_input = Input(shape=(1,) + env.observation_space.shape, name='observation_input')
flattened_observation = Flatten()(observation_input)
x = Concatenate()([action_input, flattened_observation])
x = Dense(128)(x)
x = Activation('relu')(x)
x = Dense(64)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(inputs=[action_input, observation_input], outputs=x)
critic.summary()

# Building Keras agent
memory = SequentialMemory(limit=2000, window_length=1)
policy = BoltzmannQPolicy()
random_process = OrnsteinUhlenbeckProcess(size=nb_actions, theta=0.6, mu=0, sigma=0.3)
agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic, critic_action_input=action_input,
                  memory=memory, nb_steps_warmup_critic=2000, nb_steps_warmup_actor=10000,
                  random_process=random_process, gamma=.99, target_model_update=1e-3)
agent.compile(Adam(lr=1e-3, clipnorm=1.), metrics=['mae'])

Finally, the agent is trained:

import pickle

filename = 'mem20k_heaviside_flattening'
hist = agent.fit(env, nb_steps=10, visualize=False, verbose=2, nb_max_episode_steps=5)
with open('./history_dqn_test_'+ filename + '.pickle', 'wb') as handle:
        pickle.dump(hist.history, handle, protocol=pickle.HIGHEST_PROTOCOL)
        agent.save_weights('h5f_files/dqn_{}_weights.h5f'.format(filename), overwrite=True)

Now here is the catch: for the same instance of my env, the agent always seems to get stuck around the same output value across all episodes:

image taken from https://imgur.com/kaKhZNF

The cumulated reward is negative because I only allow the agent to receive negative rewards. I modelled my code on the example at https://github.com/openai/gym/blob/master/gym/envs/robotics/fetch_env.py, which is part of the OpenAI codebase. Within one episode, I should get a variety of actions converging towards a (cutoff_final, offset_final) pair that brings my input step signal close to my desired flat signal, which is clearly not the case. In addition, I thought I should get different actions across successive episodes.

1 Answer:

Answer 0 (score: 0)

"I would like to know why the actor and critic nets need an input with an additional dimension, in input_shape=(1,) + env.observation_space.shape"

I think GoalEnv was designed with HER (Hindsight Experience Replay) in mind, since it uses the "sub-spaces" inside observation_space to learn from a sparse reward signal (there is a paper on OpenAI's website that explains how HER works). I have not looked at the implementation, but my guess is that the extra input is needed because HER also processes the "goal" parameter.
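
To make the "goal" remark concrete, here is a minimal sketch of HER-style relabelling, assuming the standard GoalEnv signature compute_reward(achieved_goal, desired_goal, info) (note that the environment in the question omits the info argument):

def her_relabel(env, transition):
    """Pretend the goal that was actually achieved was the desired goal,
    then recompute the reward for that substituted goal."""
    obs, action, next_obs = transition
    new_goal = next_obs['achieved_goal']
    reward = env.compute_reward(next_obs['achieved_goal'], new_goal, {})
    obs = dict(obs, desired_goal=new_goal)
    next_obs = dict(next_obs, desired_goal=new_goal)
    return obs, action, reward, next_obs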

Since it seems you are not using HER (which works with any off-policy algorithm, including DQN, DDPG, etc.), you should hand-craft an informative reward function (rather than a binary one, e.g. 1 if the goal is reached, 0 otherwise) and use the base Env class. The reward should be computed inside the step method, because the reward in an MDP is a function like r(s, a, s'), and there you will probably have all the information you need. Hope it helps.
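
A minimal sketch of that suggestion, assuming the same signals and the butter_lowpass_filter helper from the question: a plain gym.Env whose step method computes the dense (negative L2-norm) reward directly:

import numpy as np
import gym
from gym import spaces

class StepSignalDense(gym.Env):
    """Hypothetical reworking of StepSignal on the base Env class."""

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignalDense, self).__init__()
        self.initial_signal = input_signal
        self.signal = input_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal

        max_offset = abs(max(max(desired_signal), max(input_signal))
                         - min(min(desired_signal), min(input_signal)))
        self.action_space = spaces.Box(low=np.array([1e-3, -max_offset]),
                                       high=np.array([sample_rate / 2 - 0.1, max_offset]),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=input_signal.shape, dtype=np.float32)

    def step(self, action):
        action = np.clip(action, self.action_space.low, self.action_space.high)
        cutoff, offset = action
        # Same processing as in the question: low-pass filter, then subtract an offset
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate / 2) - offset
        # Dense reward computed here, in step(): the closer to the desired signal, the higher
        reward = -np.linalg.norm(self.desired_signal - self.signal)
        done = False
        return self.signal.copy(), reward, done, {}

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self.signal.copy()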