I am new to reinforcement learning, and I would like to use the technique on audio signals. I built a basic step signal that I hope an agent can learn to flatten, using OpenAI Gym and reinforcement learning. For this I use the `GoalEnv` provided by OpenAI Gym, since I know the goal is a flat signal. Here is a plot of the input signal and the desired signal:
The step function calls `_set_action`, which performs `achieved_signal = convolution(input_signal, low_pass_filter) - offset`, where the low-pass filter also takes a cutoff frequency as input. The cutoff frequency and the offset are the parameters that act on the observation to produce the output signal. The reward function returns the negative of the L2-norm between the achieved signal and the desired signal, so that large distances are penalized. Here is the environment I created:
from scipy import signal
import numpy as np
import gym
from gym import spaces


def butter_lowpass(cutoff, nyq_freq, order=4):
    normal_cutoff = float(cutoff) / nyq_freq
    b, a = signal.butter(order, normal_cutoff, btype='lowpass')
    return b, a

def butter_lowpass_filter(data, cutoff_freq, nyq_freq, order=4):
    b, a = butter_lowpass(cutoff_freq, nyq_freq, order=order)
    y = signal.filtfilt(b, a, data)
    return y

class StepSignal(gym.GoalEnv):

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignal, self).__init__()
        self.initial_signal = input_signal
        self.signal = self.initial_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        self.distance_threshold = 10e-1

        max_offset = abs(max(max(self.desired_signal), max(self.signal))
                         - min(min(self.desired_signal), min(self.signal)))

        # Actions are (cutoff frequency, offset)
        self.action_space = spaces.Box(low=np.array([10e-4, -max_offset]),
                                       high=np.array([self.sample_rate/2 - 0.1, max_offset]),
                                       dtype=np.float16)

        obs = self._get_obs()
        self.observation_space = spaces.Dict(dict(
            desired_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            achieved_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            observation=spaces.Box(-np.inf, np.inf, shape=obs['observation'].shape, dtype='float32'),
        ))

    def step(self, action):
        # Rescale the raw action, assumed to lie in [-1, 1], by the action-space range
        range = self.action_space.high - self.action_space.low
        action = range / 2 * (action + 1)
        self._set_action(action)
        obs = self._get_obs()
        done = False

        info = {
            'is_success': self._is_success(obs['achieved_goal'], self.desired_signal),
        }
        reward = -self.compute_reward(obs['achieved_goal'], self.desired_signal)
        return obs, reward, done, info

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self._get_obs()

    def _set_action(self, actions):
        actions = np.clip(actions, a_max=self.action_space.high, a_min=self.action_space.low)
        cutoff = actions[0]
        offset = actions[1]
        print(cutoff, offset)
        # Filter the current signal and subtract the offset
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset

    def _get_obs(self):
        obs = self.signal
        achieved_goal = self.signal
        return {
            'observation': obs.copy(),
            'achieved_goal': achieved_goal.copy(),
            'desired_goal': self.desired_signal.copy(),
        }

    def compute_reward(self, goal_achieved, goal_desired):
        # L2 distance between the achieved and the desired signal
        d = np.linalg.norm(goal_desired - goal_achieved)
        return d

    def _is_success(self, achieved_goal, desired_goal):
        d = self.compute_reward(achieved_goal, desired_goal)
        return (d < self.distance_threshold).astype(np.float32)
The environment can then be instantiated and flattened through `FlattenDictWrapper`, as recommended at the end of https://openai.com/blog/ingredients-for-robotics-research/:
length = 20
sample_rate = 30 # 30 Hz
in_signal_length = 20*sample_rate # 20sec signal
x = np.linspace(0, length, in_signal_length)
# Desired output
y = 3*np.ones(in_signal_length)
# Step signal
in_signal = 0.5*(np.sign(x-5)+9)
env = gym.make('stepsignal-v0', input_signal=in_signal, sample_rate=sample_rate, desired_signal=y)
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=['observation','desired_goal'])
env.reset()
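Note that `gym.make('stepsignal-v0', ...)` only finds the class if the environment id has been registered beforehand. A minimal sketch of that registration step, where the `entry_point` module path and the `max_episode_steps` value are assumptions about how the project is laid out:

from gym.envs.registration import register

register(
    id='stepsignal-v0',
    # Hypothetical module path: point it at wherever StepSignal is defined.
    entry_point='step_signal_env:StepSignal',
    max_episode_steps=5,
)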
The agent is the DDPG agent from keras-rl, since the actions can take any value in the continuous action_space defined in the environment. I wonder why the actor and critic nets need an input with an extra dimension, of shape input_shape=(1,) + env.observation_space.shape.
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, Concatenate
from keras.optimizers import Adam
from rl.agents import DDPGAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy
from rl.random import OrnsteinUhlenbeckProcess

nb_actions = env.action_space.shape[0]
# Building Actor agent (Policy-net)
actor = Sequential()
actor.add(Flatten(input_shape=(1,) + env.observation_space.shape, name='flatten'))
actor.add(Dense(128))
actor.add(Activation('relu'))
actor.add(Dense(64))
actor.add(Activation('relu'))
actor.add(Dense(nb_actions))
actor.add(Activation('linear'))
actor.summary()
# Building Critic net (Q-net)
action_input = Input(shape=(nb_actions,), name='action_input')
observation_input = Input(shape=(1,) + env.observation_space.shape, name='observation_input')
flattened_observation = Flatten()(observation_input)
x = Concatenate()([action_input, flattened_observation])
x = Dense(128)(x)
x = Activation('relu')(x)
x = Dense(64)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(inputs=[action_input, observation_input], outputs=x)
critic.summary()
# Building Keras agent
memory = SequentialMemory(limit=2000, window_length=1)
policy = BoltzmannQPolicy()
random_process = OrnsteinUhlenbeckProcess(size=nb_actions, theta=0.6, mu=0, sigma=0.3)
agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic, critic_action_input=action_input,
memory=memory, nb_steps_warmup_critic=2000, nb_steps_warmup_actor=10000,
random_process=random_process, gamma=.99, target_model_update=1e-3)
agent.compile(Adam(lr=1e-3, clipnorm=1.), metrics=['mae'])
Finally, the agent is trained:
import pickle

filename = 'mem20k_heaviside_flattening'
hist = agent.fit(env, nb_steps=10, visualize=False, verbose=2, nb_max_episode_steps=5)
with open('./history_dqn_test_'+ filename + '.pickle', 'wb') as handle:
pickle.dump(hist.history, handle, protocol=pickle.HIGHEST_PROTOCOL)
agent.save_weights('h5f_files/dqn_{}_weights.h5f'.format(filename), overwrite=True)
Now here is the catch: for the same instance of my env, the agent always seems to hover around the same output values across all episodes:

The cumulative reward is negative, since I only allow the agent to receive negative rewards. I used the example from https://github.com/openai/gym/blob/master/gym/envs/robotics/fetch_env.py, which is part of the OpenAI code. Within one episode, I would expect a variety of action sets converging towards a (cutoff_final, offset_final) that brings my input step signal close to my desired flat signal, which is clearly not the case. I also thought that successive episodes should produce different actions.
Answer 0 (score: 0)
"I wonder why the actor and critic nets need an input with an additional dimension, of shape input_shape=(1,) + env.observation_space.shape."
I think GoalEnv was designed with HER (Hindsight Experience Replay) in mind, since it uses the "sub-spaces" inside observation_space to learn from a sparse reward signal (there is a paper on the OpenAI site that explains how HER works). I haven't looked at the implementation, but my guess is that the extra input is needed because HER also processes the "goal" parameter.
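To make the HER idea a bit more concrete, here is a minimal, hypothetical sketch of the relabelling step; compute_reward stands in for the environment's reward function, and the transition/goal bookkeeping names are made up for illustration:

def her_relabel(transition, future_achieved_goal, compute_reward):
    """Hindsight relabelling: pretend the goal that was actually achieved
    later in the episode was the desired goal all along, and recompute
    the reward for that substituted goal."""
    obs, action, next_obs = transition['obs'], transition['action'], transition['next_obs']
    new_goal = future_achieved_goal  # achieved_goal taken from a later step of the same episode
    new_reward = compute_reward(next_obs['achieved_goal'], new_goal)
    return {
        'obs': {**obs, 'desired_goal': new_goal},
        'action': action,
        'next_obs': {**next_obs, 'desired_goal': new_goal},
        'reward': new_reward,
    }

An off-policy algorithm can then learn from both the original and the relabelled transitions, which is what makes very sparse rewards workable.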
Since it seems you are not using HER (it works with any off-policy algorithm, including DQN, DDPG, etc.), you should hand-craft an informative reward function (a reward that is not binary, e.g. 1 if the goal is reached and 0 otherwise) and use the base Env class. The reward should be computed inside the step method, since rewards in an MDP are functions of the form r(s, a, s'); there you will probably have all the information you need. Hope it helps.
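As an illustration of that last suggestion, here is a minimal sketch, not a tested implementation: it reuses the butter_lowpass_filter from the question and computes a dense reward directly inside step; the class name and the exact observation layout are assumptions.

import numpy as np
import gym
from gym import spaces

class StepSignalDense(gym.Env):
    """Sketch: plain gym.Env with a dense reward computed inside step()."""

    def __init__(self, input_signal, sample_rate, desired_signal):
        super().__init__()
        self.initial_signal = input_signal
        self.signal = input_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        max_offset = abs(max(desired_signal.max(), input_signal.max())
                         - min(desired_signal.min(), input_signal.min()))
        self.action_space = spaces.Box(low=np.array([10e-4, -max_offset]),
                                       high=np.array([sample_rate/2 - 0.1, max_offset]),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=input_signal.shape, dtype=np.float32)

    def step(self, action):
        cutoff, offset = np.clip(action, self.action_space.low, self.action_space.high)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset
        # Dense reward: negative L2 distance to the desired signal, computed right here in step()
        reward = -np.linalg.norm(self.desired_signal - self.signal)
        done = False
        return self.signal.copy(), reward, done, {}

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self.signal.copy()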