I'm building an image captioning system in Python with Keras. When I use argmax search I get reasonable results (a Bleu_1 score of ~0.58 and quite diverse sentences).
However, when I try beam search, I get almost exactly the same sentence for every image.
I have the following code for generating the captions:
# create an array of captions for a chunk of images; first token
# of each caption is the start token
test_x = np.zeros((chunk_size, self.max_len - 1), dtype=np.int64)
test_x[:, 0] = self.start_idx + 1
# probability of each caption is 1
captions_probs = np.ones(chunk_size)
# for every image, maintain a heap with the best captions
self.best_captions = [FixedCapacityMaxHeap(20) for i in range(chunk_size)]
# call beam search using the current cnn features
self.beam_search(cnn_feats, test_x, captions_probs, 0, beam_size)
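For comparison, here is a minimal sketch of the argmax (greedy) decoding that produces the good captions; it assumes the same self.model, self.max_len and self.start_idx as in the setup above, and is illustrative rather than the original code:

def greedy_search(self, cnn_feats, chunk_size):
    # start every caption with the start token, as in the setup above
    captions = np.zeros((chunk_size, self.max_len - 1), dtype=np.int64)
    captions[:, 0] = self.start_idx + 1
    for t in range(self.max_len - 2):
        # distribution over the next token for each image
        pred = self.model.predict(x=[cnn_feats, captions],
                                  batch_size=128,
                                  verbose=0)[:, t + 1, :]
        # greedily keep only the single most probable token
        captions[:, t + 1] = np.argmax(pred, axis=1)
    return captions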
The beam search method is the following:
def beam_search(self, cnn_feats, generated_captions, captions_probs, t, beam_size):
    # base case: the generated captions have max_len length, so we can
    # remove the (zero) pad at the end and, for each image, insert the
    # generated caption and its probability into the heap of best captions
    if t == self.max_len - 1:
        for i in range(len(generated_captions)):
            caption = self.remove_zero_pad(list(generated_captions[i]))
            self.best_captions[i].push(list(caption), captions_probs[i])
    else:
        # otherwise, make a prediction (we only keep the element at time
        # step t + 1, as the LSTM has a many-to-many architecture, but we
        # are only interested in the next token for each image)
        pred = self.model.predict(x=[cnn_feats, generated_captions],
                                  batch_size=128,
                                  verbose=1)[:, t + 1, :]
        # efficiently get the indices of the beam_size most probable tokens
        # for each image (they are not necessarily sorted)
        top_idx = np.argpartition(-pred, range(beam_size), axis=1)[:, :beam_size]
        # store the probabilities of those tokens
        top_probs = pred[np.arange(top_idx.shape[0])[:, None], top_idx]
        # for every 'neighbour' (set of newly generated tokens for every image),
        # add the tokens to the current captions, update the caption probabilities
        # by multiplying them with the probabilities of the new tokens, and
        # recursively call beam_search
        for i in range(beam_size):
            curr_idx = top_idx[:, i]
            generated_captions[:, t + 1] = curr_idx
            curr_captions_probs = top_probs[:, i] * captions_probs
            self.beam_search(cnn_feats, generated_captions, curr_captions_probs, t + 1, beam_size)
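One numerical note: multiplying raw probabilities, as curr_captions_probs does above, underflows toward zero for long captions; beam search scores are usually accumulated as summed log-probabilities instead (the answer below uses np.log for exactly this). A minimal, self-contained sketch of the same top-k step with log-space scores; the shapes and names here are illustrative:

import numpy as np

# toy shapes: 3 images, vocabulary of 50 tokens, beam width 2
rng = np.random.default_rng(0)
pred = rng.random((3, 50))
pred /= pred.sum(axis=1, keepdims=True)    # fake softmax output
caption_scores = np.zeros(3)               # log(1) = 0 for each image
top_idx = np.argpartition(-pred, range(2), axis=1)[:, :2]
top_log_probs = np.log(pred[np.arange(3)[:, None], top_idx])
# score of extending each caption with its i-th best token: add, don't multiply
new_scores = top_log_probs[:, 0] + caption_scores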
The FixedCapacityMaxHeap I'm using is:
class FixedCapacityMaxHeap(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.h = []

    def push(self, value, priority):
        if len(self.h) < self.capacity:
            heapq.heappush(self.h, (priority, value))
        else:
            # at capacity: push, then drop the lowest-priority entry
            heapq.heappushpop(self.h, (priority, value))

    def pop(self):
        if len(self.h) > 0:
            return heapq.nlargest(1, self.h)[0]
        else:
            return None
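A quick usage example (the captions and probabilities are hypothetical), showing that the heap keeps the capacity highest-priority entries:

h = FixedCapacityMaxHeap(2)
h.push(['a', 'dog'], 0.3)
h.push(['a', 'cat'], 0.5)
h.push(['a', 'car'], 0.1)   # dropped: lower priority than both stored entries
print(h.pop())              # (0.5, ['a', 'cat']), the best caption so far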
The problem is that the captions generated with beam search are nearly identical for every image (e.g. 'zoom a', 'zoom a is', 'zoom a is ...'), while the argmax version (which simply takes the highest-probability token at each time step) actually produces good captions. I've been stuck on this for a long time. I've tried a different implementation (one beam_search call per image instead of computing the captions for all images at once), and I've also tried a softmax temperature parameter (which controls how confident the LSTM is in its predictions), but none of this seems to solve the problem, so any ideas are appreciated.
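For reference, the temperature trick mentioned above is usually applied to the predicted distribution before picking tokens; a minimal sketch (the function name is mine, and pred is one row of softmax output):

import numpy as np

def apply_temperature(pred, temperature=1.0):
    # temperature < 1 sharpens the distribution (more confident),
    # temperature > 1 flattens it (more diverse tokens survive)
    logits = np.log(pred + 1e-12) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()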
Answer 0 (score: 2):
I did this implementation a long time ago, but I hope it helps. It's not recursive:
https://github.com/mmehdig/lm_beam_search/blob/master/beam_search.py
def search(model, src_input, k=1, sequence_max_len=25):
    # (log(1), initialize_of_zeros)
    k_beam = [(0, [0] * (sequence_max_len + 1))]

    # l : point on target sentence to predict
    for l in range(sequence_max_len):
        all_k_beams = []
        for prob, sent_predict in k_beam:
            predicted = model.predict([np.array([src_input]), np.array([sent_predict])])[0]
            # top k!
            possible_k = predicted[l].argsort()[-k:][::-1]

            # add to all possible candidates for k-beams
            all_k_beams += [
                (
                    sum(np.log(predicted[i][sent_predict[i + 1]]) for i in range(l)) + np.log(predicted[l][next_wid]),
                    list(sent_predict[:l + 1]) + [next_wid] + [0] * (sequence_max_len - l - 1)
                )
                for next_wid in possible_k
            ]

        # top k
        k_beam = sorted(all_k_beams)[-k:]

    return k_beam
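A hypothetical way to call it, with a stub standing in for the real Keras model (everything here except search itself is illustrative):

import numpy as np

class DummyModel(object):
    # stub with the same predict interface: per-time-step distributions
    # over a vocabulary of 10 tokens, for sequence_max_len + 1 = 26 steps
    def predict(self, inputs):
        return np.full((1, 26, 10), 0.1)

k_beam = search(DummyModel(), src_input=np.zeros(512), k=3)
for log_prob, sentence in reversed(k_beam):   # best-scoring candidate first
    print(log_prob, sentence)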