I was examining an implementation of the skip-gram model in TensorFlow that uses the movie dataset, and I came across this function:
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    # Fill up data batch
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        # select random sentence to start
        rand_sentence = np.random.choice(sentences)
        # Generate consecutive windows to look at
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
        # Pull out center word of interest for each window and create a tuple for each window
        if method=='skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
        elif method=='cbow':
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (surrounding word, target word)
            tuple_data = [(x_, y) for x,y in batch_and_labels for x_ in x]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))
        # extract batch and labels
        batch, labels = [list(x) for x in zip(*tuple_data)]
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))
    return(batch_data, label_data)
I have been trying to understand it for several days but still haven't figured it out. If you want the wider context, the whole code is here.
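In the code, the 10,000 most frequent words are mapped to numeric IDs, and we pass the sentences in that numeric form to the function above. Since this is a skip-gram model, we have to look at neighboring words. But how does the algorithm accomplish that? Won't
window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
create windows of words that are neighbors in frequency rather than neighbors in the sentence?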
I would appreciate some clarification here.
Thanks a lot!
Answer 0 (score: 1)
Consider the following sentence as a list of tokens:
sentence = ["the","book","is","on","the","table"]
and consider a window_size of 3.
The code that builds window_sequences can be rewritten like this:
window_sequences = []
for ix in range(len(sentence)):
    x = sentence[ix]  # the ix-th word of the sentence
    from_index = max(ix - window_size, 0)  # first index of the window, clipped at 0
    to_index = ix + window_size + 1  # end index of the window (exclusive)
    window = sentence[from_index:to_index]  # the words of the sentence in that range
    window_sequences.append(window)
Now let's run this code for some values of ix:
ix=0, x="the", from_index=0, to_index=4, window = ["the", "book", "is", "on"]
ix=3, x="on", from_index=0, to_index=7, window = ["the", "book", "is", "on", "the", "table"]
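As you can see, it builds windows of words exactly as they appear, in order, in the original sentence. The slicing is by position in the sentence, not by the value of each element.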
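To see the whole step at once, here is a minimal sketch that applies the same comprehension from generate_batch_data to the token list, then derives the center-word positions and the skip-gram pairs. The names `pairs`, `window`, `c`, and `context` are only illustrative and do not appear in the original code.

```python
sentence = ["the", "book", "is", "on", "the", "table"]
window_size = 3

# Same slicing as in generate_batch_data, applied to the token list.
window_sequences = [
    sentence[max(ix - window_size, 0):ix + window_size + 1]
    for ix in range(len(sentence))
]

# Position of the center word inside each window: it equals ix while the
# window is clipped at the start of the sentence, and window_size afterwards.
label_indices = [min(ix, window_size) for ix in range(len(window_sequences))]

# Skip-gram pairs: (center word, one surrounding word).
pairs = [
    (window[c], context)
    for window, c in zip(window_sequences, label_indices)
    for context in window[:c] + window[c + 1:]
]

print(window_sequences[0])  # ['the', 'book', 'is', 'on']
print(pairs[:3])            # [('the', 'book'), ('the', 'is'), ('the', 'on')]
```

Every center word is paired only with words that sit within window_size positions of it in the sentence.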
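One thing that may trip you up when analyzing this code is that the words in each sentence have been replaced by numeric IDs, assigned so that the more frequent a word is, the lower its ID.
So the previous sentence would look like: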
sentence = [2,45,7,13,2,67]
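The IDs are not sorted by frequency; they keep the order the words have in the sentence. Only the surface form changes from string to int, so you can reason about the code on the string version just as well.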
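The following sketch illustrates that point under stated assumptions: the two-sentence `corpus` is a made-up toy example, and `word_to_id`, `id_to_word`, and the helper `windows` are hypothetical names, not the ones used in the linked code. It assigns IDs by frequency rank, encodes the sentence, and checks that the windows over IDs decode to the same windows as over strings.

```python
from collections import Counter

# Hypothetical toy corpus; the real code builds its vocabulary from the movie data.
corpus = [
    ["the", "book", "is", "on", "the", "table"],
    ["the", "table", "is", "big"],
]
window_size = 3

counts = Counter(word for sent in corpus for word in sent)
# most_common() orders by descending count, so frequent words get low IDs.
word_to_id = {word: i for i, (word, _) in enumerate(counts.most_common())}
id_to_word = {i: word for word, i in word_to_id.items()}

sentence = corpus[0]
encoded = [word_to_id[w] for w in sentence]  # order of the sentence is kept


def windows(seq, size):
    """Slice a window around every position, clipped at the sequence start."""
    return [seq[max(ix - size, 0):ix + size + 1] for ix in range(len(seq))]


# Decode the integer windows back to words and compare with the string windows.
decoded = [[id_to_word[i] for i in w] for w in windows(encoded, window_size)]
print(decoded == windows(sentence, window_size))  # True
```

The frequency ranking only decides which integer stands for which word; the windows are still cut by position in the sentence.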