Confused about generating batch data for the skip-gram model

Asked: 2018-06-24 07:56:51

Tags: python tensorflow machine-learning nlp

While going through an implementation of the skip-gram model in TensorFlow that uses a movie-review dataset, I came across this function:

import numpy as np

def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    # Fill up data batch
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        # select random sentence to start
        rand_sentence = np.random.choice(sentences)
        # Generate consecutive windows to look at
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]

        # Pull out center word of interest for each window and create a tuple for each window
        if method=='skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            # Make it into a big list of (target word, surrounding word) tuples
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
        elif method=='cbow':
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            # Make it into a big list of (surrounding word, target word) tuples
            tuple_data = [(x_, y) for x,y in batch_and_labels for x_ in x]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))

        # extract batch and labels
        batch, labels = [list(x) for x in zip(*tuple_data)]
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]

    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))

    return(batch_data, label_data)

I've been poring over this for days now, but I still haven't figured it out. If you want the bigger picture, the entire code is here

So in the code, the 10,000 most frequent words are mapped to numeric IDs, and we pass the sentences in this numeric form to the function above. Since this is a skip-gram model, we have to look at neighboring words. But how does the algorithm accomplish that? Won't window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)] create windows of words that are neighbors in frequency rather than neighbors in the actual sentence?
I'd appreciate some clarification here.

Thanks a lot!

1 Answer:

Answer 0 (score: 1)

Consider the following sentence as a list of tokens:

sentence = ["the","book","is","on","the","table"]

and consider a window_size of 3. The code that builds window_sequences can be rewritten like this:

for ix in range(len(sentence)):
    x = sentence[ix]  # this is the ix-th word of the sentence
    from_index = max(ix - window_size, 0)  # this is the initial index of the window
    to_index = ix + window_size + 1  # this is the final index of the window (exclusive)
    window = sentence[from_index:to_index]  # we pick the words of the sentence

Now let's run this code for a couple of values of ix:

ix=0, x="the", from_index=0, to_index=4, window = ["the", "book", "is", "on"]
ix=3, x="on", from_index=0, to_index=7, window = ["the", "book", "is", "on", "the", "table"]
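The loop above can be run as-is to check these windows (a minimal sketch using the same list-comprehension form as the original function):

```python
sentence = ["the", "book", "is", "on", "the", "table"]
window_size = 3

# One window per word; the left edge is clamped at 0 near the start of the sentence
window_sequences = [
    sentence[max(ix - window_size, 0):ix + window_size + 1]
    for ix, _ in enumerate(sentence)
]

print(window_sequences[0])  # → ['the', 'book', 'is', 'on']
print(window_sequences[3])  # → ['the', 'book', 'is', 'on', 'the', 'table']
```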

As you can see, it is building windows of words that really are contiguous pieces of the original sentence.
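For completeness, here is a sketch of how those windows then become (target, context) pairs in the skip-gram branch of the function, using the same label_indices trick (a smaller window_size of 2 is used here just to keep the output short):

```python
sentence = ["the", "book", "is", "on", "the", "table"]
window_size = 2

window_sequences = [
    sentence[max(ix - window_size, 0):ix + window_size + 1]
    for ix, _ in enumerate(sentence)
]
# Index of the center word inside each window: ix while the window is
# still clipped on the left, then always window_size
label_indices = [ix if ix < window_size else window_size
                 for ix, _ in enumerate(window_sequences)]

# (center word, list of surrounding words) for each window
batch_and_labels = [(x[y], x[:y] + x[y + 1:])
                    for x, y in zip(window_sequences, label_indices)]
# Flatten into (target, context) pairs
tuple_data = [(x, y_) for x, y in batch_and_labels for y_ in y]

print(tuple_data[:3])  # → [('the', 'book'), ('the', 'is'), ('book', 'the')]
```

Every pair is a word together with one of its true sentence neighbors.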

What may be confusing you as you analyze this code is that the words of the sentence have been replaced by numeric IDs, such that the more frequent a word is, the lower its ID.

So the sentence above would look like:

sentence = [2,45,7,13,2,67]

The IDs are not sorted by frequency here; they simply keep the order of the words in the sentence. Only the surface form has changed from string to int, so you can just as easily reason about the code using the string version.
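You can verify that the same windowing logic works unchanged on the numeric form, and that the windows contain sentence neighbors rather than frequency neighbors (a sketch using the example IDs above):

```python
sentence = [2, 45, 7, 13, 2, 67]  # "the book is on the table" as numeric IDs
window_size = 3

window_sequences = [
    sentence[max(ix - window_size, 0):ix + window_size + 1]
    for ix, _ in enumerate(sentence)
]

print(window_sequences[0])  # → [2, 45, 7, 13], i.e. "the book is on"
```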