How to efficiently extract data from a variable-length dictionary for an LSTM in Keras

Time: 2019-08-11 18:23:31

Tags: python-3.x dictionary keras lstm seq2seq

I have a dictionary like this:

dic_parsed_sentences = {'talk.religion': {'david': 1, 'joslin': 1, 'apolog': 5, 'jim': 1, 'meritt': 2},
 'talk.sport': {'sari': 1, 'basebal': 1, 'kolang': 5, 'footbal': 1, 'baba': 2},
 'talk.education': {'madrese': 1, 'kelas': 1, 'yahyah': 5, 'dars': 1},
 'talk.computer': {'net': 1, 'internet': 1},
 'talk.windows': {'copy': 1, 'right': 1}}

I would like the output to look like this:

sent_wids = [[1, 1, 5, 1, 2],
             [1, 1, 5, 1, 2]]

Next iteration:

sent_wids = [[1, 1, 5, 1]]

Next iteration:

sent_wids = [[1, 1],
             [1, 1]]

And the train_labels list:

train_labels = ['talk.religion', 'talk.sport', 'talk.education', 'talk.computer', 'talk.windows']

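Why do I show the desired output per iteration? Because I want to handle items of the same length together so that I can perform some operations on them. (I am actually preparing data to feed to an LSTM, so I want to group sentences of the same length into batches and use fit_generator.)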
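To make the batching goal concrete, here is a minimal sketch of a generator that hands out one same-length batch at a time. The name batch_generator and the toy batches are assumptions made for illustration; the arrays reuse the numbers from the desired output above rather than real word ids.

```python
import itertools
import numpy as np

def batch_generator(batches):
    """Yield (inputs, labels) pairs forever, one same-length batch at a time."""
    # Keras generators are expected to loop indefinitely; steps_per_epoch
    # tells fit_generator how many batches make up one epoch.
    while True:
        for sent_wids, labels in batches:
            yield sent_wids, labels

# Toy batches shaped like the desired output (illustrative values only).
batches = [
    (np.array([[1, 1, 5, 1, 2], [1, 1, 5, 1, 2]]), ['talk.religion', 'talk.sport']),
    (np.array([[1, 1, 5, 1]]), ['talk.education']),
    (np.array([[1, 1], [1, 1]]), ['talk.computer', 'talk.windows']),
]

gen = batch_generator(batches)
# Take four batches to show that the generator wraps around after the third.
first_four = list(itertools.islice(gen, 4))
shapes = [x.shape for x, _ in first_four]
```

With a real model, something like this would presumably be passed to fit_generator with steps_per_epoch=len(batches), after the string labels have been encoded numerically.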

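I have already done this, but it is very inefficient. Here is my solution. In count_list, I count the number of items that have the same length, so that I can later create NumPy arrays of the corresponding shape.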
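As a side note, the counting step on its own can be sketched with collections.Counter. This is only an illustration of what count_list holds for the example dictionary, not a replacement for the full solution; the names lengths and length_counts are introduced here for the sketch.

```python
from collections import Counter

dic_parsed_sentences = {
    'talk.religion': {'david': 1, 'joslin': 1, 'apolog': 5, 'jim': 1, 'meritt': 2},
    'talk.sport': {'sari': 1, 'basebal': 1, 'kolang': 5, 'footbal': 1, 'baba': 2},
    'talk.education': {'madrese': 1, 'kelas': 1, 'yahyah': 5, 'dars': 1},
    'talk.computer': {'net': 1, 'internet': 1},
    'talk.windows': {'copy': 1, 'right': 1},
}

# Number of words stored under each label.
lengths = [len(words) for words in dic_parsed_sentences.values()]
# Map each length to how many entries share it.
length_counts = Counter(lengths)
# Group sizes ordered from the longest sentences to the shortest.
count_list = [length_counts[n] for n in sorted(length_counts, reverse=True)]
```

For the example data this gives one count per distinct length, which is the number of rows each NumPy array needs.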

import itertools

import numpy as np

# Flatten each entry into a tuple: (label, (word, count), (word, count), ...).
flattened_val = [(k,) + tuple(dic.items()) for k, dic in dic_parsed_sentences.items()]
# Sort by tuple length so that entries of equal length are adjacent, longest first.
sorted_d_val = sorted(flattened_val, key=len, reverse=True)

# count_list[i] is the number of entries in the i-th group of equal length.
count_list = []
count = 1
for i in range(len(sorted_d_val)):
    if i != len(sorted_d_val) - 1:
        if len(sorted_d_val[i]) == len(sorted_d_val[i + 1]):
            count += 1
        else:
            count_list.append(count)
            count = 1
    else:
        count_list.append(count)

train_labels = []
for i_for_arr, (length, dics) in enumerate(itertools.groupby(sorted_d_val, len)):
    # One row per sentence; length - 1 columns because the first element is the label.
    sent_wids = np.zeros([count_list[i_for_arr], length - 1])
    for index_sentence, sentence in enumerate(dics):
        # The first element is the label; the rest are (word, count) pairs.
        label, *word_items = sentence
        train_labels.append(label)
        for index_word, item in enumerate(word_items):
            # lookup_word2id is not shown here; you can replace it with a
            # constant such as 1, since this part works fine.
            sent_wids[index_sentence, index_word] = lookup_word2id(item[0])
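For comparison, the same grouping can be sketched in a single pass without count_list, by collecting rows per length in a dictionary first. This is a rough sketch under assumptions: group_by_length is a name made up for illustration, and lookup_word2id is a hypothetical stand-in that returns the constant 1, since the real lookup is not shown.

```python
from collections import defaultdict

import numpy as np

def lookup_word2id(word):
    # Hypothetical stand-in: the real lookup is not shown in the question.
    return 1

def group_by_length(parsed):
    """Return {length: (sent_wids array, labels)} for entries of equal length."""
    groups = defaultdict(lambda: ([], []))
    for label, words in parsed.items():
        rows, labels = groups[len(words)]
        rows.append([lookup_word2id(w) for w in words])
        labels.append(label)
    # Longest sentences first, matching the ordering used above.
    ordered = sorted(groups.items(), key=lambda kv: kv[0], reverse=True)
    return {n: (np.array(rows), labels) for n, (rows, labels) in ordered}

dic_parsed_sentences = {
    'talk.religion': {'david': 1, 'joslin': 1, 'apolog': 5, 'jim': 1, 'meritt': 2},
    'talk.sport': {'sari': 1, 'basebal': 1, 'kolang': 5, 'footbal': 1, 'baba': 2},
    'talk.education': {'madrese': 1, 'kelas': 1, 'yahyah': 5, 'dars': 1},
    'talk.computer': {'net': 1, 'internet': 1},
    'talk.windows': {'copy': 1, 'right': 1},
}

batches = group_by_length(dic_parsed_sentences)
```

Because every row in a group has the same number of words, np.array builds a rectangular array directly, so the row count does not need to be known in advance.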

Is there a more efficient way to do this?

0 answers:

No answers