I have a dictionary like this:
dic_parsed_sentences = {
    'talk.religion': {'david': 1, 'joslin': 1, 'apolog': 5, 'jim': 1, 'meritt': 2},
    'talk.sport': {'sari': 1, 'basebal': 1, 'kolang': 5, 'footbal': 1, 'baba': 2},
    'talk.education': {'madrese': 1, 'kelas': 1, 'yahyah': 5, 'dars': 1},
    'talk.computer': {'net': 1, 'internet': 1},
    'talk.windows': {'copy': 1, 'right': 1},
}
I want the output to look like this on the first iteration:
sent_wids = [[1, 1, 5, 1, 2],
[1, 1, 5, 1, 2]]
On the next iteration:
sent_wids = [[1, 1, 5, 1]]
On the next iteration:
sent_wids = [[1, 1],
[1, 1]]
And the train_labels list:
train_labels = ['talk.religion', 'talk.sport', 'talk.education', 'talk.computer', 'talk.windows']
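Why do I want the output one iteration at a time? Because I want to handle items of the same length together so that I can perform some operations on them. (I am actually preparing data to feed to an LSTM, so I want to group sentences of the same length into batches and use fit_generator.)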
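I have already done this, but it is very inefficient. Here is my solution:
In count_list I count the number of items that share the same length, so that I can later create numpy arrays of the corresponding shape.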
import itertools
import numpy as np

# Each entry becomes (label, (word, count), (word, count), ...)
flattened_val = [(k,) + tuple(dic.items()) for k, dic in dic_parsed_sentences.items()]
sorted_d_val = sorted(flattened_val, key=len, reverse=True)

# Count how many consecutive entries share the same length
count_list = []
count = 1
for i in range(len(sorted_d_val)):
    if i != len(sorted_d_val) - 1:
        if len(sorted_d_val[i]) == len(sorted_d_val[i + 1]):
            count = count + 1
        else:
            count_list.append(count)
            count = 1
    else:
        count_list.append(count)

train_labels = []
i_for_arr = 0
for length, dics in itertools.groupby(sorted_d_val, len):
    # One row per sentence; length - 1 because the first element is the label
    sent_wids = np.zeros([count_list[i_for_arr], length - 1])
    for index_sentence, sentence in enumerate(dics):
        i_help = 0
        index_word = 0
        for j in sentence:
            if i_help == 0:
                train_labels.append(j)
            else:
                # As you do not have lookup_word2id, you can change it to a
                # constant number like 1, as this part is ok.
                sent_wids[index_sentence, index_word] = lookup_word2id(j[0])
                index_word = index_word + 1
            i_help = i_help + 1
    i_for_arr = i_for_arr + 1
Is there a more efficient way to do this?
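For reference, one possible direction is to bucket the entries by length in a single pass, which avoids the separate count_list step. This is only a minimal sketch, not a tuned solution. It assumes Python 3.7+ (so dictionaries keep insertion order) and that the numbers wanted in sent_wids are the dictionary values, as in the expected output above. The function name batches_by_length and its to_id parameter are hypothetical; to_id stands in for lookup_word2id.

```python
from collections import defaultdict


def batches_by_length(parsed, to_id=lambda word, count: count):
    """Yield (labels, rows) pairs, one per distinct sentence length.

    `to_id` is a hypothetical stand-in for lookup_word2id; by default it
    simply returns the count stored in the dictionary.
    """
    # Bucket the (label, words) pairs by number of words in one pass.
    buckets = defaultdict(list)
    for label, words in parsed.items():
        buckets[len(words)].append((label, words))

    # Emit the longest sentences first, matching the sorted order above.
    for length in sorted(buckets, reverse=True):
        group = buckets[length]
        labels = [label for label, _ in group]
        rows = [[to_id(w, c) for w, c in words.items()] for _, words in group]
        yield labels, rows
```

Each yielded rows value is a list of equal-length lists, so it can be wrapped with np.asarray(rows) if a numpy array is needed, and the labels from each batch can be appended to train_labels in order.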