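I have an original nested list of strings. I flattened this list so that I could label-encode every word in the original data. After label encoding, I zipped the labels back together with the words as a simple flat list of tuples. Now I want to convert this list of tuples back into the nested structure of the original list. Example below: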
original_data = [[['hey how are you?'], ['I am fine, and you?'], ['I am fine, too.']], [["My name is Jason, what's your name?"], ['My name is Tina.'], ['Nice to meet you.'], ['Nice to meet you, too,']]]
flat_words = ['hey', 'how', 'are', 'you?', 'I', 'am', 'fine,', 'and', 'you?', 'I', 'am', 'fine,', 'too.', 'My', 'name', 'is', 'Jason,', "what's", 'your', 'name?', 'My', 'name', 'is', 'Tina.', 'Nice', 'to', 'meet', 'you.', 'Nice', 'to', 'meet', 'you,', 'too,']
labels = [9, 10, 7, 21, 0, 5, 8, 6, 21, 0, 5, 8, 17, 2, 13, 11, 1, 18, 22, 14, 2, 13, 11, 4, 3, 15, 12, 20, 3, 15, 12, 19, 16]
flat_words_with_labels = [('hey', 9), ('how', 10), ('are', 7), ('you?', 21), ('I', 0), ('am', 5), ('fine,', 8), ('and', 6), ('you?', 21), ('I', 0), ('am', 5), ('fine,', 8), ('too.', 17), ('My', 2), ('name', 13), ('is', 11), ('Jason,', 1), ("what's", 18), ('your', 22), ('name?', 14), ('My', 2), ('name', 13), ('is', 11), ('Tina.', 4), ('Nice', 3), ('to', 15), ('meet', 12), ('you.', 20), ('Nice', 3), ('to', 15), ('meet', 12), ('you,', 19), ('too,', 16)]
What I want is:
final = [[[('hey', 9), ('how', 10), ('are', 7), ('you?', 21)], [('I', 0), ('am', 5), ('fine,', 8), ('and', 6), ('you?', 21)], [('I', 0), ('am', 5), ('fine,', 8), ('too.', 17)]], [[('My', 2), ('name', 13), ('is', 11), ('Jason,', 1), ("what's", 18), ('your', 22), ('name?', 14)], [('My', 2), ('name', 13), ('is', 11), ('Tina.', 4)], [('Nice', 3), ('to', 15), ('meet', 12), ('you.', 20)], [('Nice', 3), ('to', 15), ('meet', 12), ('you,', 19), ('too,', 16)]]]
Answer 0 (score: 1)
A one-liner will do it:
d = dict(flat_words_with_labels)
final = [[[(word, d[word]) for word in sentence[0].split()] for sentence in paragraph] for paragraph in original_data]
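The dictionary lookup relies on every word always receiving the same label, which holds for label encoding but not for position-dependent tags. A minimal sketch for checking that assumption before building d (the helper name is_consistent is hypothetical, not part of the answer):

```python
def is_consistent(pairs):
    """Return True if no word appears with two different labels."""
    seen = {}
    for word, label in pairs:
        # setdefault stores the first label seen for a word and
        # returns that stored label on every later occurrence.
        if seen.setdefault(word, label) != label:
            return False
    return True

pairs = [('I', 0), ('am', 5), ('fine,', 8), ('I', 0)]
print(is_consistent(pairs))               # True
print(is_consistent(pairs + [('I', 7)]))  # False
```

If the check fails, a plain dict would silently keep only the last label for the conflicting word.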
Answer 1 (score: 1)
Here is an approach that looks fairly clean and handles any level of nesting.
original_data = [[['hey how are you?'], ['I am fine, and you?'], ['I am fine, too.']], [["My name is Jason, what's your name?"], ['My name is Tina.'], ['Nice to meet you.'], ['Nice to meet you, too,']]]
flat_words = ['hey', 'how', 'are', 'you?', 'I', 'am', 'fine,', 'and', 'you?', 'I', 'am', 'fine,', 'too.', 'My', 'name', 'is', 'Jason,', "what's", 'your', 'name?', 'My', 'name', 'is', 'Tina.', 'Nice', 'to', 'meet', 'you.', 'Nice', 'to', 'meet', 'you,', 'too,']
labels = [9, 10, 7, 21, 0, 5, 8, 6, 21, 0, 5, 8, 17, 2, 13, 11, 1, 18, 22, 14, 2, 13, 11, 4, 3, 15, 12, 20, 3, 15, 12, 19, 16]
mapping = {word: label for word, label in zip(flat_words, labels)}
def replace(lst, mapping):
    """
    Recursively go through lst and replace every `word`
    with the word and its mapping: (`word`, mapping[`word`])
    Note: lst is modified in place.
    """
    for index, ele in enumerate(lst):
        if isinstance(ele, str):
            # Innermost level: swap the sentence string for its (word, label) tuples
            result = [(word, mapping[word]) for word in ele.split()]
            lst[:] = result
            break
        else:
            # Still a nested list: recurse one level deeper
            lst[index] = replace(ele, mapping)
    return lst
r = replace(original_data, mapping)
print(r)
Result:
[[[('hey', 9), ('how', 10), ('are', 7), ('you?', 21)], [('I', 0), ('am', 5), ('fine,', 8), ('and', 6), ('you?', 21)], [('I', 0), ('am', 5), ('fine,', 8), ('too.', 17)]], [[('My', 2), ('name', 13), ('is', 11), ('Jason,', 1), ("what's", 18), ('your', 22), ('name?', 14)], [('My', 2), ('name', 13), ('is', 11), ('Tina.', 4)], [('Nice', 3), ('to', 15), ('meet', 12), ('you.', 20)], [('Nice', 3), ('to', 15), ('meet', 12), ('you,', 19), ('too,', 16)]]]
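Note that replace rewrites original_data in place, so the original strings are gone after the call. If the input must stay untouched, a non-mutating variant might look like the sketch below. The name replace_copy is hypothetical, and the sketch assumes that every innermost list holds exactly one string, as in the question's data:

```python
def replace_copy(node, mapping):
    """Return a new nested list; each sentence becomes a list of (word, label)."""
    if isinstance(node, str):
        return [(word, mapping[word]) for word in node.split()]
    if len(node) == 1 and isinstance(node[0], str):
        # A one-string sentence list collapses into its list of tuples.
        return replace_copy(node[0], mapping)
    # Any other list: rebuild it from rebuilt children.
    return [replace_copy(child, mapping) for child in node]

data = [[['a b'], ['b c']], [['c']]]
mapping = {'a': 0, 'b': 1, 'c': 2}
result = replace_copy(data, mapping)
print(result)  # [[[('a', 0), ('b', 1)], [('b', 1), ('c', 2)]], [[('c', 2)]]]
print(data)    # [[['a b'], ['b c']], [['c']]]
```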
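Answer 2 (score: 0)
You can reuse the structure of original_data and turn labels into an iterator to construct final. I'm sure there is a more elegant solution out there, but something like this may work: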
labels_iter = iter(labels)
final = []
for convo in original_data:
    final.append([])
    for sent in convo:
        final[-1].append([])
        for word in sent[0].split(' '):
            final[-1][-1].append((word, next(labels_iter)))
final
Out:
[[[('hey', 9), ('how', 10), ('are', 7), ('you?', 21)],
[('I', 0), ('am', 5), ('fine,', 8), ('and', 6), ('you?', 21)],
[('I', 0), ('am', 5), ('fine,', 8), ('too.', 17)]],
[[('My', 2),
('name', 13),
('is', 11),
('Jason,', 1),
("what's", 18),
('your', 22),
('name?', 14)],
[('My', 2), ('name', 13), ('is', 11), ('Tina.', 4)],
[('Nice', 3), ('to', 15), ('meet', 12), ('you.', 20)],
[('Nice', 3), ('to', 15), ('meet', 12), ('you,', 19), ('too,', 16)]]]
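The same iterator idea can be wrapped in a function without the final[-1][-1] bookkeeping. This is only a sketch: the name attach_labels is hypothetical, and it assumes that each sentence is a one-element list holding a string and that labels has exactly one entry per word, in reading order:

```python
from itertools import islice

def attach_labels(data, labels):
    """Pair each word with the next unused label, keeping the nesting."""
    it = iter(labels)
    result = []
    for convo in data:
        new_convo = []
        for sent in convo:
            words = sent[0].split()
            # islice pulls exactly len(words) labels off the shared iterator
            new_convo.append(list(zip(words, islice(it, len(words)))))
        result.append(new_convo)
    return result

data = [[['a b'], ['b c']], [['c']]]
labels = [0, 1, 1, 2, 2]
print(attach_labels(data, labels))
# [[[('a', 0), ('b', 1)], [('b', 1), ('c', 2)]], [[('c', 2)]]]
```

Because labels are matched by position rather than looked up by word, this also works when the same word carries different labels in different places.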