My problem looks simple, but I can't find a clean (efficient) solution.
I have a list of tuples corresponding to common phrases:
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
And a sentence:
"i am a data scientist . i do machine learning and c + + but no deep learning . i like research and development"
I want to merge the common word groups into single tokens, giving:
"i am a data_scientist . i do machine_learning and c_+_+ but no deep_learning . i like research_and_development"
I'm sure there is an elegant way to do this, but I can't find one...
If there were only 2-tuples, iterating over zip(sentence, sentence[1:]) would do it (roughly as in the sketch below), but I have up to 8-tuples in ngrams, so that approach does not scale!
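For 2-tuples, a rough sketch of what I mean (my own reconstruction, splitting the sentence into tokens first and using a small bigram set):

bigrams = {("data", "scientist"), ("machine", "learning")}

tokens = "i am a data scientist .".split()
out, skip = [], False
for cur, nxt in zip(tokens, tokens[1:] + [""]):
    if skip:                    # second word of a merged pair, already emitted
        skip = False
        continue
    if (cur, nxt) in bigrams:   # merge the pair into one token
        out.append(cur + "_" + nxt)
        skip = True
    else:
        out.append(cur)
print(" ".join(out))            # i am a data_scientist .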
Answer 0: (score: 1)
You can build a list of replacement strings from the words in ngrams:
replace = [" ".join(x) for x in ngrams]
Then, for each element in that list, use str.replace:
for r in replace:
    sentence = sentence.replace(r, r.replace(" ", "_"))
There are probably more one-liner-ish ways to do this, but this seems relatively concise and easy to understand.
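For example, a rough one-liner along those lines using functools.reduce (reusing ngrams and sentence as defined in the question):

from functools import reduce

# ngrams and sentence as defined in the question
replace = [" ".join(x) for x in ngrams]
sentence = reduce(lambda s, r: s.replace(r, r.replace(" ", "_")), replace, sentence)
print(sentence)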
Answer 1: (score: 1)
Although Haldean Brown's answer is simpler, I think this is a more structured approach:
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
sent = """
i am a data scientist . i do machine learning and c + + but no deep
learning . i like research and development
"""
ngrams.sort(key=lambda x: -len(x))
tokens = sent.split()
out_ngrams = []
i_token = 0
while i_token < len(tokens):
    for ngram in ngrams:
        if ngram == tuple(tokens[i_token : i_token + len(ngram)]):
            i_token += len(ngram)
            out_ngrams.append(ngram)
            break
    else:
        out_ngrams.append((tokens[i_token],))
        i_token += 1
print(' '.join('_'.join(ngram) for ngram in out_ngrams))
Output:
i am a data_scientist . i do machine_learning and c_+_+ but no deep learning . i like research_and_development
ngrams after sorting:
[('c', '+', '+'),
('research', 'and', 'development'),
('data', 'scientist'),
('machine', 'learning'),
('c', '+'),
('+', '+'),
('research', 'and')]
The sort is needed so that ("c", "+", "+") is tried before ("c", "+") (or, in general, so that a sequence is tried before any of its prefixes). Actually, something non-greedy, where [('c', '+'), ('+', 'a')] would be preferred over [('c', '+', '+'), ('a',)], might be even better, but that is another story.
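To make the effect of the sort concrete, here is a small self-contained sketch (merge_ngrams is just an illustrative name, and the n-gram list is trimmed) that wraps the loop above into a helper and compares the sorted and unsorted orders:

ngrams = [("c", "+"), ("+", "+"), ("c", "+", "+")]

def merge_ngrams(tokens, ngrams):
    # Greedy left-to-right matching: at each position, take the first
    # n-gram (in the given order) that matches the upcoming tokens.
    out, i = [], 0
    while i < len(tokens):
        for ngram in ngrams:
            if ngram == tuple(tokens[i : i + len(ngram)]):
                out.append(ngram)
                i += len(ngram)
                break
        else:
            out.append((tokens[i],))
            i += 1
    return " ".join("_".join(ngram) for ngram in out)

tokens = "i do c + + daily".split()
print(merge_ngrams(tokens, sorted(ngrams, key=len, reverse=True)))  # i do c_+_+ daily
print(merge_ngrams(tokens, ngrams))  # i do c_+ + daily: ("c", "+") wins before ("c", "+", "+") is tried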
Answer 2: (score: 0)
s = '_'
seq = ("c", "+", "+")
print(s.join(seq))   # prints: c_+_+
More about the join method in the Python docs: https://docs.python.org/3/library/stdtypes.html?highlight=join#str.join