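Joining groups of words from a given list

Time: 2017-02-10 17:32:04

Tags: python

My problem looks simple, but I can't find a clean (and efficient) solution.

I have a list of tuples corresponding to common word groups: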

ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]

And a sentence:

"i am a data scientist . i do machine learning and c + + but no deep learning . i like research and development"

I want to merge each common word group into a single token, like this:

"i am a data_scientist . i do machine_learning and c_+_+ but no deep_learning . i like research_and_development"

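I'm sure there is an elegant way to do this, but I can't find one...

If there were only 2-tuples, iterating over zip(sentence, sentence[1:]) would do, but I have up to 8-tuples in ngrams, so that solution does not scale!

3 answers:

Answer 0: (score: 1)

You can build a list of replacement strings from the words in ngrams: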

replace = [" ".join(x) for x in ngrams]

Then, for each element of that list, use str.replace:

for r in replace:
    sentence = sentence.replace(r, r.replace(" ", "_"))

There may be a more one-liner way to do it, but this seems relatively concise and easy to understand.
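One caveat: str.replace works on raw substrings and its result depends on the order of the list, so "c +" would also match inside "music + art", and "research and" would block "research and development" if it happened to come first. As a possible variation (not part of the original answer), here is a minimal sketch that compiles all phrases into a single regular expression, longest first, and only matches on whitespace boundaries. The name merge_phrases is hypothetical.

```python
import re

ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
sentence = ("i am a data scientist . i do machine learning and c + + "
            "but no deep learning . i like research and development")


def merge_phrases(text, ngrams):
    """Hypothetical helper: join known phrases with "_" in one regex pass."""
    # Longest phrase first, so "c + +" is tried before its prefix "c +".
    phrases = sorted((" ".join(g) for g in ngrams), key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in phrases)
    # (?<!\S) and (?!\S) require whitespace (or the string edge) on both
    # sides, so a phrase never matches in the middle of a longer token.
    pattern = re.compile(r"(?<!\S)(?:" + alternation + r")(?!\S)")
    return pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)


merged = merge_phrases(sentence, ngrams)
print(merged)
```

With this sketch, "music + art" is left unchanged, whereas the plain str.replace loop would turn it into "music_+ art".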

Answer 1: (score: 1)

Although Haldean Brown's answer is simpler, I think this is a more structured approach:

ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
sent = """
    i am a data scientist . i do machine learning and c + + but no deep
    learning . i like research and development
"""

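# Sort longest-first so an n-gram is tried before any of its prefixes.
ngrams.sort(key=lambda x: -len(x))
tokens = sent.split()

out_ngrams = []
i_token = 0
while i_token < len(tokens):
    for ngram in ngrams:
        # Compare the n-gram with the window of tokens starting at i_token.
        if ngram == tuple(tokens[i_token : i_token + len(ngram)]):
            i_token += len(ngram)
            out_ngrams.append(ngram)
            break
    else:
        # No n-gram matched here: keep the token as a 1-tuple.
        out_ngrams.append((tokens[i_token],))
        i_token += 1

# Join each group with "_" and the groups with spaces.
print(' '.join('_'.join(ngram) for ngram in out_ngrams))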

Output:

i am a data_scientist . i do machine_learning and c_+_+ but no deep learning . i like research_and_development

After sorting, ngrams is:

[('c', '+', '+'),
 ('research', 'and', 'development'),
 ('data', 'scientist'),
 ('machine', 'learning'),
 ('c', '+'),
 ('+', '+'),
 ('research', 'and')]

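The sort is needed so that ("c", "+", "+") is tried before ("c", "+") (or, more generally, so that a sequence is tried before any of its prefixes). In practice, a non-greedy result such as [('c', '+'), ('+', 'a')] might be preferable to [('c', '+', '+'), ('a',)], but that is another story.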
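The same longest-match-first idea can also be written without scanning the whole ngrams list at every position. The following is only a sketch under my own assumptions, not code from the answer above: the n-grams are kept in a set, and at each position the window shrinks from the longest known length down to 2. The function name merge_ngrams and the sep parameter are hypothetical.

```python
def merge_ngrams(tokens, ngrams, sep="_"):
    """Hypothetical helper: greedy longest-match merge of known n-grams."""
    known = set(ngrams)
    # Longest n-gram length; 1 means there is nothing to merge.
    max_len = max((len(g) for g in known), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first, down to 2 tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(tokens[i:i + n])
            if window in known:
                out.append(sep.join(window))
                i += n
                break
        else:
            # No n-gram starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


ngrams = [("data", "scientist"), ("machine", "learning"),
          ("c", "+"), ("+", "+"), ("c", "+", "+"),
          ("research", "and", "development"), ("research", "and")]
sent = ("i am a data scientist . i do machine learning and c + + "
        "but no deep learning . i like research and development")
result = " ".join(merge_ngrams(sent.split(), ngrams))
print(result)
```

Each position then costs at most one set lookup per candidate length instead of one comparison per n-gram, which may matter when the list of n-grams is large.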

Answer 2: (score: 0)

s = ''
seq = ("c", "+", "+")
print(s.join(seq))
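Note that an empty separator glues the words together with nothing in between. For the underscore-separated tokens the question asks for, the separator would presumably be "_" instead. A minimal sketch comparing the two (the variable names are only for illustration):

```python
seq = ("c", "+", "+")

joined_plain = "".join(seq)    # empty separator, as in the snippet above
joined_token = "_".join(seq)   # underscore separator, as in the question

print(joined_plain)
print(joined_token)
```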

More about the join method: Python documentation

https://docs.python.org/3/library/stdtypes.html?highlight=join#str.join