Getting contiguous substrings in Python

Time: 2017-12-08 04:05:59

Tags: python

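Given an n-gram, I want to get its contiguous sub-phrases, both those that grow from the start toward the end and those that grow from the end toward the start.

For example, for the 4-gram "computer supported machine translation", I should get the following substrings.

  • From the start: "computer supported", "computer supported machine"
  • From the end: "machine translation", "supported machine translation"

For the 3-gram "natural language processing", I should get "natural language" and "language processing".

I have very large n-grams, so I am interested in the fastest way to do this.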

2 answers:

Answer 0: (score: 0)

You can split the n-gram into a list of grams, then join slices of that list (see Understanding Python's slice notation).
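A minimal sketch of the split-then-join approach, since the idea is easier to follow in code. The function name `contiguous_substrings` and its `min_len` parameter are illustrative assumptions, not part of the original answer; `min_len=2` mirrors the examples in the question, where single words and the full n-gram are left out.

```python
def contiguous_substrings(ngram, min_len=2):
    """Return (from_start, from_end) sub-phrases of a space-separated n-gram.

    Hypothetical helper: the name and min_len are assumptions for illustration.
    min_len should be at least 1, because words[-0:] would return every word.
    """
    words = ngram.split()
    # Lengths min_len .. len(words) - 1, so the full n-gram itself is excluded.
    lengths = range(min_len, len(words))
    from_start = [" ".join(words[:k]) for k in lengths]  # prefixes
    from_end = [" ".join(words[-k:]) for k in lengths]   # suffixes
    return from_start, from_end


print(contiguous_substrings("computer supported machine translation"))
print(contiguous_substrings("natural language processing"))
```

Each slice copies up to n words, so the total work grows roughly quadratically with the length of the n-gram. That cost is hard to avoid here, because the combined output itself is that large.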

Answer 1: (score: 0)

You should use a function and pass the n-gram to it as an argument.

Some code borrowed from @Galen:

def count_grams(gram, sentence):
    # Split the n-gram into its individual words.
    grams = sentence.split(" ")

    words = []
    # Prefixes: the first i words, for i = gram .. len(grams) - 1.
    for i in range(gram, len(grams)):
        words.append([" ".join(grams[:i])])
    # Suffixes: the last j words, for j = gram .. len(grams) - 1.
    for j in range(gram, len(grams)):
        words.append([" ".join(grams[-j:])])

    return words


print(count_grams(2, 'computer supported machine translation'))
print(count_grams(2, 'natural language processing'))

Output:

[['computer supported'], ['computer supported machine'], ['machine translation'], ['supported machine translation']]
[['natural language'], ['language processing']]

If you don't want each result wrapped in its own list, you can use " ".join() to turn it back into a plain string.
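As a small illustration of that last remark, the nested result can be flattened into plain strings with `" ".join()`. The variable names `nested` and `flat` are illustrative; `nested` is copied from the first output line above.

```python
# Nested result copied from the first output line of count_grams.
nested = [
    ['computer supported'],
    ['computer supported machine'],
    ['machine translation'],
    ['supported machine translation'],
]

# Each inner list holds a single string, so " ".join() simply unwraps it.
flat = [" ".join(item) for item in nested]
print(flat)
```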