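Given an n-gram, I want to get its contiguous substrings that are anchored at the start (growing toward the end) and those anchored at the end (growing toward the start).

For example, for the 4-gram computer supported machine translation, I should get the following substrings:

computer supported
computer supported machine
machine translation
supported machine translation

For the 3-gram natural language processing, I should get natural language and language processing.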
I have very large n-grams, so I am interested in the fastest way to do this!
Answer 0 (score: 0)
You can split() the n-gram into a list of grams and then join() slices of that list (see Understanding Python's slice notation).
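The split-then-join-slices idea can be sketched roughly as follows. This is a minimal illustration, not the answerer's original code: the function name edge_subgrams and the min_len parameter are hypothetical, and it assumes min_len is at least 1 and that words are separated by whitespace.

```python
def edge_subgrams(ngram, min_len=2):
    """Return the prefixes, then the suffixes, of an n-gram that contain
    at least min_len words and fewer words than the full n-gram.

    Hypothetical helper for illustration; assumes min_len >= 1.
    """
    grams = ngram.split()  # split on whitespace into a list of words
    sizes = range(min_len, len(grams))
    prefixes = [" ".join(grams[:k]) for k in sizes]   # slices anchored at the start
    suffixes = [" ".join(grams[-k:]) for k in sizes]  # slices anchored at the end
    return prefixes + suffixes


print(edge_subgrams("computer supported machine translation"))
print(edge_subgrams("natural language processing"))
```

For the 4-gram from the question this returns the four expected substrings as plain strings, and for the 3-gram it returns natural language and language processing.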
Answer 1 (score: 0)
You should use a function and pass the n-gram as an argument.
Some code borrowed from @Galen:
def count_grams(gram, sentence):
    # gram is the minimum number of words in a substring
    grams = sentence.split(" ")
    words = []
    # Substrings anchored at the start: the first i words
    for i in range(gram, len(grams)):
        start = []
        start.append(" ".join(grams[:i]))
        words.append(start)
    # Substrings anchored at the end: the last j words
    for j in range(gram, len(grams)):
        end = []
        end.append(" ".join(grams[-j:]))
        words.append(end)
    return words

print(count_grams(2, 'computer supported machine translation'))
print(count_grams(2, 'natural language processing'))
Output:
[['computer supported'], ['computer supported machine'], ['machine translation'], ['supported machine translation']]
[['natural language'], ['language processing']]
If you don't want each result wrapped in its own list, you can use " ".join() on it.
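As a small sketch of that last remark, the nested result can be unwrapped into plain strings like this. The variable names nested and flat are illustrative only, and the data is copied from the first output line shown above.

```python
# Nested result copied from the output shown above.
nested = [['computer supported'], ['computer supported machine'],
          ['machine translation'], ['supported machine translation']]

# Each inner list holds exactly one string, so " ".join() simply unwraps it.
flat = [" ".join(item) for item in nested]
print(flat)
```

Alternatively, appending the joined string directly inside count_grams (instead of appending a one-element list) would avoid the nesting in the first place.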