Question

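Say I have a string of words: 'a b c d e f'. I want to generate a list of multi-word terms from this string.

Word order matters. The term 'f e d' should not be generated from the example above.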

Edit: Also, words should not be skipped. 'b d f' or 'a c' should not be generated.

What I have right now (it prints every term of up to three words):

doc = 'a b c d e f'
terms = []
one_before = None  # the previous word
two_before = None  # the word before the previous one
for word in doc.split():
    terms.append(word)
    if one_before:
        terms.append(' '.join([one_before, word]))
    if two_before:
        terms.append(' '.join([two_before, one_before, word]))
    # Slide the two-word history forward.
    two_before = one_before
    one_before = word

for term in terms:
    print(term)

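How can I turn this into a recursive function so that I can pass in a variable maximum number of words per term?

Application

I will use this to generate multi-word terms from the readable text in HTML documents. The overall goal is a latent semantic analysis of a large corpus (about 2 million documents). That is why keeping word order matters (natural language processing and so on).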

Answer 1

This isn't recursive, but I think it does what you want.

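doc = 'a b c d e f'
words = doc.split()
max_words = 3  # longest term to generate, in words

for index in range(len(words)):
    for n in range(max_words):
        # Only emit the term if it fits inside the word list.
        if index + n < len(words):
            print(' '.join(words[index:index + n + 1]))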

Here is a recursive solution:

def find_terms(words, max_words_per_term):
    if len(words) == 0:
        return []
    return [" ".join(words[:i+1]) for i in range(min(len(words), max_words_per_term))] + find_terms(words[1:], max_words_per_term)


doc = 'a b c d e f'
words = doc.split()
for term in find_terms(words, 3):
    print(term)

Here is the recursive function again, with some explanatory variables and comments.

def find_terms(words, max_words_per_term):   

    # If there are no words, you've reached the end. Stop.    
    if len(words) == 0:
        return []      

    # What's the max term length you could generate from the remaining 
    # words? It's the lesser of max_words_per_term and how many words 
    # you have left.                                                         
    max_term_len = min(len(words), max_words_per_term)       

    # Find all the terms that start with the first word.
    initial_terms = [" ".join(words[:i+1]) for i in range(max_term_len)]

    # Here's the recursion. Find all of the terms in the list 
    # of all but the first word.
    other_terms = find_terms(words[1:], max_words_per_term)

    # Now put the two lists of terms together to get the answer.
    return initial_terms + other_terms
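
One thing to keep in mind with the recursive version: it makes one call per word, so a long document can run into CPython's recursion limit (1000 frames by default). As a rough sketch, the same terms in the same order can be built without recursion. The name find_terms_flat is made up here for illustration and is not part of the answer above.

```python
def find_terms_flat(words, max_words_per_term):
    """Non-recursive sketch: same terms, same order as the recursive version."""
    return [
        " ".join(words[start:start + length])
        for start in range(len(words))
        # A term cannot be longer than the words remaining from `start`.
        for length in range(1, min(max_words_per_term, len(words) - start) + 1)
    ]


words = 'a b c d e f'.split()
for term in find_terms_flat(words, 3):
    print(term)
```

Because it only loops, this variant should cope with documents of many thousands of words, which matters for a corpus of the size mentioned in the question.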

Answer 2

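I suggest making your function a generator and then generating as many terms as you need. You would need to change print to yield (and obviously turn the whole block into a function).

You could also look at the itertools module, which is quite useful for this kind of work.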
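
A minimal sketch of what that could look like, assuming the loop from Answer 1 as the starting point. The function name iter_terms is hypothetical; itertools.islice is used here only to show taking a limited number of terms from the generator.

```python
from itertools import islice


def iter_terms(words, max_words_per_term):
    """Lazily yield every run of 1..max_words_per_term consecutive words."""
    for start in range(len(words)):
        # Never read past the end of the word list.
        longest = min(max_words_per_term, len(words) - start)
        for length in range(1, longest + 1):
            yield " ".join(words[start:start + length])


words = 'a b c d e f'.split()
# Take only the first five terms; the rest are never computed.
first_five = list(islice(iter_terms(words, 3), 5))
print(first_five)
```

Since the terms are produced one at a time, nothing forces the full list into memory, which may help when processing a large corpus.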

Answer 3

Why are you doing this? You can just use a for loop and itertools.combinations().
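
One possible reading of this suggestion, sketched under the question's constraint that words must not be skipped: calling combinations() on the words themselves would also produce terms like 'a c', so the sketch below applies it to slice boundaries instead. The function name contiguous_terms is an assumption made for this example.

```python
from itertools import combinations


def contiguous_terms(words, max_words_per_term):
    """Build terms from (start, end) slice boundaries so no word is skipped."""
    terms = []
    # Each pair (start, end) with start < end names the slice words[start:end].
    for start, end in combinations(range(len(words) + 1), 2):
        if end - start <= max_words_per_term:
            terms.append(" ".join(words[start:end]))
    return terms


words = 'a b c d e f'.split()
for term in contiguous_terms(words, 3):
    print(term)
```

This visits every boundary pair and discards the ones that are too long, so the work grows quadratically with document length; the plain nested loop in Answer 1 avoids that.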

Answer 4

What you are looking for is an N-gram algorithm. That will give you [a, ab, b, bc, c, cd, ...].
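
As an illustration, word-level n-grams can be sketched with a sliding window built from zip. The helper name ngrams is hypothetical and this is only one of several ways to write it.

```python
def ngrams(words, n):
    """Return all runs of exactly n consecutive words, in document order."""
    # zip over n staggered views of the list; zip stops at the shortest one.
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(n)))]


words = 'a b c d e f'.split()
# Collect unigrams, bigrams and trigrams to cover terms of up to three words.
terms = [term for n in range(1, 4) for term in ngrams(words, n)]
print(terms)
```

Note that this groups the terms by length rather than by starting word, so the order differs from the other answers even though the set of terms is the same.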

How can I generate multi-word terms recursively?

4 answers: