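Say I have a string of words: 'a b c d e f'. I want to generate a list of multi-word terms from this string.
Word order matters. The term 'f e d' should not be generated from the example above.
Edit: Also, words should not be skipped. 'a c' or 'b d f' should not be generated.
What I have right now, which prints every term of one to three words: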
doc = 'a b c d e f'
terms = []
one_before = None
two_before = None
for word in doc.split():
    # Every word is a one-word term.
    terms.append(word)
    if one_before:
        terms.append(' '.join([one_before, word]))
    if two_before:
        terms.append(' '.join([two_before, one_before, word]))
    # Slide the two-word history forward.
    two_before = one_before
    one_before = word
for term in terms:
    print(term)
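How would I make this a recursive function so that I can pass it a variable maximum number of words per term?
Application
I'll be using this to generate multi-word terms from the readable text in HTML documents. The overall goal is a latent semantic analysis of a large corpus (about two million documents). This is why keeping word order matters (natural language processing and so on).
Answer 0 (score: 11)
This isn't recursive, but I think it does what you want.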
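doc = 'a b c d e f'
words = doc.split()
max_words = 3  # renamed from max so the built-in is not shadowed
for index in range(len(words)):
    for n in range(max_words):
        # Only emit terms that fit inside the word list.
        if index + n < len(words):
            print(' '.join(words[index:index + n + 1]))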
Here's a recursive solution:
def find_terms(words, max_words_per_term):
    if len(words) == 0:
        return []
    return [" ".join(words[:i + 1]) for i in range(min(len(words), max_words_per_term))] + find_terms(words[1:], max_words_per_term)

doc = 'a b c d e f'
words = doc.split()
for term in find_terms(words, 3):
    print(term)
Here's the recursive function again, with some explaining variables and comments.
def find_terms(words, max_words_per_term):
    # If there are no words, you've reached the end. Stop.
    if len(words) == 0:
        return []

    # What's the max term length you could generate from the remaining
    # words? It's the lesser of max_words_per_term and how many words
    # you have left.
    max_term_len = min(len(words), max_words_per_term)

    # Find all the terms that start with the first word.
    initial_terms = [" ".join(words[:i + 1]) for i in range(max_term_len)]

    # Here's the recursion. Find all of the terms in the list
    # of all but the first word.
    other_terms = find_terms(words[1:], max_words_per_term)

    # Now put the two lists of terms together to get the answer.
    return initial_terms + other_terms
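Answer 1 (score: 3)
I would suggest making your function a generator and then generating only as many terms as you need. You would have to change print to yield (and, obviously, turn the whole block into a function).
You might also have a look at the itertools module; it is quite useful for the kind of work you are doing.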
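A minimal sketch of that generator idea, assuming Python 3. The name `iter_terms` and its parameters are made up for illustration and do not come from the question. It yields terms lazily, so a caller can stop early with `itertools.islice`, and it does not recurse once per word, so long documents will not run into Python's recursion limit.

```python
from itertools import islice


def iter_terms(words, max_words_per_term):
    """Yield every run of 1..max_words_per_term consecutive words."""
    for start in range(len(words)):
        # A term may not run past the last word or exceed the maximum length.
        last_stop = min(len(words), start + max_words_per_term)
        for stop in range(start + 1, last_stop + 1):
            yield " ".join(words[start:stop])


words = "a b c d e f".split()

# Take only the first five terms; the rest are never computed.
first_five = list(islice(iter_terms(words, 3), 5))
print(first_five)  # ['a', 'a b', 'a b c', 'b', 'b c']
```

Wrapping the call in `list(...)` gives back the full list when every term is needed.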
Answer 2 (score: 3)
Why are you doing it this way? You can just use a for loop and itertools.combinations().
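One caveat worth checking: applied directly to the words, `itertools.combinations` also produces terms that skip words, such as 'a c', which the question rules out. The sketch below is one possible reading of this suggestion, not necessarily what the answerer had in mind. It applies `combinations` to slice boundaries instead, so every pair names a contiguous run. The function name `contiguous_terms` is hypothetical.

```python
from itertools import combinations


def contiguous_terms(words, max_words_per_term):
    # Each (start, stop) pair with start < stop names one contiguous slice.
    boundaries = range(len(words) + 1)
    return [
        " ".join(words[start:stop])
        for start, stop in combinations(boundaries, 2)
        if stop - start <= max_words_per_term
    ]


words = "a b c d e f".split()

# combinations over the words themselves skips words:
skipping = [" ".join(pair) for pair in combinations(words, 2)]
print("a c" in skipping)  # True

# combinations over slice boundaries keeps terms contiguous:
terms = contiguous_terms(words, 3)
print("a c" in terms)  # False
```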
Answer 3 (score: 1)
What you are looking for is an N-gram algorithm. That will give you [a, ab, b, bc, c, cd, ...].
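As a rough illustration of word-level n-grams, here is a small sketch assuming Python 3. The helper name `ngrams` is made up for this example and is not a library function. It zips `n` staggered views of the word list, so each tuple is one run of `n` consecutive words. Unlike the list above, the terms come out grouped by length.

```python
def ngrams(words, n):
    # zip stops at the shortest view, so incomplete runs are dropped.
    staggered = (words[i:] for i in range(n))
    return [" ".join(gram) for gram in zip(*staggered)]


words = "a b c d e f".split()

# Collect all terms of one to three words, grouped by length.
for n in range(1, 4):
    print(n, ngrams(words, n))
```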