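Today is one of those days where all my programming knowledge seems to fail me, and no amount of coffee administered via IV is helping the situation.
I have been presented with a list of phrases; here are some examples: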
"tax policies when emigrating from uk"
"shipping to scotland from california"
"immigrating to sweden"
"shipping good to australia"
"shipping good to new zealand"
"how to emigrate to california from the uk"
"shipping services from london to usa"
"cost of shipping from usa to uk"
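Now I need to start doing word frequency analysis on this. Luckily that's pretty simple in Python, and I wrote the following function to take this list and return a Counter of the most common words.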
from collections import Counter

def count(phrases):
    counter = Counter()
    for phrase in phrases:
        for word in phrase.split(" "):
            counter[word] += 1
    return counter
This rocks, because now I can easily get the most common words from the list of phrases with count(phrases).most_common(5).
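Now it gets harder. Let's say I set an arbitrary depth, say 5. The most popular word in that list (that isn't a glue word such as "to", "from" or "and") is "shipping". I now need to take the word "shipping" and run the count again over all the phrases that contain the term "shipping". Again, mostly simple.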
def filter_for_word(word, phrases):
    return filter(lambda x: word in x, phrases)

count(filter_for_word("shipping", phrases))
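For the glue-word part, one possible approach is to skip a small stop-word set while counting, and to match whole words rather than substrings when filtering. This is only a sketch: `STOP_WORDS`, `count_without_glue` and `phrases_with_word` are hypothetical names, and the stop-word list itself is an assumption you would tune for your own data.

```python
from collections import Counter

# Assumed glue words; extend this set to suit your own phrases.
STOP_WORDS = {"to", "from", "and", "of", "the", "how", "when"}

def count_without_glue(phrases, stop_words=STOP_WORDS):
    """Count words across all phrases, ignoring glue words."""
    counter = Counter()
    for phrase in phrases:
        counter.update(w for w in phrase.split() if w not in stop_words)
    return counter

def phrases_with_word(word, phrases):
    """Keep phrases containing `word` as a whole token, not as a substring."""
    return [p for p in phrases if word in p.split()]

phrases = [
    "tax policies when emigrating from uk",
    "shipping to scotland from california",
    "immigrating to sweden",
    "shipping good to australia",
    "shipping good to new zealand",
    "how to emigrate to california from the uk",
    "shipping services from london to usa",
    "cost of shipping from usa to uk",
]

# Most common non-glue word, then the counts within phrases that contain it.
top_word, top_count = count_without_glue(phrases).most_common(1)[0]
print(top_word, top_count)  # shipping 5
print(count_without_glue(phrases_with_word(top_word, phrases)).most_common(3))
```

Matching on `p.split()` means "uk" would not match a phrase that merely contains "ukulele", which the substring test `word in x` would accept.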
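This is where it starts to get hairy: I need to keep going down until I reach my depth. Then I need to be able to display this information along with the most common phrases.
I started trying to do this with the following function, but I can't get my head around the next few steps to tie everything together and display it in a nice structure and format.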
def dive(depth, num, phrases):
    phrase_tree = {}
    for word, value in dict(count(phrases).most_common(num)).iteritems():
        phrase_tree[word] = [value, {}]
    current = phrase_tree
    while True:
        if depth == 0:
            return phrase_tree
        for word in current:
            current[word][1] = {key: [v, {}] for (key, v) in count(filter_for_word(word, phrases)).most_common(num)}
        # debug!!
        return current
If anyone could help me put all this together I would be very grateful.
Answer 0 (score: 0)
def filter_for_words(words, phrases):
    # Keep only the phrases that contain every word in `words`.
    # A list comprehension avoids the lazy filter/lambda late-binding trap on Python 3.
    for word in words:
        phrases = [phrase for phrase in phrases if word in phrase]
    return phrases

def dive(depth, num, phrases, phrase_tree=None, f_words=None):
    # Each node maps word -> [count, subtree of words found alongside it].
    if phrase_tree is None:
        phrase_tree = {}
        for word, value in count(phrases).most_common(num):
            phrase_tree[word] = [value, {}]
    if f_words is None:
        f_words = []
    if depth == 0:
        return phrase_tree
    for word in phrase_tree:
        # Words on the path from the root down to this node.
        words = f_words + [word]
        child_tree = {key: [v, {}] for (key, v) in count(filter_for_words(words, phrases)).most_common(num)}
        phrase_tree[word][1] = child_tree
        dive(depth - 1, num, phrases, child_tree, words)
    return phrase_tree
It's not very efficient, but it should work.
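For the display part of the question, a small recursive formatter over the `{word: [count, subtree]}` structure is one option. This is a sketch that assumes the tree shape built by `dive` above; `format_tree` is a hypothetical helper, and the `sample` tree is written by hand for illustration rather than taken from a real run.

```python
def format_tree(tree, indent=0):
    """Return display lines for a {word: [count, subtree]} tree, most frequent first."""
    lines = []
    # Sort by descending count; ties keep their original dictionary order.
    for word, (value, subtree) in sorted(tree.items(), key=lambda kv: -kv[1][0]):
        lines.append("{}{} ({})".format("  " * indent, word, value))
        lines.extend(format_tree(subtree, indent + 1))
    return lines

# Hand-written illustrative tree, not actual output of dive().
sample = {
    "uk": [3, {}],
    "shipping": [5, {"good": [2, {}], "usa": [2, {}]}],
}

print("\n".join(format_tree(sample)))
```

Returning a list of lines rather than printing directly keeps the helper easy to reuse, for example to write the tree to a file.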