Question

Say I have a .txt file in which phrases are separated by newlines (\n).

I split them into a list of phrases:

["Rabbit eats banana", "Fox eats apple", "bear eats sandwich", "Tiger sleeps"]

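What I need to do:

I need to end up with a list of word objects, where each word has:

a name
a frequency (how many times it occurs in the phrases)
the list of phrases it belongs to

For the word eats, the result would be: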

{'name':'eats', 'frequency': '3', 'phrases': [0,1,2]}

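What I have done so far:

Right now I do it the simple way, which does not perform well:

I get the list of words (by splitting the .txt file on the space character (" ")):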

words = split_my_input_file_by_spaces
#["banana", 'eats', 'apple', ....]

Then I loop over every word and every phrase:

for word in words:
    for phrase in phrases:
       if word in phrase:
          #add word freq +1

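What is wrong with the current approach?

I can have up to 10,000 phrases, so I am running into speed and performance problems. I want to make it faster.

I came across an interesting and promising way of counting occurrences (but I don't know how to build the list of phrases each word belongs to):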

from collections import Counter
list1=['apple','egg','apple','banana','egg','apple']
counts = Counter(list1)
print(counts)
# Counter({'apple': 3, 'egg': 2, 'banana': 1})

Answer 1

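What you are referring to (an interesting and promising way of counting occurrences) is called a HashMap, or a dictionary in Python. These are key-value stores that let you store and update a value (such as a count, a list of phrases, or a Word object) with constant-time retrieval on average.

You mentioned that you are running into runtime problems. Switching to a HashMap-based approach will speed up your algorithm significantly (from quadratic to linear).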
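
To make the link to the Counter from the question concrete, here is a minimal sketch (the two sample phrases are just illustrative data). Counter is itself a dict subclass, so one pass over the words is enough to build the counts, and each lookup afterwards is an ordinary dictionary access:

```python
from collections import Counter

# Illustrative sample data, not the asker's real file
phrases = ["Rabbit eats banana", "Fox eats apple"]

# One pass over every word of every phrase
counts = Counter(word for phrase in phrases for word in phrase.split())

print(isinstance(counts, dict))  # True: Counter is a dict subclass
print(counts["eats"])            # 2
```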

For example, in Python:

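phrases = ["Hello there", "Hello where"]
wordCounts = {}   # word -> number of occurrences
wordPhrases = {}  # word -> phrases the word occurs in

for phrase in phrases:
    for word in phrase.split():
        if word in wordCounts:
            # Seen before: bump the count and record this phrase as well
            wordCounts[word] += 1
            wordPhrases[word].append(phrase)
        else:
            # First occurrence: start the count and the phrase list
            wordCounts[word] = 1
            wordPhrases[word] = [phrase]

print(wordCounts)
print(wordPhrases)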

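This leaves you with two dictionaries (key order may differ depending on your Python version):

The frequency of each word, {word: count}: {'there': 1, 'where': 1, 'Hello': 2}
The phrases each word appears in, {word: [phrases]}: {'there': ['Hello there'], 'where': ['Hello where'], 'Hello': ['Hello there', 'Hello where']}

From here, it takes only a little more work to reach the output you want.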
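
One possible way to take that last step is sketched below. It is only a sketch under a few assumptions: build_word_index is a hypothetical helper name, "frequency" is taken to mean the number of phrases that contain the word, phrases are referenced by their index as in the question's example, and words are matched case-sensitively after splitting on whitespace. Note that the frequency comes out as an int rather than the string '3' shown in the question.

```python
from collections import defaultdict

phrases = ["Rabbit eats banana", "Fox eats apple", "bear eats sandwich", "Tiger sleeps"]

def build_word_index(phrases):
    """Return one dict per distinct word: name, frequency, phrase indices."""
    phrase_ids = defaultdict(list)  # word -> indices of phrases containing it
    for i, phrase in enumerate(phrases):
        # set() so a word repeated inside one phrase is recorded only once
        for word in set(phrase.split()):
            phrase_ids[word].append(i)
    return [
        {"name": word, "frequency": len(ids), "phrases": ids}
        for word, ids in phrase_ids.items()
    ]

words = build_word_index(phrases)
eats = next(w for w in words if w["name"] == "eats")
print(eats)  # {'name': 'eats', 'frequency': 3, 'phrases': [0, 1, 2]}
```

Each phrase is split exactly once and each word costs one dictionary update, so the work grows linearly with the total number of words instead of with words times phrases.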
