Question

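I have a file containing 10,000 words. I wrote a program to find the anagram words in that file, but it takes far too long to produce any output. For small files the program works fine. I'm trying to optimize the code.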

count=0
i=0
j=0
with open('file.txt') as file:
  lines = [i.strip() for i in file]
  for i in range(len(lines)):
      for j in range(i):
          if sorted(lines[i]) == sorted(lines[j]):
              #print(lines[i])
              count=count+1
              j=j+1
              i=i+1
print('There are ',count,'anagram words')

Answer 1

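I don't fully understand your code (for example, why do you increment i and j inside the loop?). But the main problem is that you have a nested loop, which makes the running time of the algorithm O(n^2), i.e. if the file gets 10 times bigger, the execution time becomes (approximately) 100 times longer.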
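To put rough numbers on that, here is a small sketch that only counts how many (i, j) pairs a nested loop of the shape `for i in range(n): for j in range(i)` visits. The helper name `comparisons` is made up for illustration, and the count ignores the cost of the two `sorted()` calls made per pair.

```python
def comparisons(n):
    """Number of (i, j) pairs with j < i visited by the nested loop."""
    return n * (n - 1) // 2

small = comparisons(10_000)    # a 10,000-line file
large = comparisons(100_000)   # a file 10 times bigger

print(small)          # 49995000
print(large)          # 4999950000
print(large / small)  # roughly 100
```

So even at 10,000 lines the loop performs about 50 million comparisons, each of which sorts two strings.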

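So you need a way to avoid that. One possibility is to store the lines in a smarter way, so that you don't have to walk through all the other lines every time. The running time then becomes O(n) in the number of lines (treating the cost of sorting each short word as constant). In this case you can use the fact that anagrams consist of the same characters, only in a different order. So you can use the "sorted" variant of each line as a key in a dictionary, and store all lines that can be made from the same letters in a list under that key. There are other possibilities of course, but in this case I think it works out quite nicely :-)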
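As a tiny in-memory sketch of the key idea (the word list is made-up sample data, and `anagram_key` is just an illustrative helper name):

```python
from collections import defaultdict

def anagram_key(word):
    # Anagrams contain the same letters, so sorting them yields one shared key.
    return ''.join(sorted(word))

words = ['listen', 'silent', 'enlist', 'google', 'elgoog', 'cat']  # sample data

groups = defaultdict(list)
for word in words:
    # One dictionary lookup per word instead of comparing against every other word.
    groups[anagram_key(word)].append(word)

print(anagram_key('listen'))  # eilnst
print(groups['eilnst'])       # ['listen', 'silent', 'enlist']
```

Each word is handled once, which is where the linear running time comes from.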

So, a fully working example:

#!/usr/bin/env python3

from collections import defaultdict

d = defaultdict(list)
with open('file.txt') as file:
    lines = [line.strip() for line in file]
    for line in lines:
        sorted_line = ''.join(sorted(line))
        d[sorted_line].append(line)

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams

# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')

Update

Without duplicates, and without using the collections module (although I strongly recommend using it):

#!/usr/bin/env python3

d = {}
with open('file.txt') as file:
    lines = [line.strip() for line in file]
    lines = set(lines)  # remove duplicates
    for line in lines:
        sorted_line = ''.join(sorted(line))
        if sorted_line in d:
            d[sorted_line].append(line)
        else:
            d[sorted_line] = [line]

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams

# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')

Answer 2

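It's not clear whether you are accounting for duplicates, but if you aren't, you can remove the duplicates from your word list, which I think will save you a lot of running time. You can then check for anagrams and use Pipe() to get their total. This should do it: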

pipes.pop_back();
pipes.insert(pipes.begin(), Pipe(inf * 160));
inf++;
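
For the idea described in the prose of this answer (drop duplicate words first, then total the anagrams), a minimal Python sketch might look like the following. The function name `count_anagram_words` and the sample list are hypothetical and not taken from the answer, and the counting convention (a group of m anagram words contributes m - 1) is borrowed from Answer 1.

```python
from collections import Counter

def count_anagram_words(words):
    """Count words that are an anagram of some other, different word,
    not counting the first word of each anagram group."""
    unique_words = set(words)  # remove exact duplicates first
    # Count how many distinct words share each sorted-letter key.
    key_counts = Counter(''.join(sorted(w)) for w in unique_words)
    # A group of m words contributes m - 1 to the total.
    return sum(m - 1 for m in key_counts.values())

sample = ['listen', 'silent', 'listen', 'enlist', 'cat', 'act', 'dog']
print(count_anagram_words(sample))  # 3
```

To run it on a file, the list could be built with `[line.strip() for line in file]`, as in the question.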

Anagrams in a large file
