Question

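Suppose I have a huge .txt file full of random characters, and I want to find the "rare" ones. Is there a Python module (or anything, really) written for this purpose? Preferably for version 3.x, but I also have a machine with Python 2.7 if that works better. If so, where can I find a basic explanation of how it works? Thank you very much.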

Answer 1

from collections import Counter

c = Counter("text")
print(c.most_common())

Output:

[('t', 2), ('e', 1), ('x', 1)]
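
For a really huge file it may be better not to read everything into memory at once. Here is a minimal sketch for Python 3, assuming the file is text that decodes as UTF-8; the function name `count_chars` and the chunk size are illustrative choices, not something from the answer above:

```python
from collections import Counter

def count_chars(path, chunk_size=1 << 20, encoding="utf-8"):
    """Count characters in a file without loading it all into memory."""
    counts = Counter()
    with open(path, "r", encoding=encoding) as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns the empty string at end of file.
        for chunk in iter(lambda: f.read(chunk_size), ""):
            counts.update(chunk)
    return counts
```

Calling `count_chars("big.txt").most_common()` then gives the same kind of list as above, with the rarest characters at the end.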

Answer 2

d = {}
# filename is the path to your text file
with open(filename, "r") as f:
    for c in f.read():
        if c in d:
            d[c] += 1
        else:
            d[c] = 1

print(d)

You can then search d for the least frequent letters.
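
One possible way to do that search, as a small sketch. The example counts are made up; in practice d would be the dictionary built by the loop above:

```python
# Example counts; in practice d is the dictionary built from the file.
d = {"a": 5, "b": 1, "c": 3, "d": 1}

# The single rarest character (the first one encountered if several tie).
rarest = min(d, key=d.get)

# All (character, count) pairs sorted from rarest to most common.
by_count = sorted(d.items(), key=lambda item: item[1])

print(rarest)
print(by_count[:2])
```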

Answer 3

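Here is one way to do this with a Counter dictionary. It prints the rare characters together with their number of occurrences. A rare character is defined here as one whose count is at or below a threshold: the mean number of occurrences per distinct character multiplied by a weighting factor, which I have set to 0.5 in this example.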

from collections import Counter

with open(fname, 'r') as f:
    text = f.read()

counter = Counter(text)
mean = len(text) / len(counter)
print('Mean:', mean)

weight = 0.5
thresh = mean * weight
print('Threshold:', thresh)

# Only print results for chars whose occurrence count is at or below the threshold
for ch, count in reversed(counter.most_common()):
    if count <= thresh:
        print('{0!r}: {1}'.format(ch, count))
    else:
        break

If this is an actual text file, you may want to filter out certain characters, such as newlines and spaces.
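
A minimal sketch of such filtering, assuming that all whitespace should be ignored; the sample string stands in for the file contents:

```python
from collections import Counter

text = "ab a\nb\tz  a\n"  # stand-in for the contents read from the file

# Skip spaces, tabs, newlines and other whitespace while counting.
counter = Counter(ch for ch in text if not ch.isspace())

print(counter.most_common())
```

If only specific characters should be dropped, the condition could instead be something like `ch not in " \n"`.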

Answer 4

Use the collections module and access the n least common elements with c.most_common()[:-n-1:-1]:

from collections import Counter
c = Counter("sadaffdsagfgdfaafsasdfs3213jlkjk22jl31j2k13j313j13")
res = c.most_common()[:-3-1:-1]
print("The 3 Rarest characters are:", res[0][0], ",", res[1][0], "and", res[2][0])

Result:

The 3 Rarest characters are: l , g and k
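
The slice [:-n-1:-1] walks the list backwards from the last element and stops after n items, so it yields the n least common entries, rarest first. A small sketch wrapping it in a helper; the name `rarest` is made up for illustration:

```python
from collections import Counter

def rarest(text, n):
    """Return the n least common (char, count) pairs, rarest first."""
    ranked = Counter(text).most_common()  # sorted from most to least common
    return ranked[:-n-1:-1]               # last n entries, in reverse order

print(rarest("aaaabbbccd", 2))
```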

Answer 5

To find the 10 rarest characters in a text:

from collections import Counter

rarest_chars = Counter(text).most_common()[-10:]

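"Character" here means simply a Unicode code point: "a" and "A" are treated as different characters. It also means that u'g̈' (U+0067 U+0308) is counted as two characters. See how to handle such cases in the related question: Most common character in a string.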
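
To make the code-point caveat concrete, here is a rough sketch that attaches combining marks to the preceding base character before counting. It only approximates user-perceived characters using unicodedata.combining from the standard library (full Unicode grapheme cluster segmentation is more involved), and the helper name `clusters` is hypothetical:

```python
import unicodedata
from collections import Counter

def clusters(text):
    """Yield each base character together with the combining marks after it."""
    current = ""
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if current and unicodedata.combining(ch):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current

text = "g\u0308ag\u0308A"        # 'g' + COMBINING DIAERESIS, 'a', the same again, 'A'
print(len(Counter(text)))         # number of distinct code points
print(Counter(clusters(text)))    # counts per base character plus its marks
```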

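counter.most_common()[-10:] could be written more efficiently as heapq.nsmallest(10, counter.items(), key=itemgetter(1)): .items() returns (character, its_count) pairs, and key=itemgetter(1) extracts the count, so the 10 pairs with the smallest counts are returned.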
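
Spelled out with its imports, the heapq variant might look like the sketch below; the short sample string and n=2 are only there to keep the example small:

```python
from collections import Counter
from heapq import nsmallest
from operator import itemgetter

text = "aaaabbbccd"  # stand-in for the real text
counter = Counter(text)

# The n pairs with the smallest counts, in ascending order of count.
rarest = nsmallest(2, counter.items(), key=itemgetter(1))
print(rarest)
```

Note that nsmallest returns the rarest pair first, whereas the most_common() slice ends with the rarest pair.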

Finding rare characters with Python

5 answers: