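Counting letter frequency in a list of words, excluding duplicates within the same word

Time: 2019-01-16 19:06:09

Tags: python algorithm

I am trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because a letter that repeats within a word should be counted only once for that word. So I need help finding a way to count letter frequencies across the whole list where each letter counts at most once per word, ignoring any further occurrences in the same word.

For example, if I have: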

words = ["tree", "bone", "indigo", "developer"]

The frequencies would be:

letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}

As you can see from the letters dictionary, 'e' is 3 and not 6, because if 'e' is repeated within the same word, the repeats should be ignored.

This is the algorithm I came up with, implemented in Python:

for word in words:
    count = 0

    for letter in word:
        if letter.isalpha():
            if ((letters[letter.lower()] > 0 and count == 0) or
                    (letters[letter.lower()] == 0 and count == 0)):

                letters[letter.lower()] += 1
                count = 1

            elif letters[letter.lower()] == 0 and count == 1:
                letters[letter.lower()] += 1

But it still needs work and I can't think of anything else. I would be very glad for any help toward a working solution.

9 answers:

Answer 0: (score: 57)

A variation on @Primusa's answer that doesn't use update: basically, convert each word to a set, then iterate over each set.
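The idea described above can be sketched as follows. This is a minimal sketch rather than the answer's exact code: it assumes a `Counter` fed by a generator expression over the set of each lowercased word (the same shape as `f2()` in a later answer). The names `words` and `counts` are illustrative, and the `isalpha()` filter is an assumption borrowed from the question's own code.

```python
from collections import Counter

words = ["tree", "bone", "indigo", "developer"]

# set(word.lower()) keeps each letter once per word, so a letter
# is counted at most once for every word it appears in.
counts = Counter(
    letter
    for word in words
    for letter in set(word.lower())
    if letter.isalpha()  # assumption: skip digits and punctuation
)

print(counts["e"])  # 3: 'e' occurs in tree, bone and developer
```

Because the generator is handed directly to `Counter`, no explicit `update()` call is needed.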

Answer 1: (score: 18)

Create a Counter object, then update it with the set of each word:

from collections import Counter

wordlist = ["tree","bone","indigo","developer"]

c = Counter()
for word in wordlist:
    c.update(set(word.lower()))

print(c)

Output:

Counter({'e': 3, 'o': 3, 'r': 2, 'n': 2, 'd': 2, 't': 1, 'b': 1, 'i': 1, 'g': 1, 'v': 1, 'p': 1, 'l': 1})

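Note that letters that do not occur in wordlist are absent from the Counter. That is fine, because on lookup a Counter behaves like a defaultdict(int): accessing a missing key returns the default value 0 (unlike defaultdict, it does not insert the key).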
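A small sketch of that behaviour, reusing the `wordlist` from this answer. It also calls `most_common`, which is one way to get at the most frequent letters the question asks about; the choice of `2` is arbitrary.

```python
from collections import Counter

wordlist = ["tree", "bone", "indigo", "developer"]

c = Counter()
for word in wordlist:
    c.update(set(word.lower()))

print(c["a"])            # 0, although 'a' was never counted
print("a" in c)          # False: the lookup did not insert the key
print(c.most_common(2))  # the two most frequent letters with their counts
```

Here 'e' and 'o' tie at 3, so the order in which `most_common` lists them may vary between runs.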

Answer 2: (score: 15)

One without Counter:

words=["tree","bone","indigo","developer"]
d={}
for word in words:         # iterate over words
    for i in set(word):    # to remove the duplication of characters within word
        d[i]=d.get(i,0)+1

Output:

{'b': 1,
 'd': 2,
 'e': 3,
 'g': 1,
 'i': 1,
 'l': 1,
 'n': 2,
 'o': 3,
 'p': 1,
 'r': 2,
 't': 1,
 'v': 1}

Answer 3: (score: 11)

Comparing the speed of the solutions posted so far:

import timeit
from collections import Counter

def f1(words):
    c = Counter()
    for word in words:
        c.update(set(word.lower()))
    return c

def f2(words):
    return Counter(
        c
        for word in words
        for c in set(word.lower()))

def f3(words):
    d = {}
    for word in words:
        for i in set(word.lower()):
            d[i] = d.get(i, 0) + 1
    return d

My timing function (using different sizes for the word list):

word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
]

for exp in range(5):
    words = word_list * 10**exp

    result_list = []
    for i in range(1, 4):
        t = timeit.timeit(
            'f(words)',
            'from __main__ import words,  f{} as f'.format(i),
            number=100)
        result_list.append((i, t))

    print('{:10,d} words | {}'.format(
        len(words),
        ' | '.join(
            'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

Results:

        10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
       100 words | f1   0.0245 sec | f2   0.0082 sec | f3   0.0113 sec
     1,000 words | f1   0.2450 sec | f2   0.0812 sec | f3   0.1134 sec
    10,000 words | f1   2.4601 sec | f2   0.8113 sec | f3   1.1335 sec
   100,000 words | f1  24.4195 sec | f2   8.1828 sec | f3  11.2167 sec

The Counter fed by a generator expression (f2() here) appears to be the fastest. Calling counter.update() once per word (f1() here) appears to be the slowest.
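Before reading too much into timings like these, it may be worth confirming that the three functions agree on the result. The following is a small sanity check, not part of the original benchmark; the names `results` and `all_equal` are illustrative, and actual timings will depend on the machine and Python version.

```python
from collections import Counter

def f1(words):
    c = Counter()
    for word in words:
        c.update(set(word.lower()))
    return c

def f2(words):
    return Counter(c for word in words for c in set(word.lower()))

def f3(words):
    d = {}
    for word in words:
        for i in set(word.lower()):
            d[i] = d.get(i, 0) + 1
    return d

words = ["tree", "bone", "indigo", "developer"] * 10

# Convert to plain dicts so Counter and dict results compare key by key.
results = [dict(f(words)) for f in (f1, f2, f3)]
all_equal = results[0] == results[1] == results[2]
print(all_equal)
```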

Answer 4: (score: 1)

Try using a dictionary comprehension:

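import string
words = ["tree", "bone", "indigo", "developer"]
# count the words that contain each letter, so a letter counts once per word
print({k: sum(k in w for w in words) for k in string.ascii_lowercase})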

Answer 5: (score: 1)

A bit late to the party, but here you go:

freq = {k: sum(k in word for word in words) for k in set(''.join(words))}

Returns:

{'i': 1, 'v': 1, 'p': 1, 'b': 1, 'e': 3, 'g': 1, 't': 1, 'n': 2, 'd': 2, 'o': 3, 'l': 1, 'r': 2}

Answer 6: (score: 1)

from collections import Counter  
import string  

words=["tree","bone","indigo","developer"]  
y=Counter(string.ascii_lowercase)  
new_dict=dict(y) 

for k in new_dict:  
    new_dict[k]=0  
trial = 0  
while len(words) > trial:  
    for let in set(words[trial]):    
        if let in new_dict:  
            new_dict[str(let)]=new_dict[str(let)]+1  

    trial = trial +1  
print(new_dict)

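Answer 7: (score: 0)

The other solutions are good, but notably they don't include letters with zero frequency. Here is an approach that does, although it is about 2-3 times slower than the others.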

import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}

This produces a dictionary like:

{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}
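As an alternative sketch for the zero-frequency requirement: count in a single pass with a `Counter`, then look up every lowercase ASCII letter. This is not one of the timed functions, and its speed has not been measured here. The names `seen` and `counts` are illustrative, and the example assumes only ASCII letters matter (anything else is dropped from the result).

```python
import string
from collections import Counter

words = ["tree", "bone", "indigo", "developer"]

# One pass over the words, counting each letter once per word.
seen = Counter(letter for word in words for letter in set(word.lower()))

# Look up every lowercase ASCII letter; Counter returns 0 for letters never seen.
counts = {letter: seen[letter] for letter in string.ascii_lowercase}

print(counts["e"], counts["a"])  # 3 0
```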

Here is my update to Ralf's timings:

        10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
       100 words | f1   0.0019 sec | f2   0.0014 sec | f3   0.0013 sec | f4   0.0034 sec
     1,000 words | f1   0.0180 sec | f2   0.0118 sec | f3   0.0140 sec | f4   0.0298 sec
    10,000 words | f1   0.1960 sec | f2   0.1278 sec | f3   0.1542 sec | f4   0.2648 sec
   100,000 words | f1   2.0859 sec | f2   1.3971 sec | f3   1.6815 sec | f4   3.5196 sec

Based on the following code and the word list from https://github.com/dwyl/english-words/

import string
import timeit
import random
from collections import Counter

def f1(words):
    c = Counter()
    for word in words:
        c.update(set(word.lower()))
    return c

def f2(words):
    return Counter(
        c
        for word in words
        for c in set(word.lower()))

def f3(words):
    d = {}
    for word in words:
        for i in set(word.lower()):
            d[i] = d.get(i, 0) + 1
    return d


def f4(words):
    d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase} 
    return d


with open('words.txt') as word_file:
    # a list, since random.sample() no longer accepts a set (Python 3.11+)
    valid_words = list(set(word_file.read().split()))

for exp in range(5):

    result_list = []
    for i in range(1, 5):
        t = timeit.timeit(
            'f(words)',
            'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
            number=100)
        result_list.append((i, t))

    # words exists only inside timeit's setup namespace, so report the sample size directly
    print('{:10,d} words | {}'.format(
        10**exp,
        ' | '.join(
            'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))

Answer 8: (score: 0)
