Question

Here is my function:

def calcVowelProportion(wordList):
    """
    Calculates the proportion of vowels in each word in wordList.
    """

    VOWELS = 'aeiou'
    ratios = []

    for word in wordList:
        numVowels = 0
        for char in word:
            if char in VOWELS:
                numVowels += 1
        ratios.append(numVowels/float(len(word)))

    return ratios

I am now working with a list of more than 87,000 words, and this algorithm is noticeably slow.

Is there a better way to do this?

Edit:

I tested the algorithms suggested by @ExP with the following class:

    import time

    class vowelProportions(object):
        """
        A series of methods that all calculate the vowel/word length ratio
        in a list of words.
        """

        WORDLIST_FILENAME = "words_short.txt"

        def __init__(self):
            self.wordList = self.buildWordList()
            print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
            print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
            print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
            print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))

        def buildWordList(self):
            inFile = open(self.WORDLIST_FILENAME, 'r', 0)
            wordList = []
            for line in inFile:
                wordList.append(line.strip().lower())
            return wordList

        def cvpOriginal(self, wordList):
            """ My original, slow algorithm"""
            VOWELS = 'aeiou'
            ratios = []

            for word in wordList:
                numVowels = 0
                for char in word:
                    if char in VOWELS:
                        numVowels += 1
                ratios.append(numVowels/float(len(word)))

            return ratios

        def cvpGenerator(self, wordList):
            """ Using a generator expression """
            return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]

        def cvpCount(self, wordList):
            """ Using str.count() """
            return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]

        def cvpTranslate(self, wordList):
            """ Using str.translate() """
            return [len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))/float(len(word)) for word in wordList]

        def timeFunc(self, func, *args):
            start = time.clock()
            func(*args)
            return time.clock() - start

        def calcMeanTime(self, numTrials, func, *args):
            times = [self.timeFunc(func, *args) for x in range(numTrials)]
            return sum(times)/len(times)

The output (for a list of 200 words) is:

Original: 0.0005613667
Generator: 0.0008402738
Count: 0.0012531976
Translate: 0.0003343548

Surprisingly, Generator and Count are even slower than the original (please let me know if my implementation is incorrect).
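One way to cross-check timings like these is the standard `timeit` module, which selects a suitable clock and repeats the call for you. The following is a minimal sketch in Python 3 syntax, not the code that produced the numbers above; the word list and the helper names `cvp_original`, `cvp_count` and `best_time` are placeholders chosen for illustration.

```python
import timeit

VOWELS = 'aeiou'

def cvp_original(word_list):
    """Nested-loop version, equivalent to the original algorithm."""
    ratios = []
    for word in word_list:
        num_vowels = 0
        for char in word:
            if char in VOWELS:
                num_vowels += 1
        ratios.append(num_vowels / len(word))
    return ratios

def cvp_count(word_list):
    """str.count() version."""
    return [sum(word.count(v) for v in VOWELS) / len(word) for word in word_list]

def best_time(func, word_list, repeat=5, number=100):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(word_list), repeat=repeat, number=number)
    return min(timings) / number

if __name__ == '__main__':
    # Placeholder data; substitute the real word list here.
    words = ['apple', 'banana', 'rhythm', 'queue'] * 50
    for name, func in [('Original', cvp_original), ('Count', cvp_count)]:
        print(name, best_time(func, words))
```

Taking the minimum over several repeats tends to be less sensitive to background load than a mean over many single runs.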

I would like to test @John's solution, but I know nothing about trees.

Answer 1

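You should optimize the innermost loop.

I'm pretty sure there are several alternative approaches. Here is what I can come up with right now. I'm not sure how they compare in speed (relative to each other and to your solution).

Using a generator expression: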

numVowels = sum(x in 'aeiou' for x in word)

Using str.count():

numVowels = sum(word.count(x) for x in 'aeiou')

Using str.translate() (assuming there are no capital letters or special symbols):
```
numVowels = len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))
```

With any of these, you can even write the whole function in one line, without list.append().

I'd be curious to know which one is fastest.
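Put together, each variant fits in a single list comprehension. Below is a minimal sketch in Python 3 syntax, with function names chosen here for illustration. Note that in Python 3 `str.translate()` takes a table built with `str.maketrans()` rather than the two-argument Python 2 form, and that, as before, the words are assumed to be lowercase and non-empty.

```python
import string

VOWELS = 'aeiou'
# All lowercase ASCII letters that are not vowels.
CONSONANTS = ''.join(c for c in string.ascii_lowercase if c not in VOWELS)
# Translation table that deletes every consonant (third argument = characters to delete).
DELETE_CONSONANTS = str.maketrans('', '', CONSONANTS)

def ratios_generator(word_list):
    """Generator expression: count characters that are vowels."""
    return [sum(c in VOWELS for c in word) / len(word) for word in word_list]

def ratios_count(word_list):
    """str.count(): add up the occurrences of each vowel."""
    return [sum(word.count(v) for v in VOWELS) / len(word) for word in word_list]

def ratios_translate(word_list):
    """str.translate(): delete the consonants and measure what is left."""
    return [len(word.translate(DELETE_CONSONANTS)) / len(word) for word in word_list]
```

Building the consonant string from `string.ascii_lowercase` avoids accidentally leaving a letter out of a hand-typed list.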

Answer 2

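Since you only care about the ratio of vowels to letters in each word, you could first replace every vowel with a. Then you can try a couple of things that might be faster:

At each step you test for one letter instead of five. That is bound to be faster.
You can sort the whole list and search for the points where it goes from a vowel (now uniformly represented as a) to a non-vowel. This is a tree structure: the number of letters in a word is the level of the tree, and the number of vowels is the number of left branches.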
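The first point can be sketched as follows (Python 3 syntax; the names `TO_A` and `vowel_ratio` are made up for this illustration, and the words are assumed to be lowercase and non-empty). Every vowel is mapped to `a` in one pass, after which a single `count('a')` replaces five separate tests. The sorted-list/tree idea is not shown here.

```python
# Map e, i, o, u onto 'a' so that every vowel becomes the same character.
TO_A = str.maketrans('eiou', 'aaaa')

def vowel_ratio(word):
    """Normalize all vowels to 'a', then count that one letter."""
    normalized = word.translate(TO_A)
    return normalized.count('a') / len(word)

ratios = [vowel_ratio(w) for w in ['banana', 'queue', 'sky']]
print(ratios)
```

Whether the extra translation pass pays for itself compared with counting the five vowels directly would have to be measured.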

Answer 3

Use a regular expression that matches the vowels and count the matches.

>>> import re
>>> s = 'supercalifragilisticexpialidocious'
>>> len(re.findall('[aeiou]', s))
16
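Applied to a whole word list, the same idea might look like the sketch below (Python 3 syntax; `VOWEL_RE` and `vowel_ratios` are illustrative names, and the words are assumed to be lowercase and non-empty). The pattern is compiled once and reused for every word.

```python
import re

# Compile once, reuse for every word in the list.
VOWEL_RE = re.compile('[aeiou]')

def vowel_ratios(word_list):
    """Return the vowel/length ratio of each word using a regex match count."""
    return [len(VOWEL_RE.findall(word)) / len(word) for word in word_list]

print(vowel_ratios(['supercalifragilisticexpialidocious', 'sky']))
```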

Answer 4

for word in wordList:
    numVowels = 0
    for letter in VOWELS:
        numVowels += word.count(letter)
    ratios.append(numVowels/float(len(word)))

Fewer decisions should mean less time, and this also uses built-in methods, which I believe run faster.

Answer 5

import timeit

words = 'This is a test string'

def vowelProportions(words):
    counts, vowels = {}, 'aeiou'
    wordLst = words.lower().split()
    for word in wordLst:
        counts[word] = float(sum(word.count(v) for v in vowels)) / len(word)
    return counts

def f():
    return vowelProportions(words)

print timeit.timeit(stmt = f, number = 17400) # 5 (len of words) * 17400 = 87,000
# 0.838676

Answer 6

Here is how to compute it with a single command line on Linux:

cat wordlist.txt | tr -d aeiouAEIOU | paste - wordlist.txt | gawk '{ FS="\t"; RATIO = length($1)/ length($2); print $2, RATIO }'

Output:

aa 0
ab 0.5
abs 0.666667

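Note: each line of wordlist.txt must contain exactly one word, and an empty line will cause a division-by-zero error. Because tr deletes the vowels before the lengths are compared, the value printed is the proportion of non-vowel characters; the vowel proportion is 1 minus that value.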
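For comparison, here is a rough Python 3 sketch that prints the vowel proportion directly and skips blank lines instead of failing on them. It is an illustration rather than a translation of the pipeline; the function name `vowel_ratio_lines` is invented here, and any iterable of lines (such as an open file) can be passed in.

```python
import io

VOWELS = set('aeiouAEIOU')

def vowel_ratio_lines(lines):
    """Yield (word, vowel ratio) for each non-blank line."""
    for line in lines:
        word = line.strip()
        if not word:
            continue  # a blank line would otherwise divide by zero
        yield word, sum(c in VOWELS for c in word) / len(word)

# Stand-in for open('wordlist.txt'); one word per line.
sample = io.StringIO('aa\nab\n\nabs\n')
for word, ratio in vowel_ratio_lines(sample):
    print(word, ratio)
```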

Calculating the vowel to word length ratio in a list of words