Calculating the vowel-to-word-length ratio in a list of words

Time: 2013-04-22 20:29:54

Tags: python

Here is my function:

def calcVowelProportion(wordList):
    """
    Calculates the proportion of vowels in each word in wordList.
    """

    VOWELS = 'aeiou'
    ratios = []

    for word in wordList:
        numVowels = 0
        for char in word:
            if char in VOWELS:
                numVowels += 1
        ratios.append(numVowels/float(len(word)))

    return ratios

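I am working with a list of more than 87,000 words, and this algorithm is noticeably slow.

Is there a better way?

Edit:

I tested the algorithms suggested by @ExP using the following class: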

    import time

    class vowelProportions(object):
        """
        A series of methods that all calculate the vowel/word length ratio
        in a list of words.
        """

        WORDLIST_FILENAME = "words_short.txt"

        def __init__(self):
            self.wordList = self.buildWordList()
            print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
            print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
            print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
            print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))

        def buildWordList(self):
            inFile = open(self.WORDLIST_FILENAME, 'r', 0)
            wordList = []
            for line in inFile:
                wordList.append(line.strip().lower())
            return wordList

        def cvpOriginal(self, wordList):
            """ My original, slow algorithm"""
            VOWELS = 'aeiou'
            ratios = []

            for word in wordList:
                numVowels = 0
                for char in word:
                    if char in VOWELS:
                        numVowels += 1
                ratios.append(numVowels/float(len(word)))

            return ratios

        def cvpGenerator(self, wordList):
            """ Using a generator expression """
            return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]

        def cvpCount(self, wordList):
            """ Using str.count() """
            return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]

        def cvpTranslate(self, wordList):
            """ Using str.translate() """
            return [len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))/float(len(word)) for word in wordList]

        def timeFunc(self, func, *args):
            start = time.clock()
            func(*args)
            return time.clock() - start

        def calcMeanTime(self, numTrials, func, *args):
            times = [self.timeFunc(func, *args) for x in range(numTrials)]
            return sum(times)/len(times)

The output (for a list of 200 words) is:

Original: 0.0005613667
Generator: 0.0008402738
Count: 0.0012531976
Translate: 0.0003343548

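Surprisingly, Generator and Count are even slower than the original (please let me know if my implementation is incorrect).

I would like to test @John's solution, but I know nothing about trees.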

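6 Answers:

Answer 0: (score: 4)

You should optimize the innermost loop.

I am fairly sure there are several alternatives. Here are the ones I can think of right now. I am not sure how they compare in speed (relative to each other and to your solution).

  • Using a generator expression: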

    numVowels = sum(x in 'aeiou' for x in word)
    
  • Using str.count():

    numVowels = sum(word.count(x) for x in 'aeiou')
    
  • Using str.translate() (assuming there are no uppercase letters or special characters):

    numVowels = len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))
    

With any of these, you can even write the whole function in a single line, without list.append().
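As a rough sketch of that point, here are the three variants as one-line functions. This is written for Python 3, where str.translate() takes a table built with str.maketrans() rather than the two-argument form used above. The function names and sample words are made up for illustration, and nothing here says which variant is fastest.

```python
VOWELS = 'aeiou'
# Python 3 deletion table: every lowercase consonant maps to None.
DELETE_CONSONANTS = str.maketrans('', '', 'bcdfghjklmnpqrstvwxyz')

def ratios_generator(word_list):
    # True counts as 1, so sum() counts the vowels in each word.
    return [sum(c in VOWELS for c in w) / len(w) for w in word_list]

def ratios_count(word_list):
    # One str.count() call per vowel instead of one test per character.
    return [sum(w.count(v) for v in VOWELS) / len(w) for w in word_list]

def ratios_translate(word_list):
    # Assumes lowercase ASCII letters only: whatever survives deletion is a vowel.
    return [len(w.translate(DELETE_CONSONANTS)) / len(w) for w in word_list]

sample = ['vowel', 'rhythm', 'queue']  # hypothetical sample words
print(ratios_generator(sample))  # [0.4, 0.0, 0.8]
```

All three functions assume non-empty words; an empty string would raise ZeroDivisionError.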

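I'd be curious to know which is the fastest.

Answer 1: (score: 4)

Since you only care about the ratio of vowels to letters in each word, you could first replace all vowels with a. Then there are a couple of things you could try that may be faster:

  • You test against one letter at each step instead of five. This is bound to be faster.
  • You could sort the whole list and search for the points where a word goes from vowels (now all represented as a) to non-vowels. This is a tree structure: the number of letters in a word is the level in the tree, and the number of vowels is the number of left branches.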
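A minimal sketch of the first point, in Python 3: collapse every vowel to a with a translation table, then count a single character. The helper names are hypothetical, and whether this actually beats the other approaches would need to be measured. The sorting/tree idea from the second point is not covered here.

```python
# Map each vowel to 'a' so that only one character has to be counted.
VOWELS_TO_A = str.maketrans('aeiou', 'aaaaa')

def vowel_ratio(word):
    collapsed = word.translate(VOWELS_TO_A)
    return collapsed.count('a') / len(word)

def vowel_ratios(word_list):
    return [vowel_ratio(w) for w in word_list]

print(vowel_ratios(['queue', 'sky', 'banana']))  # [0.8, 0.0, 0.5]
```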

Answer 2: (score: 1)

Use a regular expression to match the vowels and count the number of matches.

>>> import re
>>> s = 'supercalifragilisticexpialidocious'
>>> len(re.findall('[aeiou]', s))
16
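Applied to a whole word list, this might look like the following Python 3 sketch. Compiling the pattern once and the names VOWEL_RE and vowel_ratios are my own additions, not part of the answer above.

```python
import re

VOWEL_RE = re.compile(r'[aeiou]')  # compiled once, reused for every word

def vowel_ratios(word_list):
    # findall() returns one match per vowel, so its length is the vowel count.
    return [len(VOWEL_RE.findall(w)) / len(w) for w in word_list]

print(vowel_ratios(['supercalifragilisticexpialidocious', 'sky']))
```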

Answer 3: (score: 0)

VOWELS = 'aeiou'
ratios = []

for word in wordList:
    numVowels = 0
    for letter in VOWELS:
        numVowels += word.count(letter)
    ratios.append(numVowels/float(len(word)))

Fewer decisions should mean less time. This also uses built-ins, which I believe run faster.

Answer 4: (score: 0)

import timeit

words = 'This is a test string'

def vowelProportions(words):
    counts, vowels = {}, 'aeiou'
    wordLst = words.lower().split()
    for word in wordLst:
        counts[word] = float(sum(word.count(v) for v in vowels)) / len(word)
    return counts

def f():
    return vowelProportions(words)

print timeit.timeit(stmt = f, number = 17400) # 5 (len of words) * 17400 = 87,000
# 0.838676

Answer 5: (score: 0)

Here is how to compute it with a single command line on Linux:

cat wordlist.txt | tr -d aeiouAEIOU | paste - wordlist.txt | gawk 'BEGIN { FS = "\t" } { RATIO = length($1) / length($2); print $2, RATIO }'

Output:

aa 0
ab 0.5
abs 0.666667

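Note: each line of wordlist.txt contains one word. A blank line will cause a division-by-zero error. The value printed is the length of the word with its vowels deleted divided by its full length, so the vowel ratio is 1 minus that value.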
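For comparison, a rough Python 3 counterpart that skips blank lines instead of failing on them. The function name and the inline sample lines are hypothetical, and unlike the pipeline it yields the proportion of vowels directly.

```python
def vowel_ratios(lines):
    """Yield (word, vowel ratio) pairs, skipping blank lines."""
    for line in lines:
        word = line.strip()
        if not word:  # a blank line would otherwise divide by zero
            continue
        num_vowels = sum(word.count(v) for v in 'aeiouAEIOU')
        yield word, num_vowels / len(word)

# Any iterable of lines works here, including an open file object.
for word, ratio in vowel_ratios(['aa\n', '\n', 'ab\n', 'abs\n']):
    print(word, ratio)
```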