Here is my function:
def calcVowelProportion(wordList):
    """
    Calculates the proportion of vowels in each word in wordList.
    """
    VOWELS = 'aeiou'
    ratios = []
    for word in wordList:
        numVowels = 0
        for char in word:
            if char in VOWELS:
                numVowels += 1
        ratios.append(numVowels/float(len(word)))
    return ratios
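I'm now working with a list of more than 87,000 words, and this algorithm is clearly very slow.
Is there a better way to do this?
EDIT:
I tested the algorithms suggested by @ExP using the following class: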
import time

class vowelProportions(object):
    """
    A series of methods that all calculate the vowel/word length ratio
    in a list of words.
    """
    WORDLIST_FILENAME = "words_short.txt"

    def __init__(self):
        self.wordList = self.buildWordList()
        print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
        print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
        print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
        print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))

    def buildWordList(self):
        inFile = open(self.WORDLIST_FILENAME, 'r', 0)
        wordList = []
        for line in inFile:
            wordList.append(line.strip().lower())
        return wordList

    def cvpOriginal(self, wordList):
        """ My original, slow algorithm"""
        VOWELS = 'aeiou'
        ratios = []
        for word in wordList:
            numVowels = 0
            for char in word:
                if char in VOWELS:
                    numVowels += 1
            ratios.append(numVowels/float(len(word)))
        return ratios

    def cvpGenerator(self, wordList):
        """ Using a generator expression """
        return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]

    def cvpCount(self, wordList):
        """ Using str.count() """
        return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]

    def cvpTranslate(self, wordList):
        """ Using str.translate() """
        # The deletion string must list every consonant, including 'v' and 'w'.
        return [len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))/float(len(word)) for word in wordList]

    def timeFunc(self, func, *args):
        start = time.clock()
        func(*args)
        return time.clock() - start

    def calcMeanTime(self, numTrials, func, *args):
        times = [self.timeFunc(func, *args) for x in range(numTrials)]
        return sum(times)/len(times)
The output (for a list of 200 words) is:
Original: 0.0005613667
Generator: 0.0008402738
Count: 0.0012531976
Translate: 0.0003343548
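Surprisingly, Generator and Count are even slower than the original (please let me know if my implementation is incorrect).
I would like to test @John's solution, but I don't know anything about trees.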
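For anyone repeating this comparison on Python 3, where time.clock() has been removed, a minimal sketch built on the standard timeit module might look like the following. The function names (cvp_original, cvp_generator, mean_time) and the sample word list are assumptions for illustration, not part of the class above.

```python
import timeit

VOWELS = "aeiou"


def cvp_original(word_list):
    """Nested-loop version, equivalent to the original algorithm."""
    ratios = []
    for word in word_list:
        num_vowels = 0
        for char in word:
            if char in VOWELS:
                num_vowels += 1
        ratios.append(num_vowels / len(word))
    return ratios


def cvp_generator(word_list):
    """Generator-expression version."""
    return [sum(char in VOWELS for char in word) / len(word)
            for word in word_list]


def mean_time(func, word_list, num_trials=1000):
    """Average number of seconds per call over num_trials calls."""
    total = timeit.timeit(lambda: func(word_list), number=num_trials)
    return total / num_trials


# Hypothetical sample data (200 words); substitute the real word list.
sample_words = ["apple", "sky", "queue", "banana"] * 50

if __name__ == "__main__":
    for func in (cvp_original, cvp_generator):
        print(func.__name__, mean_time(func, sample_words))
```

timeit.timeit() runs the callable num_trials times in a single timed loop, so the per-call overhead of the timer itself is not added to every measurement as it is in timeFunc.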
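Answer 0 (score: 4)
You should optimize the innermost loop.
I'm fairly sure there are several alternative approaches. Here is what I can think of right now. I'm not sure how they compare in speed (relative to each other and to your solution).
Using a generator expression: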
numVowels = sum(x in 'aeiou' for x in word)
Using str.count():
numVowels = sum(word.count(x) for x in 'aeiou')
Using str.translate() (assuming there are no uppercase letters or special symbols):
numVowels = len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))
With any of these, you can even write the whole function in one line, without list.append().
I'd be curious to know which one is fastest.
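As a rough sketch of those one-line versions, the three alternatives could be written as complete functions like this. It targets Python 3, where str.translate() takes a table built with str.maketrans() instead of the Python 2 deletion-string argument. The function names are placeholders, and the translate variant deletes the vowels rather than the consonants (my substitution, so it does not rely on words containing only lowercase a-z).

```python
VOWELS = "aeiou"
# Table that deletes the five lowercase vowels from a string.
DELETE_VOWELS = str.maketrans("", "", VOWELS)


def ratios_generator(word_list):
    """Count vowels with a generator expression."""
    return [sum(c in VOWELS for c in w) / len(w) for w in word_list]


def ratios_count(word_list):
    """Count vowels with one str.count() call per vowel."""
    return [sum(w.count(v) for v in VOWELS) / len(w) for w in word_list]


def ratios_translate(word_list):
    """Count vowels as the length lost when the vowels are deleted."""
    return [(len(w) - len(w.translate(DELETE_VOWELS))) / len(w)
            for w in word_list]


words = ["apple", "sky", "queue"]
print(ratios_generator(words))
print(ratios_count(words))
print(ratios_translate(words))
```

All three return the same list for the same input, so they can be swapped freely when timing them.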
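Answer 1 (score: 4)
Since you only care about the ratio of vowels to letters in each word, you could first replace all vowels with a. Now you can try a few things that might be faster:
[...] a) to the point where the non-vowels begin. This is a tree structure. The number of letters in the word is the level of the tree. The number of vowels is the number of left branches.
Answer 2 (score: 1)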
Use a regular expression to match the vowels and count the number of matches.
>>> import re
>>> s = 'supercalifragilisticexpialidocious'
>>> len(re.findall('[aeiou]', s))
16
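Applied to a whole word list, the same idea might look like the sketch below (Python 3). The function name ratios_regex and the precompiled pattern VOWEL_RE are assumptions for illustration. Compiling the pattern once avoids looking it up again for each word.

```python
import re

# Compile once and reuse for every word in the list.
VOWEL_RE = re.compile(r"[aeiou]")


def ratios_regex(word_list):
    """Vowel/length ratio per word, counting regex matches."""
    return [len(VOWEL_RE.findall(word)) / len(word) for word in word_list]


print(ratios_regex(["supercalifragilisticexpialidocious", "sky"]))
```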
Answer 3 (score: 0)
for word in wordlist:
    numVowels = 0
    for letter in VOWELS:
        numVowels += word.count(letter)
    ratios.append(numVowels/float(len(word)))
Fewer decisions should mean less time. It also uses built-ins, which I believe run faster.
Answer 4 (score: 0)
import timeit

words = 'This is a test string'

def vowelProportions(words):
    counts, vowels = {}, 'aeiou'
    wordLst = words.lower().split()
    for word in wordLst:
        counts[word] = float(sum(word.count(v) for v in vowels)) / len(word)
    return counts

def f():
    return vowelProportions(words)

print timeit.timeit(stmt = f, number = 17400) # 5 (len of words) * 17400 = 87,000
# 0.838676
Answer 5 (score: 0)
Here is how to compute it with a single command line on Linux:
cat wordlist.txt | tr -d aeiouAEIOU | paste - wordlist.txt | gawk 'BEGIN { FS="\t" } { RATIO = length($1)/ length($2); print $2, RATIO }'
Output:
aa 0
ab 0.5
abs 0.666667
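Note: each line of wordlist.txt contains one word. An empty line will cause a division-by-zero error. Because tr deletes the vowels, the printed ratio is the proportion of non-vowel letters; subtract it from 1 to get the vowel proportion.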
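For comparison, a Python 3 sketch that guards against the empty-line case could look like this. The function name vowel_ratios_from_lines and the in-memory sample lines are assumptions for illustration. It reports the vowel proportion directly, so aa gives 1.0, whereas the shell output above shows the share of letters left after the vowels are deleted.

```python
def vowel_ratios_from_lines(lines):
    """Yield (word, vowel ratio) pairs, skipping blank lines."""
    for line in lines:
        word = line.strip()
        if not word:
            continue  # a blank line would otherwise divide by zero
        num_vowels = sum(word.count(v) for v in "aeiouAEIOU")
        yield word, num_vowels / len(word)


# Hypothetical stand-in for the lines of wordlist.txt.
sample_lines = ["aa\n", "\n", "ab\n", "abs\n"]
for word, ratio in vowel_ratios_from_lines(sample_lines):
    print(word, ratio)
```

To run it on a real file, pass an open file object instead of the list, for example `with open("wordlist.txt") as f:` followed by a loop over `vowel_ratios_from_lines(f)`.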