How can I make this code run faster? (Searching a large corpus of text)

Asked: 2011-04-22 20:56:26

Tags: python optimization time performance

In Python, I've created a text generator that works on certain parameters, but my code is slow most of the time and performs below my expectations. I expect one sentence every 3-4 minutes, but the program fails to comply if the database it works on is large - I'm using an 18-book corpus from Project Gutenberg, and since I will create my own custom corpus and add further books, performance is vital. The algorithm and implementation are below:

Algorithm

1- Enter the trigger sentence - only once, at the beginning of the program -

2- Get the longest word in the trigger sentence

3- Find all sentences in the corpus that contain the word from step 2

4- Randomly pick one of those sentences

5- Get the sentence that follows the sentence picked in step 4 (named sentA to resolve ambiguity in the description) - as long as sentA is longer than 40 characters -

6- Go to step 2; the trigger sentence is now the sentA of step 5

Implementation

from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user

previousLongestWord = ""

listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus

sentenceAppender = ""

longestWord = ""

#this function is not mine; code courtesy of Dave Kirby, found on the internet - removes duplicates from a list while preserving order
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]


doappend = corpusSentences.append

def appending():

    for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)


appending()
sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):


    for sentence in corpusSentences:
        if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)


def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):

    while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence

    sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set

    setOfValidWords = [] #set for words in a sentence that exists in a corpus                    

    split_str = triggerSentence.split()#split the sentence into words

    setOfValidWords = [word for word in split_str if listOfWords.count(word)]

    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))

    longestWord = sortedSetOfValidWords[-1]

    findLongestWord(longestWord)

    previousLongestWord = longestWord

    getSentence(longestWord, sentencesContainingLongestWord)

    triggerSentence = choice(sentencesContainingLongestWord)

    sentenceIndex = corpusSentences.index(triggerSentence)

    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)

    triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence

    print triggerSentence
    print "\n"

    corpusSentences.remove(triggerSentence)#if you want the index numbers to stay in line with the actual Gutenberg numbering, you can remove this line


print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors

The computer I run the code on is somewhat old: a dual-core CPU bought in February 2006 and 2x512 MB of RAM bought in September 2004, so I'm not sure whether my implementation is bad or the hardware is to blame for the slow runtime. Any ideas on how to rescue this from its hazardous form? Thanks in advance.

2 answers:

Answer 0 (score: 4)

I think my first suggestion has to be: think carefully about what your routines do, and make sure their names describe that. Currently you have things like:

  • arraySorter neither deals with arrays nor sorts (it is an implementation of nub)
  • findLongestWord counts things or selects words by criteria not present in the algorithm description, yet ultimately does nothing at all, because longestWord is a local variable (a parameter, as it were) - see the sketch after this list
  • getSentence appends an arbitrary number of sentences to a list
  • appending sounds like it might be a state checker, but operates only through side effects
  • there is considerable confusion between local and global variables; for example, the global variable sentenceAppender is never used, nor is it an actor (e.g. a function) as the name suggests
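
To make the findLongestWord point concrete - this is a sketch based on the question's code, not a full fix, and it still inherits the expensive listOfWords.count calls - rebinding a parameter inside a function only changes the local name, so the function has to return its result and the caller has to use it:

def findLongestWord(longestWord, sortedSetOfValidWords, previousLongestWord):
    #skip the candidate if it occurs only once in the corpus or repeats the previous pick
    if listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper():
        longestWord = sortedSetOfValidWords[-2]
        if listOfWords.count(longestWord) == 1:
            longestWord = sortedSetOfValidWords[-3]
    return longestWord

#the caller must then rebind the name itself:
#longestWord = findLongestWord(longestWord, sortedSetOfValidWords, previousLongestWord)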

For the task itself, what you really need are indices. It might be overkill to index every word - technically, you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here; the second tool is lists. Once you have those indices, looking up a random sentence containing any given word needs only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence-length restriction.
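
As a rough illustration of that indexing idea (not the answerer's code; the names buildIndex, wordIndex, and randomSentenceWith are made up for this sketch), something along these lines replaces the repeated corpus scans with a one-time pass:

from collections import defaultdict
from random import choice
from nltk.corpus import gutenberg

def buildIndex(sentences):
    wordIndex = defaultdict(list) #word -> indices of the sentences containing it
    for position, words in enumerate(sentences):
        for word in set(words):
            wordIndex[word].append(position)
    return wordIndex

listOfSents = gutenberg.sents()
wordIndex = buildIndex(listOfSents) #built once, at startup

def randomSentenceWith(word):
    #one dictionary lookup, one random.choice, one list lookup - no corpus scan
    return " ".join(listOfSents[choice(wordIndex[word])])

Replacing the linear scans in getSentence and the listOfWords.count calls with lookups in such an index is where virtually all of the speedup would come from.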

This example should serve as a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.

Answer 1 (score: 1)

Maybe Psyco could speed up execution?
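
For reference, enabling Psyco is a two-line change at the top of the script (assuming a 32-bit Python 2 interpreter, the only environment Psyco supports):

import psyco
psyco.full() #JIT-compile all functions from here on

As the other answer notes, though, a JIT only speeds up each pass over the corpus; it does not remove the need to pass over it at all.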