How can I make this code run faster? (Searching a large corpus of text)

Asked: 2011-04-22 20:56:26

Tags: python optimization time performance

In Python, I've created a text generator that works on certain parameters, but my code is slow most of the time and performs below my expectations. I expect one sentence every 3-4 minutes, but the program fails to comply if the database it works on is large - I'm using an 18-book corpus from Project Gutenberg, and since I will create my own custom corpus and add further books, performance is vital. The algorithm and implementation are below:

Algorithm

1- Enter the trigger sentence - only once, at the beginning of the program -

2- Get the longest word in the trigger sentence

3- Find all sentences in the corpus that contain the word from step 2

4- Randomly pick one of those sentences

5- Get the sentence that follows the sentence picked in step 4 (named sentA to resolve ambiguity in the description) - as long as sentA is longer than 40 characters -

6- Go to step 2; the trigger sentence is now the sentA of step 5

Implementation

from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user

previousLongestWord = ""

listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus

sentenceAppender = ""

longestWord = ""

#this function is not mine; code courtesy of Dave Kirby, found on the internet - removes duplicates from a list while preserving order
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]


doappend = corpusSentences.append

def appending():

    for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)


appending()
sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):


    for sentence in corpusSentences:
        if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)


def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):

    while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence

    sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set

    setOfValidWords = [] #set for words in a sentence that exists in a corpus                    

    split_str = triggerSentence.split()#split the sentence into words

    setOfValidWords = [word for word in split_str if listOfWords.count(word)]

    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))

    longestWord = sortedSetOfValidWords[-1]

    findLongestWord(longestWord)

    previousLongestWord = longestWord

    getSentence(longestWord, sentencesContainingLongestWord)

    triggerSentence = choice(sentencesContainingLongestWord)

    sentenceIndex = corpusSentences.index(triggerSentence)

    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)

    triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence

    print triggerSentence
    print "\n"

    corpusSentences.remove(triggerSentence)#if you want the index numbers to stay in line with the actual Gutenberg numbering, you can remove this line


print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors

The computer I run the code on is somewhat old: a dual-core CPU bought in February 2006 and 2x512 MB of RAM bought in September 2004, so I'm not sure whether my implementation is bad or the hardware is to blame for the slow runtime. Any ideas on how to rescue this from its hazardous form? Thanks in advance.

2 answers:

Answer 0 (score: 4)

I think my first suggestion has to be: think carefully about what your routines do, and make sure their names describe that. Currently you have things like:

  • arraySorter neither deals with arrays nor sorts (it is an implementation of nub)
  • findLongestWord counts things or selects words by criteria not present in the algorithm description, yet ultimately does nothing at all, because longestWord is a local variable (a parameter, as it were) - see the sketch after this list
  • getSentence appends an arbitrary number of sentences to a list
  • appending sounds like it might be a state checker, but operates only through side effects
  • there is considerable confusion between local and global variables; for example, the global variable sentenceAppender is never used, nor is it an actor (e.g. a function) as the name suggests
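
To make the findLongestWord point concrete - this is a sketch based on the question's code, not a full fix, and it still inherits the expensive listOfWords.count calls - rebinding a parameter inside a function only changes the local name, so the function has to return its result and the caller has to use it:

def findLongestWord(longestWord, sortedSetOfValidWords, previousLongestWord):
    #skip the candidate if it occurs only once in the corpus or repeats the previous pick
    if listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper():
        longestWord = sortedSetOfValidWords[-2]
        if listOfWords.count(longestWord) == 1:
            longestWord = sortedSetOfValidWords[-3]
    return longestWord

#the caller must then rebind the name itself:
#longestWord = findLongestWord(longestWord, sortedSetOfValidWords, previousLongestWord)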

For the task itself, what you really need are indices. It might be overkill to index every word - technically, you should only need index entries for words that occur as the longest word of a sentence. Dictionaries are your primary tool here; the second tool is lists. Once you have those indices, looking up a random sentence containing any given word needs only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence-length restriction.
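
As a rough illustration of that indexing idea (not the answerer's code; the names buildIndex, wordIndex, and randomSentenceWith are made up for this sketch), something along these lines replaces the repeated corpus scans with a one-time pass:

from collections import defaultdict
from random import choice
from nltk.corpus import gutenberg

def buildIndex(sentences):
    wordIndex = defaultdict(list) #word -> indices of the sentences containing it
    for position, words in enumerate(sentences):
        for word in set(words):
            wordIndex[word].append(position)
    return wordIndex

listOfSents = gutenberg.sents()
wordIndex = buildIndex(listOfSents) #built once, at startup

def randomSentenceWith(word):
    #one dictionary lookup, one random.choice, one list lookup - no corpus scan
    return " ".join(listOfSents[choice(wordIndex[word])])

Replacing the linear scans in getSentence and the listOfWords.count calls with lookups in such an index is where virtually all of the speedup would come from.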

This example should serve as a good object lesson that modern hardware or optimizers like Psyco do not solve algorithmic problems.

Answer 1 (score: 1)

Maybe Psyco could speed up execution?
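
For reference, enabling Psyco is a two-line change at the top of the script (assuming a 32-bit Python 2 interpreter, the only environment Psyco supports):

import psyco
psyco.full() #JIT-compile all functions from here on

As the other answer notes, though, a JIT only speeds up each pass over the corpus; it does not remove the need to pass over it at all.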