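Algorithm
1-Enter the trigger sentence -only once, at the start of the program-
2-Get the longest word in the trigger sentence
3-Find all sentences of the corpus that contain the word from step 2
4-Randomly select one of those sentences
5-Get the sentence that follows the one selected in step 4 (named sentA to resolve the ambiguity in the description) -as long as sentA is longer than 40 characters-
6-Go to step 2; the trigger sentence is now sentA from step 5
Implementation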
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user
previousLongestWord = ""
listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus
sentenceAppender = ""
longestWord = ""

#this function is not mine, code courtesy of Dave Kirby, found on the internet about sorting list without duplication speed tricks
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]

doappend = corpusSentences.append

def appending():
    for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)

appending()
sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):
    for sentence in corpusSentences:
        if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)

def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):
    while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence
    sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set
    setOfValidWords = [] #set for words in a sentence that exists in a corpus
    split_str = triggerSentence.split()#split the sentence into words
    setOfValidWords = [word for word in split_str if listOfWords.count(word)]
    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))
    longestWord = sortedSetOfValidWords[-1]
    findLongestWord(longestWord)
    previousLongestWord = longestWord
    getSentence(longestWord, sentencesContainingLongestWord)
    triggerSentence = choice(sentencesContainingLongestWord)
    sentenceIndex = corpusSentences.index(triggerSentence)
    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)
    triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence
    print triggerSentence
    print "\n"
    corpusSentences.remove(triggerSentence)#in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers

print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors
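The computer I run the code on is a bit old: the dual-core CPU was bought in February 2006 and the 2x512 RAM in September 2004, so I'm not sure whether my implementation is bad or the hardware is the reason for the slow run time. Any ideas on how I can rescue this from its hazardous form? Thanks in advance.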
Answer 0 (score: 4)
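I think my first piece of advice has to be: think carefully about what your routines do, and make sure their names describe that. Currently you have things like:
arraySorter, which neither deals with arrays nor sorts (it is an implementation of nub, that is, order-preserving duplicate removal)
findLongestWord, which counts things or selects words by criteria that are absent from the algorithm description, yet ends up doing nothing at all because longestWord is a local variable (a parameter, in fact)
getSentence, which appends an arbitrary number of sentences to a list
appending, which sounds like it might be a state checker but operates only through side effects
sentenceAppender, a global that is never used, nor is it an actor (for example, a function) as its name suggests
For the task itself, what you really need are indices. Indexing every word may be overkill: technically you only need index entries for words that occur as the longest word of some sentence. Dictionaries are your primary tool here, and lists are the second. Once you have those indices, finding a random sentence that contains any given word takes only a dictionary lookup, a random.choice, and a list lookup. Perhaps a few list lookups, given the sentence-length restriction.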
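The index idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: it is written for Python 3 (the script in the question is Python 2), a small hard-coded TOY_SENTS list stands in for gutenberg.sents(), and the names build_index, longest_known_word, next_sentence and MIN_LENGTH are hypothetical. It also matches whole tokens, whereas the question's getSentence matches substrings, and it indexes every word rather than only the longest ones.

```python
import random
from collections import defaultdict

# Toy stand-in for gutenberg.sents(): each sentence is a list of tokens.
# (Assumption: the real script would pass gutenberg.sents() instead.)
TOY_SENTS = [
    ["The", "whale", "surfaced", "beside", "the", "ship", "."],
    ["Nobody", "aboard", "had", "expected", "such", "an", "enormous",
     "creature", "to", "appear", "."],
    ["The", "captain", "ordered", "everyone", "to", "remain", "calm", "."],
    ["Afterwards", "the", "enormous", "creature", "vanished", "beneath",
     "the", "waves", "again", "."],
    ["It", "was", "quiet", "."],
]

MIN_LENGTH = 40  # the question's code accepts sentA of at least 40 characters


def build_index(sents):
    """Join each sentence once and map every word to the positions
    of the sentences that contain it."""
    sentences = [" ".join(tokens) for tokens in sents]
    index = defaultdict(list)
    for position, tokens in enumerate(sents):
        for word in set(tokens):  # record each sentence once per word
            index[word].append(position)
    return sentences, index


def longest_known_word(sentence, index, previous=None):
    """Return the longest word of `sentence` that occurs in the corpus,
    ignoring the previously used word (compared case-insensitively)."""
    candidates = [
        w for w in sentence.split()
        if w in index and (previous is None or w.upper() != previous.upper())
    ]
    return max(candidates, key=len) if candidates else None


def next_sentence(trigger, sentences, index, previous=None, rng=random):
    """One round of steps 2-5. Returns (sentA, word); sentA is None
    when no suitable follow-up sentence exists."""
    word = longest_known_word(trigger, index, previous)
    if word is None:
        return None, None
    # Keep only positions whose following sentence exists and is long enough.
    usable = [
        p for p in index[word]
        if p + 1 < len(sentences) and len(sentences[p + 1]) >= MIN_LENGTH
    ]
    if not usable:
        return None, word
    return sentences[rng.choice(usable) + 1], word


if __name__ == "__main__":
    sentences, index = build_index(TOY_SENTS)
    print(next_sentence("an enormous storm", sentences, index))
```

Each round then costs a dictionary lookup, a filter over the few positions stored for that word, and one random.choice, instead of a scan over the whole corpus. Note also that longest_known_word returns its result; rebinding a parameter inside the function, as findLongestWord does in the question, has no effect on the caller.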
This example should serve as a good object lesson that neither modern hardware nor optimizers like Psyco solve algorithmic problems.
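To make the algorithmic point concrete: the question's main loop calls listOfWords.count(word) for every word of every trigger sentence, and each such call walks the entire word list. The sketch below, in Python 3, contrasts that with a table built once. The toy words list stands in for gutenberg.words(), and the function names valid_words_by_scan and valid_words_by_lookup are hypothetical.

```python
from collections import Counter

# Toy stand-in for gutenberg.words() (assumption: the real list is far larger).
words = ["the", "whale", "the", "ship", "whale", "the", "sea"] * 1000


def valid_words_by_scan(sentence, words):
    """The question's approach: list.count scans the whole word list
    once per word of the sentence."""
    return [w for w in sentence.split() if words.count(w)]


def valid_words_by_lookup(sentence, counts):
    """Same result from a precomputed table: one hash lookup per word."""
    return [w for w in sentence.split() if counts[w]]


counts = Counter(words)  # built once, in a single pass over the corpus

if __name__ == "__main__":
    sentence = "the white whale"
    print(valid_words_by_scan(sentence, words))
    print(valid_words_by_lookup(sentence, counts))
```

Both functions return the same list; the difference is that the scan does work proportional to the corpus size for every query word, which faster hardware or a JIT can only shrink by a constant factor.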
Answer 1 (score: 1)
Maybe Psyco would speed up the execution?