Python code flow does not work as expected?

Asked: 2010-08-26 03:42:08

Tags: python parsing text nltk

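I am trying to process various texts with regular expressions and Python's NLTK, following the book at http://www.nltk.org/book. I am trying to create a random text generator and I have a small problem. First, here is my code flow: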

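  1. Enter a sentence as input (this is called the trigger string and is assigned to a variable).

  2. Get the longest word in the trigger string.

  3. Search the entire Project Gutenberg database for sentences that contain this word, regardless of case.

  4. Return the longest sentence that contains the word from step 3.

  5. Append the sentences from step 1 and step 4 together.

  6. Assign the sentence from step 4 as the new "trigger" sentence and repeat the process. Note that I then have to get the longest word in the second sentence and continue like that, and so on.

So far, I have only been able to do this once. When I try to keep it going, the program just keeps printing the first sentence my search yields. It should actually look for the longest word in this new sentence and keep applying the code flow described above.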

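Below is my code, along with a sample input and output:

Sample input

    "Thane of code"

Sample output

    "Thane of code Norway himselfe , with terrible numbers , Assisted by that most disloyall Traytor , The Thane of Cawdor , began a dismall Conflict , Till that Bellona ' s Bridegroome , lapt in proofe , Confronted him with selfe - comparisons , Point against Point , rebellious Arme ' gainst Arme , Curbing his lauish spirit : and to conclude , The Victorie fell on vs"

This should actually take the sentence that starts with 'Norway himselfe ....', look for the longest word in it, carry out the steps above, and so on, but it doesn't. Any suggestions? Thanks.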

    import nltk
    
    from nltk.corpus import gutenberg
    
    triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
    
    split_str = triggerSentence.split()#split the sentence into words
    
    longestLength = 0
    
    longestString = ""
    
    montyPython = 1
    
    while montyPython:
    
        #code to find the longest word in the trigger sentence input
        for piece in split_str:
            if len(piece) > longestLength:
                longestString = piece
                longestLength = len(piece)
    
    
        listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
    
        listOfWords = gutenberg.words()# all words in gutenberg books -list format-
        # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
        lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way. 
    
        longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
        #get longest sentence -list format with every word of sentence being an actual element-
    
        longestSent=[longestSentence]
    
        for word in longestSent:#convert the list longestSentence to an actual string
            sstr = " ".join(word)
        print triggerSentence + " "+ sstr
        triggerSentence = sstr
    

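4 Answers:

Answer 0 (score: 1)

How about this?

  1. You find the longest word in the trigger.
  2. You find the longest word in the longest sentence containing the word found in 1.
  3. The word from 1 is the longest word of the sentence from 2.
  4. What happens? Hint: the answer starts with "Infinite". To correct the problem, you may find a set of lower-cased words useful.

By the way, when do you expect montyPython to become False so that the program finishes?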
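
One way to read the hint about a set of lower-cased words is sketched below. It is only an illustration under stated assumptions: the toy corpus `TOY`, the helper names `longest_unseen_word` and `chain`, and the `max_steps` cap are all hypothetical and not part of the original post. Each word is tried at most once, so the loop cannot keep returning to the same word forever.

```python
def longest_unseen_word(words, seen):
    """Return the longest lower-cased word not yet in `seen`, or None."""
    candidates = [w.lower() for w in words if w.lower() not in seen]
    return max(candidates, key=len) if candidates else None


def chain(trigger, sentences, max_steps=20):
    """Follow trigger -> longest matching sentence, never reusing a word."""
    seen = set()                      # lower-cased words already used as targets
    words = trigger.split()
    output = [trigger]
    for _ in range(max_steps):
        target = longest_unseen_word(words, seen)
        if target is None:
            break                     # every word of the current sentence was tried
        seen.add(target)
        matches = [s for s in sentences if target in [w.lower() for w in s]]
        if not matches:
            continue                  # word not in the corpus: try the next-longest
        longest = max(matches, key=len)   # length counted in tokens, as in the question
        if longest != words:
            words = longest
            output.append(" ".join(words))
    return output


# Hypothetical toy corpus: token lists, shaped like gutenberg.sents().
TOY = [
    ["Thane", "of", "Cawdor"],
    ["Cawdor", "began", "a", "dismall", "Conflict"],
    ["The", "Conflict", "ended", "when", "Norway", "fell", "at", "last"],
]

result = chain("Thane of code", TOY)
for line in result:
    print(line)
```

With the toy corpus the chain moves through three sentences and then stops by itself, because every remaining word only leads back to the last sentence.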

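Answer 1 (score: 1)

Rather than searching the entire corpus every time, it may be faster to build a single map from each word to the longest sentence containing that word. Here is my (untested) attempt at doing this.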

import collections
from nltk.corpus import gutenberg

def words_in(sentence):
    """Generate all words in the sentence (lower-cased)"""
    for word in sentence.split():
        word = word.strip('.,"\'-:;')
        if word:
            yield word.lower()

def make_sentence_map(sentences):
    """Construct a map from words to the longest sentence containing the word."""
    result = collections.defaultdict(str)
    for tokens in sentences:
        # gutenberg.sents() yields each sentence as a list of tokens,
        # so join the tokens back into a single string first
        sentence = ' '.join(tokens)
        for word in words_in(sentence):
            if len(sentence) > len(result[word]):
                result[word] = sentence
    return result

def generate_random_text(sentence, sentence_map):
    while True:
        yield sentence
        longest_word = max(words_in(sentence), key=len)
        sentence = sentence_map[longest_word]

sentence_map = make_sentence_map(gutenberg.sents())
for sentence in generate_random_text('Thane of code.', sentence_map): 
    print sentence
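
Because the generator above loops forever once a sentence maps back to itself, it can help to cap the output while experimenting. The sketch below repeats the word-to-sentence map on a small hypothetical corpus (`TOY_SENTS` and `generate_text` are illustrative names, not from the answer) and uses `itertools.islice` to take only the first few sentences. It assumes sentences arrive as token lists, as `gutenberg.sents()` provides them.

```python
import collections
import itertools


def words_in(sentence):
    """Generate all words in the sentence (lower-cased, punctuation stripped)."""
    for word in sentence.split():
        word = word.strip('.,"\'-:;')
        if word:
            yield word.lower()


def make_sentence_map(token_lists):
    """Map each word to the longest (by characters) sentence containing it."""
    result = collections.defaultdict(str)
    for tokens in token_lists:
        sentence = " ".join(tokens)
        for word in words_in(sentence):
            if len(sentence) > len(result[word]):
                result[word] = sentence
    return result


def generate_text(sentence, sentence_map):
    """Yield sentences until the longest word is unknown to the map."""
    while sentence:
        yield sentence
        words = list(words_in(sentence))
        if not words:
            return
        longest_word = max(words, key=len)
        # .get avoids inserting empty entries for unknown words
        sentence = sentence_map.get(longest_word, "")


# Hypothetical toy corpus standing in for gutenberg.sents().
TOY_SENTS = [
    ["The", "Thane", "of", "Cawdor", "began", "a", "dismall", "Conflict", "."],
    ["A", "dismall", "day", "."],
]

sentence_map = make_sentence_map(TOY_SENTS)
# Take at most four sentences; without the cap this would never finish.
lines = list(itertools.islice(generate_text("Thane of code.", sentence_map), 4))
for line in lines:
    print(line)
```

On this corpus the second sentence already maps back to itself, which is the same fixed-point behaviour the question runs into.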

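Answer 2 (score: 0)

You assign "split_str" outside the loop, so it gets the original value and then keeps it. You need to assign it at the beginning of the while loop, so that it changes on every pass.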

import nltk

from nltk.corpus import gutenberg

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str

longestLength = 0

longestString = ""

montyPython = 1

while montyPython:
    #so this is run every time through the loop
    split_str = triggerSentence.split()#split the sentence into words

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)


    listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-

    listOfWords = gutenberg.words()# all words in gutenberg books -list format-
    # I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way. 

    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
    #get longest sentence -list format with every word of sentence being an actual element-

    longestSent=[longestSentence]

    for word in longestSent:#convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print triggerSentence + " "+ sstr
    triggerSentence = sstr
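
The stale-variable effect described in this answer can be reproduced without NLTK. The following is a minimal sketch with made-up sentences; `longest_word` is a hypothetical helper, not a name from the original code.

```python
def longest_word(sentence):
    """Return the first longest whitespace-separated word of `sentence`."""
    return max(sentence.split(), key=len)


trigger = "Thane of code"
stale_words = trigger.split()   # computed once, as in the question's code

# Later the trigger is replaced, but stale_words still holds the old words.
trigger = "a considerably longer follow-up sentence"

stale_result = max(stale_words, key=len)   # still inspects the old sentence
fresh_result = longest_word(trigger)       # re-splits, so it sees the new one

print(stale_result)
print(fresh_result)
```

Wrapping the split and the search in one function makes it hard to reuse an outdated word list by accident.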

Answer 3 (score: 0)

Mr. Hankin's answer is more elegant, but the following is more in keeping with the approach you started with:

import sys
import nltk
from nltk.corpus import gutenberg

def longest_element(p):
    """return the first element of p which has the greatest len()"""
    max_len = 0
    elem = None
    for e in p:
        if len(e) > max_len:
            elem = e
            max_len = len(e)
    return elem

def downcase(p):
    """returns a list of words in p shifted to lower case"""
    return [word.lower() for word in p]


def unique_words():
    """it turns out unique_words was never referenced so this is here
       for pedagogy"""
    # there are 2.6 million words in the gutenberg corpus but only ~42k unique
    # ignoring case, let's pare that down a bit
    words = set()
    for word in gutenberg.words():
        words.add(word.lower())
    print 'gutenberg.words() has', len(words), 'unique caseless words'
    return words

print 'loading gutenburg corpus...'
sentences = []
for sentence in gutenberg.sents():
    sentences.append(downcase(sentence))

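trigger = sys.argv[1:]
if not trigger:
    sys.exit('usage: python nltkgut.py WORD [WORD ...]')
target = longest_element(trigger).lower()
last_target = None

# stop once the longest word no longer changes, i.e. a fixed point is reached
while target != last_target:
    matched_sentences = []
    for sentence in sentences:
        if target in sentence:
            matched_sentences.append(sentence)

    print '===', target, 'matched', len(matched_sentences), 'sentences'
    if not matched_sentences:
        break  # the target word does not occur anywhere in the corpus
    longestSentence = longest_element(matched_sentences)
    print ' '.join(longestSentence)

    trigger = longestSentence
    last_target = target
    target = longest_element(trigger).lower()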

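Given your example sentence, it reaches a fixed point in two cycles:

    $ python nltkgut.py Thane of code
    loading gutenburg corpus...
    === target thane matched 24 sentences
    norway himselfe , with terrible numbers , assisted by that most disloyall traytor , the thane of cawdor , began a dismall conflict , till that bellona ' s bridegroome , lapt in proofe , confronted him with selfe - comparisons , point against point , rebellious arme ' gainst arme , curbing his lauish spirit : and to conclude , the victorie fell on vs
    === target bridegroome matched 1 sentences
    norway himselfe , with terrible numbers , assisted by that most disloyall traytor , the thane of cawdor , began a dismall conflict , till that bellona ' s bridegroome , lapt in proofe , confronted him with selfe - comparisons , point against point , rebellious arme ' gainst arme , curbing his lauish spirit : and to conclude , the victorie fell on vs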

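Part of the problem with the response to your last question is that it did what you asked, but you asked a more specific question than the one you wanted answered. As a result, the response got bogged down in some rather complicated list expressions that I'm not sure you understood. I suggest that you make more liberal use of print statements, and that you don't import code if you don't know what it does. While unrolling the list expressions I found (as noted above) that you never used the corpus word list. Functions help too.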
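
As an illustration of unrolling a list expression, the sketch below rewrites the question's `max(... for ... if any(...), key=len)` search as an explicit loop with an optional print per match. The function name `longest_sentence_containing`, the `verbose` flag, and the toy corpus are assumptions made for this example. Unlike `max`, it returns None instead of raising when nothing matches.

```python
def longest_sentence_containing(word, sentences, verbose=False):
    """Return the first longest token list that contains `word`, ignoring case."""
    target = word.lower()
    best = None
    for sentence in sentences:
        lowered = [w.lower() for w in sentence]
        if target not in lowered:
            continue
        if verbose:
            # One line per match makes it easy to see what the search is doing.
            print("match (%d tokens): %s" % (len(sentence), " ".join(sentence)))
        if best is None or len(sentence) > len(best):
            best = sentence
    return best


# Hypothetical toy corpus: token lists, shaped like gutenberg.sents().
TOY = [
    ["A", "short", "Conflict"],
    ["The", "Thane", "of", "Cawdor", "began", "a", "dismall", "Conflict"],
    ["Cawdor", "fell"],
]

found = longest_sentence_containing("CONFLICT", TOY, verbose=True)
print(" ".join(found))
```

Setting `verbose=True` prints each matching sentence as it is found, which is the kind of visibility the one-line expression hides.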