Question

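Edit: This code has since been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools

I'm a linguist who has recently picked up Python, and I'm working on a project that aims to analyze poems automatically, including detecting the form of a poem. For example, if it finds a 10-syllable line with the stress pattern 0101010101, it should declare that the line is iambic pentameter. A poem with a 5-7-5 syllable pattern would be a haiku.

I'm using the following code, which is part of a larger script, but I have a number of problems, listed below the program:

corpus in the script is simply the raw text input of the poem.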

import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit

...

def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

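    def nsyl(word):
        # Returns the stress digits of a word as a string, e.g. '01',
        # or 0 if the word is not in the CMU dictionary.
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            # max() picks one of the listed pronunciations; each phoneme is a string such as 'AH0'
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            # keep only the stress digits and treat secondary stress (2) as primary (1)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
            #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print(a, nsyl(a), end=" ")
            sum1 = sum1 + len(str(nsyl(a)))

    print("\nTotal syllables:", sum1)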

I guess the output I want would look like this:

1101111101

0101111001

1101010111

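The first problem is that I lose the line breaks during tokenization, and I really need the line breaks to be able to identify the form. That should not be too hard to deal with, though. The bigger problems are:

I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, because the syllable count of the line will probably come out too low.
In addition, the CMU dictionary often says that a word is stressed ('1') when it is not ('0'). That is why the output looks like this: 1101111101, when it should be the stress pattern of iambic pentameter: 0101010101
So how would I add some fudge factor so that the poem is still identified as iambic pentameter when it only approximates the pattern? There is no point in writing a function that identifies lines of alternating 0s and 1s when the CMU dictionary is not going to produce such a clean result. I suppose I'm asking how to code a "partial match" algorithm.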

Answer 1

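Welcome to Stack Overflow. I'm not that familiar with Python, but I see you have not received many answers yet, so I'll try to help with your queries.

First, some advice: if you keep your questions focused, your chances of getting answers improve greatly. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people who answer questions here.

Back on topic:

Before you revised your question, you asked how to make the code less messy. That's a big question, but you might want to use a top-down procedural approach and break your code into functional units:

Split the corpus into lines.
For each line: find the syllable length and stress pattern.
Classify the stress patterns.

You'll find that the first step is a single function call in Python:

corpus.split("\n")

It can stay in the main function, but the second step would be better placed in its own function, and the third step would need to be split up further and is probably better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a postgrad for a couple of months to help you in place of some workshop requirement.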
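The three units above could be sketched roughly as follows in Python 3. This is only an illustration: `SAMPLE_PRON` is a tiny hand-written stand-in for `cmudict.dict()` so the sketch runs without downloading the corpus, and the function names and the "?" marker for unknown words are assumptions, not part of NLTK.

```python
import re

# Hypothetical stand-in for cmudict.dict(): word -> list of pronunciations,
# each pronunciation being a list of ARPAbet phonemes with stress digits on vowels.
SAMPLE_PRON = {
    "into": [["IH0", "N", "T", "UW1"]],
    "the": [["DH", "AH0"]],
    "world": [["W", "ER1", "L", "D"]],
    "fell": [["F", "EH1", "L"]],
}


def split_lines(corpus):
    """Step 1: split the raw text into non-empty lines."""
    return [line for line in corpus.split("\n") if line.strip()]


def stress_pattern(line, pron):
    """Step 2: build the stress string for one line ('?' marks unknown words)."""
    pattern = ""
    for word in re.findall(r"[a-z']+", line.lower()):
        if word in pron:
            phonemes = pron[word][0]  # first listed pronunciation
            pattern += "".join(p[-1] for p in phonemes if p[-1].isdigit())
        else:
            pattern += "?"
    return pattern


def classify(pattern):
    """Step 3: exact-match classification only; a real classifier would be fuzzier."""
    if pattern == "01" * 5:
        return "iambic pentameter"
    return "unknown"


def analyze(corpus, pron):
    """Run all three steps and return (line, pattern, label) tuples."""
    results = []
    for line in split_lines(corpus):
        pattern = stress_pattern(line, pron)
        results.append((line, pattern, classify(pattern)))
    return results
```

Keeping the dictionary as a parameter (`pron`) makes each unit easy to test in isolation; in a real script it would presumably be the result of `cmudict.dict()`.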

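Now to your other questions:

Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.

Words not in the dictionary / errors: the CMU dictionary home page says:

Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)

There is probably a way to add custom words to the dictionary or change existing ones; look on their site, or contact the dictionary maintainers directly. If you can't figure it out, you can also ask here in a separate question. There's bound to be someone on Stack Overflow who knows the answer or can point you to the right resource. Whatever you decide, it's worth contacting the maintainers and offering them your extra words and corrections anyway, to improve the dictionary.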
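One local workaround, sketched here under assumptions: in NLTK, `cmudict.dict()` returns an ordinary Python dict mapping each lowercase word to a list of pronunciations, so custom entries can be layered on top of it without touching the upstream dictionary. The helper name `with_overrides` and the pronunciation given for "zeon" are hypothetical, and a small literal dict stands in for the real data so the example runs on its own.

```python
def with_overrides(pron, overrides):
    """Return a copy of the pronouncing dict with custom entries added or replaced."""
    merged = dict(pron)  # shallow copy, so the original dict is left untouched
    for word, pronunciations in overrides.items():
        merged[word.lower()] = pronunciations
    return merged


# Stand-in for cmudict.dict(); in a real script this would be the NLTK dictionary.
base = {"the": [["DH", "AH0"], ["DH", "AH1"], ["DH", "IY0"]]}

# Hypothetical custom entry for a word the dictionary does not know.
custom = {"Zeon": [["Z", "IY1", "AA0", "N"]]}

prondict = with_overrides(base, custom)
```

Lookups such as `"zeon" in prondict` then succeed, while `base` itself is unchanged.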

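Classifying the input corpus when it doesn't exactly match a pattern: you might want to look at the link ykaganovich provided for fuzzy string comparison. Some algorithms to look for:

Levenshtein distance: measures how different two strings are as the number of changes needed to turn one string into the other. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives normalized values (0.0 for completely different, 1.0 for identical) and is suitable for classifying stress patterns. A CS postgrad or final-year undergraduate should not have too much trouble with it (hint, hint).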
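As a rough illustration of the first option, here is a plain dynamic-programming Levenshtein distance in Python 3, plus one simple way to work around the normalization problem by dividing by the longer string's length. The function names and this particular normalization are assumptions made for the sketch, not a standard API.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized score in [0.0, 1.0]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


observed = "1101111101"      # the pattern from the question
iambic_pentameter = "01" * 5
score = similarity(observed, iambic_pentameter)
```

A line could then be labeled with whichever known form scores highest, provided the score clears some threshold chosen by experiment.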

I think that covers all of your questions. Hope this helps a bit.

Answer 2

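To preserve newlines, split the text into lines first and send each line to the CMU parser separately.

For dealing with single-syllable words, you probably want to try both 0 and 1 whenever nltk returns 1 (it looks like nltk already returns 0 for some words that would never be stressed, like "the"). So you'll end up with multiple permutations: 1101111101 0101010101 1101010101

and so forth. Then you have to pick the ones that look like known forms.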
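A minimal sketch of that idea in Python 3, assuming the per-word stress strings have already been looked up. The names `KNOWN_FORMS`, `candidate_patterns` and `matching_forms` are illustrative, and only one form is listed.

```python
from itertools import product

# Illustrative table of known forms; a real one would hold many more patterns.
KNOWN_FORMS = {"iambic pentameter": "01" * 5}


def candidate_patterns(word_stresses):
    """Expand every stressed monosyllable ('1') into both '0' and '1'."""
    options = []
    for stress in word_stresses:
        if stress == "1":
            options.append(("0", "1"))  # ambiguous: try both readings
        else:
            options.append((stress,))   # keep polysyllables and '0' words as given
    return {"".join(combo) for combo in product(*options)}


def matching_forms(word_stresses):
    """Names of known forms that appear among the candidate patterns."""
    candidates = candidate_patterns(word_stresses)
    return sorted(name for name, pattern in KNOWN_FORMS.items() if pattern in candidates)


# One possible word-by-word split of the pattern 1101111101 from the question.
line_stresses = ["1", "1", "01", "1", "1", "1", "1", "01"]
forms = matching_forms(line_stresses)
```

The number of candidates doubles with every ambiguous word, which is manageable for single lines of verse but worth keeping in mind.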

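For non-dictionary words, I'd fudge it the same way: figure out the number of syllables (the dumbest way would be to count the vowels) and permute all possible stresses. Maybe add a few more rules, such as "ea is a single syllable, a trailing e is silent"...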
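A crude version of that heuristic might look like this in Python 3. It counts runs of consecutive vowels (so "ea" is one syllable) and drops a trailing "e"; the exception for words ending in "le" and the function names are assumptions added for the sketch, and the estimate will be wrong for plenty of English words.

```python
import re
from itertools import product


def estimate_syllables(word):
    """Very rough syllable estimate for a word that is not in the dictionary."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    # Treat a trailing 'e' as silent, except in endings like 'table'.
    if len(word) > 2 and word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    # Each run of vowels (including 'y') counts as one syllable.
    groups = re.findall(r"[aeiouy]+", word)
    return max(1, len(groups))


def all_stress_options(syllables):
    """Every possible 0/1 stress string for the given number of syllables."""
    return ["".join(bits) for bits in product("01", repeat=syllables)]


options_for_corinna = all_stress_options(estimate_syllables("Corinna"))
```

The resulting options can be fed into the same permutation-and-match step used for ambiguous dictionary words.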

I've never worked with other kinds of fuzzy matching, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.

Answer 3

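This is my first post on Stack Overflow, and I'm a Python newbie, so please excuse any deficiencies in code style. But I too am trying to extract accurate metre from poems. The code included in this question helped me, so I'm posting what I built on that foundation. It is one way to extract the stress as a single string, correct for the cmudict bias with a "fudge factor", and not lose words that are not in cmudict.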

import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line) 
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings 
#
# 'stress' in form '0101*,*110110'
#   -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'


def parseStressOfLine(line):

    stress=""
    stress_no_punct=""
    print(line)

    tokens = [words.lower() for words in nltk.word_tokenize(line)] 
    for word in tokens:        

        word_punct =  strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']

        #print word

        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress= stress+"*"+word+"*"
        else:
            zero_bool=True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s),word
                if strip_letters(s)=="0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool=False
                    break

            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct=stress_no_punct + strip_letters(prondict[word][0])

        if len(punct)>0:
            stress= stress+"*"+punct+"*"

    return {'stress':stress,'stress_no_punct':stress_no_punct}



# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = r'!()-[]{};:"\,<>./?@#$%^&*_~'
    my_str = word

    # remove punctuations from the string
    no_punct = ""
    punct=""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct+char

    return {'word':no_punct,'punct':punct}


# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws",ws
        for ch in list(ws):
            #print "ch",ch
            if ch.isdigit():
                nm=nm+ch
                #print "ad to nm",nm, type(nm)
    return nm


# TESTING  results 
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print(parseStressOfLine(line))
line = "Apollo play'd the midwife's part;"
print(parseStressOfLine(line))
line = "Into the world Corinna fell,"
print(parseStressOfLine(line))


""" 

OUTPUT 

This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}

"""

Discovering poetic form with NLTK and CMU Dict

3 answers: