NLP - Python - Conditional frequency distribution failing

Time: 2020-08-23 17:54:29

Tags: python nlp corpus

Code: this is a HackerRank problem, and both of the provided test cases fail. Could someone please help?

from nltk.corpus import brown
from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    from nltk.corpus import brown
    from nltk import ConditionalFreqDist
    from nltk.corpus import stopwords
    stopword = set(stopwords.words('english'))
    cdev_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword]
    #cdev_cfd = [list(x) for x in cdev_cfd]
    cdev_cfd = nltk.ConditionalFreqDist(cdev_cfd)
    a = cdev_cfd.tabulate(condition = cfdconditions, samples = cfdevents)
    inged_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
    inged_cfd = [list(x) for x in inged_cfd]
    for wd in inged_cfd:
        if wd[1].endswith('ing') and wd[1] not in stopword:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stopword:
            wd[1] = 'ed'

    inged_cfd = nltk.ConditionalFreqDist(inged_cfd)    
    b = inged_cfd.tabulate(cfdconditions, samples = ['ed','ing'])
    return(a,b)

The output for the failing test cases is:

                     many years 
    adventure    24    32 
        fiction    29    44 
science_fiction    11    16 
                  ed  ing 
        fiction 2943 1767 
      adventure 3281 1844 
science_fiction  574  293 


                 good    bad better 
      adventure     39      9     30 
        fiction     60     17     27 
        mystery     45     13     29 
science_fiction     14      1      4 
                  ed  ing 
      adventure 3281 1844 
        fiction 2943 1767 
science_fiction  574  293 
        mystery 2382 1374

Please help me get these test cases to pass; I can't figure out where I went wrong.

2 answers:

Answer 0 (score: 0)

Remove the following two lines:

cdev_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword]
cdev_cfd = nltk.ConditionalFreqDist(cdev_cfd) 

and replace them with:

cdev_cfd = nltk.ConditionalFreqDist([ (genre, word.lower()) for genre in brown.categories() for word in brown.words(categories=genre) if word.lower() not in stopword and genre in cfdconditions])
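
For what it's worth, the replacement builds the same (genre, word) pairs as the original two lines. What changes is the order in which the genres are visited: the corpus's own category order instead of the order of cfdconditions. Here is a minimal sketch using only the standard library; the dict toy_corpus and both helper functions are hypothetical stand-ins for brown, not part of NLTK:

```python
from collections import Counter

# Hypothetical stand-in for the Brown corpus: category -> list of words.
toy_corpus = {
    "adventure": ["The", "good", "years", "many"],
    "fiction": ["Many", "years", "the", "bad"],
    "mystery": ["good", "good", "the"],
}
stopword = {"the"}


def pairs_by_conditions(conditions):
    # Mirrors the original line: genres are visited in the requested order.
    return [(g, w.lower()) for g in conditions
            for w in toy_corpus[g] if w.lower() not in stopword]


def pairs_by_categories(conditions):
    # Mirrors the replacement: visit every category, keep the requested ones.
    return [(g, w.lower()) for g in toy_corpus
            for w in toy_corpus[g]
            if w.lower() not in stopword and g in conditions]


requested = ["fiction", "adventure"]
a = pairs_by_conditions(requested)
b = pairs_by_categories(requested)

# The counts are identical; only the first-seen order of the genres differs.
same_counts = Counter(a) == Counter(b)
order_a = list(dict.fromkeys(g for g, _ in a))
order_b = list(dict.fromkeys(g for g, _ in b))
print(same_counts, order_a, order_b)
```

If the expected output depends on row order, this difference in ordering is a plausible reason the change matters, though that is a guess about the grader rather than something the thread confirms.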

Answer 1 (score: 0)

Your code looks fine. To make it work, just change the keyword argument condition to conditions. Replace

a = cdev_cfd.tabulate(condition = cfdconditions, samples = cfdevents)
with  
a = cdev_cfd.tabulate(conditions = cfdconditions, samples = cfdevents)

It will then pass with flying colors :)
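
As to why the misspelled keyword raised no error: a function that collects its options through **kwargs and looks up only the names it knows will silently ignore anything else and fall back to its default. The sketch below illustrates that general Python behavior with a hypothetical function tabulate_like. It is not NLTK's implementation, and the idea that tabulate handles its options in a similar way is an assumption here. The counts are taken from the output in the question.

```python
def tabulate_like(table, **kwargs):
    # Hypothetical helper: only "conditions" and "samples" are ever looked up,
    # so a misspelled name such as "condition" is accepted and then ignored.
    conditions = kwargs.get("conditions", sorted(table))
    samples = kwargs.get(
        "samples", sorted({s for c in conditions for s in table[c]})
    )
    return [[c] + [table[c].get(s, 0) for s in samples] for c in conditions]


table = {
    "fiction": {"many": 29, "years": 44},
    "adventure": {"many": 24, "years": 32},
}
wanted = ["fiction", "adventure"]

# Typo: falls back to the default (sorted) row order.
typo_rows = tabulate_like(table, condition=wanted, samples=["many"])
# Correct spelling: rows follow the requested order.
ok_rows = tabulate_like(table, conditions=wanted, samples=["many"])

print(typo_rows)
print(ok_rows)
```

With the typo the rows come back in the default sorted order; with the correct spelling they follow the order that was asked for.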