Generating a PCFG in NLTK

Time: 2018-03-12 16:27:22

Tags: nltk context-free-grammar

I am trying to learn a PCFG from a file containing parse trees, for example:

(S(DECL_MD(NP_PPSS(PRON_PPSS(ii)))(VERB_MD(pt_verb_md need))(NP_NN(ADJ_AT(aa))(NOUN_NN(flight))(PREP_IN(pt_prep_in from)))(AVPNP_NP(NOUN_NP) (charlotte charlotte))

Here is my relevant code:

from nltk import tree

def loadData(path):
    # Read the file and keep its non-empty lines (one bracketed tree per line)
    with open(path, 'r') as f:
        data = [line for line in f.read().split('\n') if line.strip()]
    return data

def getTreeData(data):
    # Parse each bracketed string into an nltk Tree
    return [tree.Tree.fromstring(s) for s in data]

# Main script
print("loading data..")
data = loadData('C:\\Users\\Rayyan\\Desktop\\MSc Data\\NLP\\parseTrees.txt')
print("generating trees..")
treeData = getTreeData(data)
print("done!")

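After that I tried many things I found online, for example: grammar = induce_pcfg(S, productions). But in those examples the productions always come from a built-in corpus, for example: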

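from nltk.corpus import treebank

# Collect the productions of every parsed sentence in the first two treebank files
productions = []
for item in treebank.fileids()[:2]:
    for tree in treebank.parsed_sents(item):
        productions += tree.productions()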
In my case, I tried replacing "productions" with treeData, but it does not work. What am I missing or doing wrong?

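1 answer:

Answer 0: (score: 3)

induce_pcfg expects a flat list of productions, not Tree objects, so start from the trees you built and extract their productions: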

from nltk import tree
treeData_rules = []

# Extract the CFG rules (productions) from every parsed tree
for item in treeData:
    for production in item.productions():
        treeData_rules.append(production)
treeData_rules
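
To see what productions() yields, here is a minimal sketch on a single toy tree. The bracketed string and the names toy_tree and toy_rules are made up for illustration and are not taken from the asker's file.

```python
from nltk import Tree

# Hypothetical toy tree, made up for illustration (not from the asker's data)
toy_tree = Tree.fromstring(
    "(S (NP (PRON i)) (VP (VERB need) (NP (DET a) (NOUN flight))))"
)

# productions() returns one Production per labelled node of the tree,
# starting with the rule that expands the root
toy_rules = toy_tree.productions()
for rule in toy_rules:
    print(rule)
```

Each labelled node contributes one rule, so this tree with eight labelled nodes gives eight productions, including lexical ones such as the rule rewriting NOUN as 'flight'.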

Then you can induce the probabilistic CFG (PCFG) like this:

from nltk import Nonterminal, induce_pcfg

# Use S as the start symbol and estimate rule probabilities from the productions
S = Nonterminal('S')
grammar_PCFG = induce_pcfg(S, treeData_rules)
print(grammar_PCFG)
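
As a rough end-to-end check, the sketch below runs the same two steps on a two-sentence toy treebank. induce_pcfg estimates each rule's probability as the number of times the rule occurs divided by the number of times its left-hand side occurs. The tree strings and the toy_* names are assumptions made up for illustration.

```python
from nltk import Nonterminal, Tree, induce_pcfg

# Hypothetical toy treebank, made up for illustration
toy_strings = [
    "(S (NP (PRON i)) (VP (VERB need) (NP (DET a) (NOUN flight))))",
    "(S (NP (PRON i)) (VP (VERB need) (NP (PRON it))))",
]

# Step 1: flatten all trees into one list of productions
toy_productions = []
for s in toy_strings:
    toy_productions += Tree.fromstring(s).productions()

# Step 2: induce the PCFG with S as the start symbol
toy_grammar = induce_pcfg(Nonterminal("S"), toy_productions)

# Inspect the rules whose left-hand side is NP
for prod in toy_grammar.productions(lhs=Nonterminal("NP")):
    print(prod, prod.prob())
```

Here NP is expanded four times in total: three times to PRON and once to DET NOUN, so the two NP rules get probabilities 0.75 and 0.25.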