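Note: for this problem I cannot use any imports other than io and sys.
For an NLP assignment, I have to create a program that takes a grammar file and an utterance file as system arguments, which I have done.
The problem is that I am confused about how to implement the deterministic CYK algorithm that outputs the parsed string for a grammar in extended Chomsky Normal Form (eCNF).
I tried implementing a Node class, but I had a lot of trouble getting the output into the right form. I also found implementations of the probabilistic version of the CYK algorithm, but they confused me even more. I don't want any probability scores.
I tried to create a matrix P[i][j], but it did not come out triangular, and when I filled it with the words from my utterance line, it only kept the last word of the line.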
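For reference, a minimal sketch of how a triangular chart could be set up. The name `make_chart` and the dict-per-cell layout are assumptions made for illustration, not part of the assignment. The failing matrix code is not shown here, so this is only a guess at the cause: rows built with list multiplication all refer to the same list object, which can make every row look like the last one written.

```python
def make_chart(n, nonterminals):
    """Return an (n+1) x (n+1) chart in which only cells with i < j are filled."""
    # Build each row separately. [[None] * (n + 1)] * (n + 1) would instead
    # repeat one and the same row object n+1 times.
    chart = [[None] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            # Every used cell gets its own dict: non-terminal -> list of entries.
            chart[i][j] = dict((nt, []) for nt in nonterminals)
    return chart

nonterminals = ["S", "NP", "VP", "Det", "N", "PN", "V"]
words = "the dog bit the rat".split()
chart = make_chart(len(words), nonterminals)

# Word number j (counting from 1) belongs to the diagonal cell chart[j-1][j].
for j in range(1, len(words) + 1):
    print("%s -> chart[%d][%d]" % (words[j - 1], j - 1, j))
```

Cells with i >= j stay `None`, so the usable part of the chart is the upper triangle, and each word gets its own cell rather than overwriting a shared one.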
Here is the pseudocode I want to follow:
Set P to a (n+1) x (n+1) matrix
for j = 1 to Length(words) do
    for i = j-1 downto 0 do
        for each non-terminal N in G do
            P[i][j].NT[N] = empty array
for j = 1 to Length(words) do
    for each rule N -> words[j] in G do
        append (j, N -> words[j]) to P[j-1][j].NT[N]
    change = true
    while change do
        change = false
        for each non-terminal N in G do
            if P[j-1][j].NT[N] is not empty and
               there is a rule N' -> N in G and
               (j, N' -> N) is not in P[j-1][j].NT[N'] then
                append (j, N' -> N) to P[j-1][j].NT[N']
                change = true
for j = 2 to Length(words) do
    for i = j-2 downto 0 do
        for k = i+1 to j-1 do
            for each rule N -> A B in G such that
               P[i][k].NT[A] is nonempty and
               P[k][j].NT[B] is nonempty do
                append (k, N -> A B) to P[i][j].NT[N]
        change = true
        while change do
            change = false
            for each non-terminal N in G do
                if P[i][j].NT[N] is not empty and
                   there is a rule N' -> N in G and
                   (j, N' -> N) is not in P[i][j].NT[N'] then
                    append (j, N' -> N) to P[i][j].NT[N']
                    change = true
return P
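One possible Python rendering of the pseudocode above, written without any imports. The function names `cky_fill` and `unary_closure`, and the representation of rules as tuples (`(N, A, B)` for binary rules, `(N', N)` for unary rules, `(N, '"word"')` for lexical rules), are assumptions chosen for this sketch. The small grammar is hard-coded only to keep the example self-contained.

```python
def unary_closure(cell, unary, mark):
    """Keep adding (mark, N' -> N) entries to one cell until nothing changes."""
    change = True
    while change:
        change = False
        for parent, child in unary:
            entry = (mark, (parent, child))
            if cell[child] and entry not in cell[parent]:
                cell[parent].append(entry)
                change = True

def cky_fill(words, nonterminals, lexical, unary, binary):
    """Fill and return the chart P for the given words."""
    n = len(words)
    P = [[None] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            P[i][j] = dict((nt, []) for nt in nonterminals)
    # Lexical rules: the pseudocode counts words from 1, Python lists from 0.
    for j in range(1, n + 1):
        word = words[j - 1]
        for nt in lexical.get(word, []):
            P[j - 1][j][nt].append((j, (nt, '"%s"' % word)))
        unary_closure(P[j - 1][j], unary, j)
    # Binary rules: combine span i..k with span k..j for every split point k.
    for j in range(2, n + 1):
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for parent, a, b in binary:
                    if P[i][k][a] and P[k][j][b]:
                        P[i][j][parent].append((k, (parent, a, b)))
            unary_closure(P[i][j], unary, j)
    return P

# Hard-coded fragment of the example grammar (illustration only).
nonterminals = ["S", "NP", "VP", "Det", "N", "PN", "V"]
lexical = {"the": ["Det"], "dog": ["N"], "rat": ["N"], "bit": ["V"], "Bob": ["PN"]}
unary = [("NP", "PN")]
binary = [("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP")]

words = "the dog bit Bob".split()
P = cky_fill(words, nonterminals, lexical, unary, binary)
print(bool(P[0][len(words)]["S"]))  # non-empty S entry means the utterance parses
```

An utterance is accepted when `P[0][n]["S"]` is non-empty. The number stored with each entry is the split point for binary rules, which is what a later tree-building step would need.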
Here are two example input files:
Grammar
S -> NP VP
NP -> Det N
NP -> PN
Det -> "the"
N -> "dog"
N -> "rat"
N -> "elephant"
PN -> "Alice"
PN -> "Bob"
VP -> V NP
V -> "admired"
V -> "bit"
V -> "chased"
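A sketch of how grammar lines in this format could be sorted into lexical, unary, and binary rules using only string methods. The function name `read_grammar` and the returned structures are assumptions for illustration. The lines are given as a list here so the example runs on its own; an open file object could be passed the same way.

```python
def read_grammar(lines):
    """Split rules of the form 'LHS -> RHS' into three groups."""
    nonterminals, lexical, unary, binary = [], {}, [], []

    def note(symbol):
        # Remember each non-terminal once, in order of first appearance.
        if symbol not in nonterminals:
            nonterminals.append(symbol)

    for line in lines:
        if "->" not in line:
            continue  # skip blank or malformed lines instead of crashing
        lhs, rhs = [part.strip() for part in line.split("->", 1)]
        symbols = rhs.split()
        note(lhs)
        if len(symbols) == 1 and symbols[0].startswith('"'):
            # Lexical rule N -> "word": index it by the bare word.
            lexical.setdefault(symbols[0].strip('"'), []).append(lhs)
        elif len(symbols) == 1:
            note(symbols[0])
            unary.append((lhs, symbols[0]))
        elif len(symbols) == 2:
            note(symbols[0])
            note(symbols[1])
            binary.append((lhs, symbols[0], symbols[1]))
    return nonterminals, lexical, unary, binary

grammar_lines = [
    'S -> NP VP', 'NP -> Det N', 'NP -> PN', 'Det -> "the"',
    'N -> "dog"', 'N -> "rat"', 'N -> "elephant"', 'PN -> "Alice"',
    'PN -> "Bob"', 'VP -> V NP', 'V -> "admired"', 'V -> "bit"',
    'V -> "chased"', '',
]
nonterminals, lexical, unary, binary = read_grammar(grammar_lines)
print(binary)
print(unary)
print(sorted(lexical))
```

Each line is read exactly once and split on the first `->`, so a blank line or a second pass over an already exhausted file cannot cause an index error.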
Utterances
the aardvark bit the dog
the dog bit the man
Bob killed Alice
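So far, my program can tell when a sentence can be parsed and when it cannot. Now I need to take the utterances that can be parsed and actually parse them.
The output should look like this: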
[S [NP [Det "the"] [N "man"]] [VP [V "shot"] [NP [Det "the"] [N "elephant"]]]]
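A sketch of how a bracketed string in this format could be read back out of a filled chart. It assumes each chart entry is a pair `(mark, rule)`, where a binary rule is `(N, A, B)` with `mark` as the split point, a unary rule is `(N', N)`, and a lexical rule is `(N, '"word"')` with the quotes kept. The name `build_tree` and the hand-written chart for "Bob bit Alice" are illustrative assumptions, and taking the first entry in each cell is just one way to make the choice deterministic.

```python
def build_tree(P, i, j, nt):
    """Return one bracketed parse of the span i..j rooted in non-terminal nt."""
    mark, rule = P[i][j][nt][0]  # deterministic choice: always the first entry
    if len(rule) == 3:
        # Binary rule N -> A B: mark is the split point k.
        left = build_tree(P, i, mark, rule[1])
        right = build_tree(P, mark, j, rule[2])
        return "[%s %s %s]" % (nt, left, right)
    child = rule[1]
    if child.startswith('"'):
        # Lexical rule N -> "word": the quoted word is printed as stored.
        return "[%s %s]" % (nt, child)
    # Unary rule N' -> N: same span, descend into the child non-terminal.
    return "[%s %s]" % (nt, build_tree(P, i, j, child))

# Hand-written chart for "Bob bit Alice"; only non-empty cells are listed.
# Nested dicts are used so that P[i][j][N] reads the same as with lists.
P = {
    0: {1: {"PN": [(1, ("PN", '"Bob"'))], "NP": [(1, ("NP", "PN"))]},
        3: {"S": [(1, ("S", "NP", "VP"))]}},
    1: {2: {"V": [(2, ("V", '"bit"'))]},
        3: {"VP": [(2, ("VP", "V", "NP"))]}},
    2: {3: {"PN": [(3, ("PN", '"Alice"'))], "NP": [(3, ("NP", "PN"))]}},
}
print(build_tree(P, 0, 3, "S"))
# prints: [S [NP [PN "Bob"]] [VP [V "bit"] [NP [PN "Alice"]]]]
```

This avoids a separate Node class: the entries already stored in the chart act as back-pointers, and the recursion formats the string directly.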
Here is my program, with all the error-inducing code removed:
import sys
import io

# Usage: python CKYdet.py g#.ecfg u#L.utt
# Command-line arguments: argv[0] = script, argv[1] = grammar file, argv[2] = utterance file
script = sys.argv[0]
grammarFile = open(sys.argv[1])
utteranceFile = open(sys.argv[2])

# Parsing algorithm
def CKYparse(uttline):
    with open(sys.argv[1]) as rules:
        # The following two lines throw an index-out-of-bounds error.
        # Not sure I need to select grammar rules this way.
        # rhs = [line.split("-> ", 1)[1].strip('\n ') for line in rules]
        # lhs = [line.split(None, 1)[0] for line in rules]
        # Here I want to assign the words to their respective grammar rules.
        # Then I need to add each word to the matrix according to the grammar.
        # Then output the matrix with proper formatting.
        return "Valid parse goes here!"  # Temporary return value until parse matrix P can be returned

# Initialize arrays from grammarFile
ruleArray = []
wordsInQuotes = []
for line in grammarFile:
    rule = line.rstrip('\n')
    start = line.find('"') + 1
    end = line.find('"', start)
    ruleArray.append(rule)
    if start > 0 and end > start:  # only rules that contain a quoted word
        wordsInQuotes.append(line[start:end])  # collect the words from the grammar file
grammarFile.close()

# Print final output
# Check whether each line in utteranceFile can be parsed.
# If so, parse it.
# If not, print "No valid parse"
n = 0
for line in utteranceFile:
    uttline = line
    n = n + 1
    uttString = "Utterance #{}: {}".format(n, line)
    notValidString = "No valid parse\n"
    if all(x in wordsInQuotes for x in line.split()):  # every word is found in grammarFile
        print("".join((uttString, CKYparse(uttline))))
    else:
        print("".join((uttString, notValidString)))
utteranceFile.close()
I understand how the algorithm works in principle, but trying to write it in Python without NLTK is proving very tricky.