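OK, I finally got my grammar to capture all of my test cases, but I have a duplicate (case 3) and a false positive (case 6, "PATTERN 5"). Here are my test cases and my desired output.
I am still fairly new to Python (though able to teach my kids! Scary!), so I'm sure there are obvious ways to solve this; I'm not even sure this is a pyparsing issue. Here is what my output looks like now: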
['01/01/01','S01-12345','20/111-22-1001',['GLEASON', ['5', '+', '4'], '=', '9']]
['02/02/02','S02-1234','20/111-22-1002',['GLEASON', 'SCORE', ':', ['3', '+', '3'], '=', '6']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']]
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]]
['04/17/04','S04-123','30/111-22-1004',['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]]
['05/28/05','S05-1234','20/111-22-1005',['GLEASON', 'SCORE', '7', '[', ['3', '+', '4'], ']']]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', ['4', '+', '3']]]
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', 'PATTERN', '5']]
['07/22/07','S07-2749','20/111-22-1007',['GLEASON', 'SCORE', '6', '(', ['3', '+', '3'], ')']]
Here is the grammar:
from pyparsing import (Word, nums, oneOf, opAssoc, operatorPrecedence,
                       Combine, Optional, Group)

num = Word(nums)
arith_expr = operatorPrecedence(num,
    [
        (oneOf('-'), 1, opAssoc.RIGHT),
        (oneOf('* /'), 2, opAssoc.LEFT),
        (oneOf('+ -'), 2, opAssoc.LEFT),
    ])
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
score = (Optional(oneOf('( [')) +
         arith_expr('lhs') +
         Optional(oneOf(') ]')) +
         Optional(oneOf('= -')) +
         Optional(oneOf('( [')) +
         Optional(arith_expr('rhs')) +
         Optional(oneOf(') ]')))
gleason = Group("GLEASON" + Optional("SCORE") + Optional("GRADE") + Optional("PATTERN") + Optional(":") + score)
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")
And the output function:
lastPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            print("bad!")
            continue
        # getParts()
        FOUT.write("['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]\n".format(lastPatientData.patientData, match.gleason))
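As you can see, the output is not as good as it looks; I am just writing to a file and faking some syntax. I have been struggling to work out how to get at pyparsing's intermediate results so that I can work with them. Should I just write this out and run a second script that finds the duplicates?
Update, based on Paul McGuire's answer. The output of this function gets me down to one line per entry, but now I am losing parts of the score (each Gleason score, logically, has the form primary + secondary = total. This is headed for a database, so pri, sec, and tot are separate PostgreSQL columns or, for the parser's output, comma-separated values).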
accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated
            # Gleason scores for the previous one
            writeOut(accumPatientData)
        accumPatientData = (match.patientData, [])
    elif match.gleason:
        accumPatientData[1].append(match.gleason)
if accumPatientData is not None:
    writeOut(accumPatientData)
So now the output looks like this:
01/01/01,S01-12345,20/111-22-1001,9
02/02/02,S02-1234,20/111-22-1002,6
03/02/03,S03-1234,31/111-22-1003,7,4+3
04/17/04,S04-123,30/111-22-1004,
05/28/05,S05-1234,20/111-22-1005,3+4
06/18/06,S06-10686,20/111-22-1006,,
07/22/07,S07-2749,20/111-22-1007,3+3
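I would like to reach back in there, grab the elements that were lost, rearrange them, work out the ones that are missing, and put them all back in. Something like this pseudocode: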
def diceGleason(glrhs,gllhs)
    if glrhs.len() == 0:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = pri + sec
        return [pri, sec, tot]
    elif glrhs.len() == 1:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = glrhs
        return [pri, sec, tot]
    else:
        pri = glrhs[0]
        sec = glrhs[2]
        tot = gllhs
        return [pri, sec, tot]
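Update 2: OK, Paul is awesome, but I'm dumb. After trying exactly what he said, I tried several ways to get at pri, sec, and tot, but I failed. I keep getting an error like this: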
Traceback (most recent call last):
  File "Stage1.py", line 81, in <module>
    writeOut(accumPatientData)
  File "Stage1.py", line 47, in writeOut
    FOUT.write( "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n".format( pd, gleasonList))
AttributeError: 'list' object has no attribute 'pri'
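These attribute errors are what I keep getting. Clearly I don't understand what is happening in between (Paul, I have the book, I swear it is open in front of me, and I still don't understand). Here is my script. Is something in the wrong place? Am I calling the results wrong?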
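The traceback is consistent with the format string receiving the whole list of Gleason matches rather than a single match: `{1.pri}` looks up `pri` on the list object itself. The following is only a minimal sketch of that failure and one possible way around it. It uses `types.SimpleNamespace` as a stand-in for parsed results, and `format_rows` and the sample values are hypothetical names introduced for illustration, not part of the original script.

```python
from types import SimpleNamespace


def format_rows(pd, gleason_list):
    """Build one CSV line per Gleason entry, reading pri/sec/tot from each item."""
    template = "{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}"
    # Format against each element, not against the list that holds them.
    return [template.format(pd, gl) for gl in gleason_list]


# Stand-ins for the parsed patient data and one Gleason entry (assumed shape).
pd = SimpleNamespace(accDate='02/02/02', accNum='S02-1234',
                     patientNum='20/111-22-1002')
gleason_list = [SimpleNamespace(pri='3', sec='3', tot='6')]

try:
    # Same mistake as in the traceback: attribute lookup on the list itself.
    "{0.pri}".format(gleason_list)
except AttributeError as exc:
    print("reproduced:", exc)

for line in format_rows(pd, gleason_list):
    print(line)
```

Writing one line per Gleason entry is a design choice in this sketch; joining several entries onto a single line per patient would work the same way, as long as the attribute lookups happen on the individual items.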
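Answer 0 (score: 2)
I did not make any changes to your parser, but I did make a few changes to the post-parsing code.
You are not really getting "duplicates". The problem is that you print the current patient data every time you see a Gleason score, and some of your patient data records include more than one Gleason score entry. If I understand what you are trying to do, here is the pseudocode I would follow: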
accumulator = None
foreach match in (patientDataExpr | gleasonScoreExpr).searchString(source):
    if it's a patientDataExpr:
        if accumulator is not None:
            # we are starting a new patient data record, print out the previous one
            print out accumulated data
        initialize new accumulator with current match and empty list for gleason data
    else if it's a gleasonScoreExpr:
        add this expression into the current accumulator
# done with the for loop, do one last printout of the accumulated data
if accumulator is not None:
    print out accumulated data
This translates easily to Python:
def printOut(patientDataTuple):
    pd, gleasonList = patientDataTuple
    print("['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]".format(
        pd, ','.join(''.join(gl.rhs) for gl in gleasonList)))

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated
            # Gleason scores for the previous one
            printOut(accumPatientData)
        # start accumulating for a new patient data entry
        accumPatientData = (match.patientData, [])
    elif match.gleason:
        accumPatientData[1].append(match.gleason)
    #~ print match.dump()
if accumPatientData is not None:
    printOut(accumPatientData)
I don't think I am dumping out the Gleason data correctly, but I think you can adjust it from here.
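The accumulate-then-flush pattern can also be tried out without pyparsing. The sketch below is an illustration only: it assumes the matches have already been reduced to `(kind, value)` pairs, and `group_scores` and the `'patient'` / `'gleason'` labels are hypothetical names, not part of the script above. It also skips a score that appears before any patient record, which the loop above does not guard against.

```python
def group_scores(events):
    """Group ('patient', data) and ('gleason', score) events into one row per patient."""
    rows = []
    current = None  # (patient_data, [scores]) for the record being accumulated
    for kind, value in events:
        if kind == 'patient':
            if current is not None:
                # A new patient record starts: flush the previous one.
                rows.append(current)
            current = (value, [])
        elif kind == 'gleason':
            if current is None:
                # Score with no preceding patient record: skip it.
                continue
            current[1].append(value)
    # Done with the loop: flush the last accumulated record.
    if current is not None:
        rows.append(current)
    return rows


events = [
    ('patient', '03/02/03'),
    ('gleason', '4+3=7'),
    ('gleason', '7=4+3'),
    ('patient', '06/18/06'),
    ('gleason', '4+3'),
]
for patient, scores in group_scores(events):
    print(patient, scores)
```

Returning the rows instead of printing inside the loop keeps the grouping logic separate from the output format, so the same rows could feed a file writer or a database insert.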
EDIT:
You can attach diceGleason as a parse action on gleason and get this behavior:
def diceGleasonParseAction(tokens):
    def diceGleason(glrhs, gllhs):
        if len(glrhs) == 0:
            pri = gllhs[0]
            sec = gllhs[2]
            #~ tot = pri + sec
            tot = str(int(pri)+int(sec))
            return [pri, sec, tot]
        elif len(glrhs) == 1:
            pri = gllhs[0]
            sec = gllhs[2]
            tot = glrhs
            return [pri, sec, tot]
        else:
            pri = glrhs[0]
            sec = glrhs[2]
            tot = gllhs
            return [pri, sec, tot]

    pri, sec, tot = diceGleason(tokens.gleason.rhs, tokens.gleason.lhs)
    # assign results names for later use
    tokens.gleason['pri'] = pri
    tokens.gleason['sec'] = sec
    tokens.gleason['tot'] = tot

gleason.setParseAction(diceGleasonParseAction)
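You had just one mistake: you summed pri and sec to get tot, but these are all strings, so you were adding '3' and '4' and getting '34'. Converting to int to do the addition was all that was needed. Otherwise, I kept diceGleason verbatim inside diceGleasonParseAction, to keep your logic for inferring pri, sec, and tot separate from the mechanics of decorating the parsed tokens with the new results names. Since the parse action does not return anything new, the tokens are updated in place and then carried along for later use in your output method.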
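The string-versus-integer point can be checked on its own, without pyparsing. The sketch below is a simplified variant of the logic, not the code above: it assumes the two sides of a score arrive as plain lists of strings such as `['3', '+', '4']` and `['7']`, and `dice_gleason` and the `None` return for unrecognized shapes are assumptions made for illustration.

```python
def dice_gleason(lhs, rhs):
    """Return (pri, sec, tot) as strings, or None if no 'a + b' part is found.

    Assumes one side looks like ['3', '+', '4'] and the other side is either
    empty or holds a single total such as ['7'] (hypothetical input shape).
    """
    # Whichever side has three tokens carries primary + secondary.
    parts, other = (lhs, rhs) if len(lhs) == 3 else (rhs, lhs)
    if len(parts) != 3:
        # Not a primary + secondary form, e.g. a lone pattern number.
        return None
    pri, sec = parts[0], parts[2]
    if other:
        tot = other[0]
    else:
        # '3' + '4' would concatenate to '34'; convert to int before adding.
        tot = str(int(pri) + int(sec))
    return pri, sec, tot


print(dice_gleason(['3', '+', '4'], []))     # total computed from the parts
print(dice_gleason(['7'], ['4', '+', '3']))  # total written before the parts
```

Returning `None` for input that is not in primary + secondary form lets the caller decide whether to drop such an entry or log it, rather than failing with an IndexError.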