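Python regex: capturing string groups

Time: 2015-10-28 00:37:04

Tags: python regex

I have a set of medical reports from which I am trying to capture 6 groups (groups 5 and 6 are optional):

(clinical details | clinical indication) + (text1) + (result | report) + (text2) + (interpretation | conclusion) + (text3).

The regex I am using is: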

reportPat=re.compile(r'(Clinical details|indication)(.*?)(result|description|report)(.*?)(Interpretation|conclusion)(.*)',re.IGNORECASE|re.DOTALL)

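This works, except that it fails on strings that lack the optional groups. I tried adding a question mark after group 5, like this: (Interpretation|conclusion)?(.*) but then that group gets merged into group 4. I am pasting two contrasting strings (one containing groups 5 and 6, the other without them) so people can test their regexes. Thanks for your help.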

Text 1 (all groups present)

Technical Report:\nAdministrations:\n1.04 ml of Fluorine 18, fluorodeoxyglucose with aco - Bronchus and lung\nJA - Staging\n\nClinical Details:\nSquamous cell lung cancer, histology confirmed ?stage\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion. \n\nThere is a large mass noted in the left upper lobe proximally, with lower grade uptake within a collapsed left upper lobe. This lesi\n\nInterpretation: \nThe scan findings are in keeping with the known lung primary in the left upper lobe and involvement of the lymph nodes as dThere is no evidence of distant metastatic disease.

Text 2 (without groups 5 and 6)

Technical Report:\nAdministrations:\n0.81 ml of Fluorine 18, fluorodeoxyglucose with activity 312.79\nScanner: 3D Static\nPatient Position: Supine, Head First. Arms up\n\n\nDiagnosis Codes:\n- Bronchus and lung\nJA - Staging\n\nClinical Indication:\nNewly diagnosed primary lung cancer with cranial metastasis. PET scan to assess any further metastatic disease.\n\nScanner DST 3D\n\nSession 1 - \n\n.\n\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion.\n\nThere is increased FDG uptake in the right lower lobe mass abutting the medial and posterior pleura with central necrosis (maximum SUV 18.2). small nodule at the right paracolic gutte

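2 answers:

Answer 0: (score: 1)

It seems that what you are missing is basically an end-of-pattern anchor, which keeps the lazy match from stopping early when it is combined with the optional presence of groups 5 and 6. This regex should do the trick while keeping the current group numbering: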

reportPat=re.compile(
    r'(Clinical details|indication)(.*)'
    r'(result|description|report)(.*?)'
    r'(?:(Interpretation|conclusion)(.*))?$',
    re.IGNORECASE|re.DOTALL)

The changes are adding $ to the end and enclosing the last two groups in an optional non-capturing group, (?: ... )?. Also note how splitting the pattern across lines (adjacent string literals are concatenated automatically at compile time) makes the whole regex easier to read.
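To see the effect of the trailing anchor and the optional non-capturing group, here is a minimal sketch that runs the pattern from this answer against two shortened strings. The strings are made-up stand-ins for the reports in the question, and the names report_pat, with_tail and without_tail are only for illustration.

```python
import re

# Pattern as quoted in this answer; adjacent raw strings are concatenated.
report_pat = re.compile(
    r'(Clinical details|indication)(.*)'
    r'(result|description|report)(.*?)'
    r'(?:(Interpretation|conclusion)(.*))?$',
    re.IGNORECASE | re.DOTALL)

# Hypothetical, shortened stand-ins for the two reports in the question.
with_tail = ("Clinical Details:\nLung cancer\n"
             "Result:\nLarge mass\n"
             "Interpretation:\nNo metastases")
without_tail = "Clinical Indication:\nStaging\nResult:\nIncreased uptake"

for text in (with_tail, without_tail):
    m = report_pat.search(text)
    # groups() always has 6 entries; 5 and 6 are None when the tail is absent.
    print(m.groups())
```

With the tail missing, the lazy group 4 has to run to the end of the string because nothing else can satisfy $, and groups 5 and 6 come back as None instead of making the whole match fail.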

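Added: When looking at the match results, I saw some stray :\n sequences at the start of the text groups. They can easily be cleaned up by inserting an optional non-capturing group of colons and whitespace between each header and its text group: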

(?:[:\s]*)?
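As a rough sketch of where that separator could go, the pattern below places it directly after each header group. The exact placement is an assumption, since only the separator itself is shown here, and the names SEP, clean_pat and sample are made up for the example.

```python
import re

# Optional run of colons and whitespace; non-capturing, so group numbers stay the same.
SEP = r'(?:[:\s]*)?'

# Assumed placement: one separator right after each header keyword.
clean_pat = re.compile(
    r'(Clinical details|indication)' + SEP + r'(.*)'
    r'(result|description|report)' + SEP + r'(.*?)'
    r'(?:(Interpretation|conclusion)' + SEP + r'(.*))?$',
    re.IGNORECASE | re.DOTALL)

# Hypothetical, shortened stand-in for a report.
sample = ("Clinical Details:\nLung cancer\n"
          "Result:\nLarge mass\n"
          "Interpretation:\nNo metastases")

m = clean_pat.search(sample)
print(m.groups())
```

The text groups no longer start with a colon and newline, although a trailing newline before the next header is still captured, so calling strip() on each group may still be worthwhile. The outer ? on the separator is redundant because * already allows an empty match; it is kept here only to match the snippet above.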

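Added 2: At this link, https://regex101.com/r/gU9eV7/3, you can see the regex in action. I have also added some unit test cases there to verify that it works against both texts: for text 1 the optional groups match, while for text 2 they contain nothing. I used this in parallel with directly editing a Python script to verify my answer.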

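Answer 1: (score: 0)

The following pattern works for your test cases, but given the format of the data you need to parse, I cannot be confident that it will work in every case (for example, I have added a : after each keyword match to try to prevent accidental matches on more common words such as result or description):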

re.compile(
    r'(Clinical details|indication):(.+?)(result|description|report):(.+?)((Interpretation|conclusion):(.+?)){0,1}\Z',
    re.IGNORECASE|re.DOTALL
    )

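I grouped the last two groups together and marked the wrapper as optional with {0,1}. This means the output groups differ slightly from your original pattern: you get one extra group. Counting from 0 in the tuple returned by groups(), index 4 now holds the combined output of the last two groups, and their individual data ends up at indices 5 and 6 (that is, group(5), group(6) and group(7) when using 1-based group numbers).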
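The shift in numbering is easiest to see by printing the groups. This is a small sketch using a made-up, shortened report; the names pat and sample are only for illustration.

```python
import re

# Pattern from this answer: the optional tail is wrapped in a capturing group.
pat = re.compile(
    r'(Clinical details|indication):(.+?)(result|description|report):(.+?)'
    r'((Interpretation|conclusion):(.+?)){0,1}\Z',
    re.IGNORECASE | re.DOTALL)

# Hypothetical, shortened stand-in for a report with all sections present.
sample = ("Clinical Details:\nLung cancer\n"
          "Result:\nLarge mass\n"
          "Interpretation:\nNo metastases")

m = pat.search(sample)
# Number each captured group, starting at 1 as group() does.
for number, value in enumerate(m.groups(), start=1):
    print(number, repr(value))
```

Writing the wrapper as (?: ... ){0,1} instead would be one way to keep the original six-group numbering, if that matters to the calling code.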