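I have a big pile of information: one very large text file, about 200k lines. The file was built by merging the text of thousands of PDF pages (evidently extracted via OCR). The contents are "meeting minutes" from a medical board. Within them there is a recurring pattern of key information, like this: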
##-## (this is a numbered designation of the 'case')
ACTION: [.....] (this is a sentence that describes what procedure or action is being taken with this 'case')
DECISION [.....] (this is a sentence that describes the outcome or decision of a medical board about this specific case and action)
Here is a live example (with some data scrambled for obvious medical information reasons)
06-02 Cancer and bubblegum trials Primary Investigator:
"Dr. Strangelove, Ph.D."
"ACTION: At the January 4, 2015 meeting, request for review and approval of the Application for Initial Review"
and attachments for the above-referenced study.
"DECISION: After discussing the risks and safety of the human subjects that will take part in this study, the Board"
approved the submitted documents and initiation of the study. Waiver of Consent granted.
"Approval Period: January 4, 2015 – January 3, 2016"
"Total = 6. Vote: For = 6, Against = 0, Abstain = 0"
I need to extract very simple key information that ends up looking like this:
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
So the key criteria are the ##-## field, plus the keywords ACTION and DECISION and whatever sentence follows each of them.
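So far, using regular expressions in TextWrangler, I have been able to match
(\d\d-\d\d)
or (ACTION)
or (DECISION)
What I am struggling with is how to select all the other text and delete it, or simply copy these groupings and put them into another file.
I plan to use the regex, plus whatever else is needed, in a Bash file run from the text editor. Any help is very welcome, as I am a noob at regular expressions and new to Bash scripting.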
Answer 0 (score: 0)
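Assuming there is a small typo in the input file and it reads DECISION: ... rather than DECISION ..., you can do this easily with awk. All we have to do is check whether a line starts with DECISION, ACTION, or ##-##. The regular expression is /^(##-##)|^(ACTION)|^(DECISION)/. Note that ##-## here matches the literal placeholder text; for real case numbers such as 06-02, substitute [0-9][0-9]-[0-9][0-9] (awk does not understand \d). The resulting awk one-liner is: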
$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' /path/to/file
Example usage:
$ head -n7 file
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
Here is a live example (with some data scrambled for obvious medical
information reasons)
$ awk '/^(##-##)|^(ACTION)|^(DECISION)/ { print }' file
##-##
ACTION: Initial Application for Review
DECISION: Initial Application Approved by Board
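For the real minutes, a variant along the following lines may be closer to what is needed. It is only a sketch under assumptions: case numbers are taken to be two digits, a hyphen, and two digits (as in 06-02); some lines are taken to start with a stray double quote, as in the question's excerpt; and sample.txt and extracted.txt are hypothetical file names standing in for the real input and output.

```shell
#!/bin/sh
# sample.txt is a hypothetical stand-in for the real 200k-line file,
# built from the excerpt shown in the question.
cat > sample.txt <<'EOF'
06-02 Cancer and bubblegum trials Primary Investigator:
"Dr. Strangelove, Ph.D."
"ACTION: At the January 4, 2015 meeting, request for review and approval of the Application for Initial Review"
and attachments for the above-referenced study.
"DECISION: After discussing the risks and safety of the human subjects that will take part in this study, the Board"
approved the submitted documents and initiation of the study. Waiver of Consent granted.
"Total = 6. Vote: For = 6, Against = 0, Abstain = 0"
EOF

# Keep only lines that start (after an optional double quote) with a
# case number, ACTION, or DECISION, and copy them into a separate file.
awk '/^"?([0-9][0-9]-[0-9][0-9]|ACTION|DECISION)/ { print }' sample.txt > extracted.txt

cat extracted.txt
```

This keeps only the first physical line of each match, so a sentence that wraps onto the next line (as both ACTION and DECISION do in the excerpt) is cut off at the line break.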
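If the ACTION and DECISION data sit between square brackets, you will need an additional regular expression to extract the information; leave a comment if that is the case.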
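If the brackets do turn out to be literal, one possible approach is sketched below. It assumes each ACTION or DECISION line carries a single [ ... ] pair on the same line; the input lines are invented for illustration, and index() and substr() are used instead of a second regular expression to sidestep bracket-escaping differences between awk implementations.

```shell
#!/bin/sh
# Hypothetical bracketed input; the real layout may differ.
result=$(printf '%s\n' \
    '06-02 Cancer and bubblegum trials' \
    'ACTION: [Initial Application for Review]' \
    'DECISION [Initial Application Approved by Board]' \
    'Total = 6. Vote: For = 6, Against = 0, Abstain = 0' |
awk '
/^[0-9][0-9]-[0-9][0-9]/ { print $1; next }   # print the case number only
/^(ACTION|DECISION)/ {
    key = $1
    sub(/:$/, "", key)                         # accept the keyword with or without a colon
    lb = index($0, "[")                        # position of the opening bracket
    rb = index($0, "]")                        # position of the closing bracket
    if (lb && rb > lb)
        print key ": " substr($0, lb + 1, rb - lb - 1)
}')

printf '%s\n' "$result"
```

With the sample lines above, this prints 06-02, then ACTION: Initial Application for Review, then DECISION: Initial Application Approved by Board, each on its own line.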