Input file:
DATE: 07/01/15 @ 0800 HYRULE HOSPITAL PAGE 1
USER: LINK Antibiotic Resistance Report
--------------------------------------------------------------------------------------------
Activity Date Range: 01/01/15 - 02/01/15
--------------------------------------------------------------------------------------------
HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD
15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD
Source: SPUTUM
PSEUDOMONAS FLUORESCENS LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
HH0000000002 FAIRY,GREAT 25/F <DIS IN 01/06> (UJ00000002) A002-01 0A ZELDA,PRINCESS MD
15:M0000002R COMP, Coll: 01/03/15-2025 Recd: 01/03/15-2035 (R#00000002) ZELDA,PRINCESS MD
Source: URINE- STRAIGHT CATH
PROTEUS MIRABILIS CEFTRIAXONE-other R
--------------------------------------------------------------------------------------------
HH0000000003 MAN,OLD 85/M <DIS IN 01/07> (UJ00000003) A003-01 0A ZELDA,PRINCESS MD
15:M0000003R COMP, Coll: 01/04/15-1800 Recd: 01/04/15-1800 (R#00000003) ZELDA,PRINCESS MD
Source: URINE-CLEAN VOIDED SPEC
ESCHERICHIA COLI LEVOFLOXACIN >=8 R
--------------------------------------------------------------------------------------------
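I'm brand new to programming/scripting and to Python. How would you suggest looping through this sample input to grab specific text from the fields?
Each patient has a unique identifier (e.g. HH0000000001). I want to grab specific text from each line.
The output should look like this: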
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK, DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY, GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
Edit: my current code looks like this:
(Disclaimer: I'm fumbling around in the dark, so the code won't be pretty at all.)
input = open('report.txt')
output = open('abx.txt', 'w')
date = ''  # Defining global variables outside of the loop
time = ''
name = ''
name_last = ''
name_first = ''
account = ''
specimen = ''
source = ''
output.write('Date|Time|Name|Account|Specimen|Source\n')
lines = input.readlines()
for index, line in enumerate(lines):
    print index, line
    if last_line_location:
        new_patient = True
        if not first_time_through:
            output.write("{}|{}|{}, {}|{}|{}|{}\n".format(
                'Date',  # temporary placeholder
                'Time',  # temporary placeholder
                name_last.capitalize(),
                name_first.capitalize(),
                account,
                'Specimen',  # temporary placeholder
                'Source'  # temporary placeholder
            ))
        last_line_location = False
        first_time_through = False
    for each in lines:
        if line.startswith('HH'):  # Extract account and name
            account = line.split()[0]
            name = line.split()[1]
            name_last = name.split(',')[0]
            name_first = name.split(',')[1]
            last_line_location = True
input.close()
output.close()
Currently, the output skips the first patient and only shows information for the second and third patients. The output looks like this:
Date|Time|Name|Account|Specimen|Source
Date|Time|Fairy, Great|HH0000000002|Specimen|Source
Date|Time|Man, Old|HH0000000003|Specimen|Source
Please feel free to suggest improvements, including to the output format or the overall strategy.
Answer 0 (score: 1)
Your code actually works if you add...
    last_line_location = True
    first_time_through = True
...before your for loop.
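To see why the starting values matter, the flag logic can be replayed on a tiny in-memory list. This is only a sketch: `written_accounts` and the sample lines are hypothetical, the redundant inner loop from the question is left out, and starting `last_line_location` as False is an assumption (the question does not show how the flags were first set) that happens to reproduce the reported symptom.

```python
def written_accounts(lines, last_line_location):
    """Replay the question's flag logic; return the accounts it would write."""
    first_time_through = True
    account = ''
    written = []
    for line in lines:
        if last_line_location:
            # The very first pass through this branch only clears the flags.
            if not first_time_through:
                written.append(account)
            last_line_location = False
            first_time_through = False
        if line.startswith('HH'):
            account = line.split()[0]
            last_line_location = True
    return written


# Hypothetical, shortened stand-in for report.txt
lines = ['report title', 'HH0000000001 LINK,DARK', 'Source: SPUTUM',
         'HH0000000002 FAIRY,GREAT', 'Source: URINE']

print(written_accounts(lines, last_line_location=True))
# ['HH0000000001', 'HH0000000002']
print(written_accounts(lines, last_line_location=False))
# ['HH0000000002']
```

With `last_line_location` starting as True, the "first time through" pass is used up on the report title line, so the first patient is no longer the record that gets skipped.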
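You also asked for pointers...
As suggested in the comments, you could look at the re module.
I've knocked something together that demonstrates this. It may not suit all data, because three records is a very small sample and I've made some assumptions.
The last item is also quite contrived, because there is no keyword to search on (unlike Coll and Source). For example, it will fail if the last line of a record does not start with spaces.
This code is merely a suggestion of another way of doing things: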
import re

startflag = False
with open('report.txt', 'r') as infile:
    with open('abx.txt', 'w') as outfile:
        outfile.write('Date|Time|Name|Account|Specimen|Source|Antibiotic\n')
        for line in infile:
            # A dashed line closes the previous record (if one was started)
            if '---------------' in line:
                if startflag:
                    outfile.write('|'.join((date, time, name, account, spec, source, anti)) + '\n')
                else:
                    startflag = True
                continue
            # The dashed line after the report header must not trigger a write
            if 'Activity' in line:
                startflag = False
            # Raw strings keep the regex backslashes intact
            acc_name = re.findall(r'HH\d+ \w+,\w+', line)
            if acc_name:
                account, name = acc_name[0].split(' ')
            date_time = re.findall(r'(?<=Coll: ).+(?= Recd:)', line)
            if date_time:
                date, time = date_time[0].split('-')
            source_re = re.findall(r'(?<=Source: ).+', line)
            if source_re:
                source = source_re[0].strip()
            # Relies on leading spaces and on the column layout of the result line
            anti_spec = re.findall(r'^ +(?!Source)\w+ *\w+ + \S+', line)
            if anti_spec:
                stripped_list = anti_spec[0].strip().split()
                anti = stripped_list[-1]
                spec = ' '.join(stripped_list[:-1])
Output
Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK,DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY,GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
01/04/15|1800|MAN,OLD|HH0000000003|ESCHERICHIA COLI|URINE-CLEAN VOIDED SPEC|LEVOFLOXACIN
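The requested output formats the name as `LINK, DARK`, whereas the output above keeps `LINK,DARK`. As a rough sketch of one way to close that gap, named groups can pull the last and first name apart so they can be re-joined with a comma and a space. The pattern names `HEADER` and `COLLECTED` are made up for this illustration, and the patterns have only been checked against the sample lines from the question.

```python
import re

# Hypothetical patterns, written against the sample report lines only
HEADER = re.compile(r'^(?P<account>HH\d+)\s+(?P<last>[^,\s]+),(?P<first>\S+)')
COLLECTED = re.compile(r'Coll:\s*(?P<date>\d{2}/\d{2}/\d{2})-(?P<time>\d{4})')

header = ('HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) '
          'A001-01 0A ZELDA,PRINCESS MD')
specimen = ('15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 '
            '(R#00000001) ZELDA,PRINCESS MD')

m = HEADER.search(header)
account = m.group('account')
# Re-join the two name parts with ", " as in the requested output
name = '{}, {}'.format(m.group('last'), m.group('first'))

c = COLLECTED.search(specimen)
row = '|'.join((c.group('date'), c.group('time'), name, account))
print(row)
# 01/02/15|0800|LINK, DARK|HH0000000001
```

Named groups make it a little easier to see which part of the line feeds which output column than positional `split()` calls do.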
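Edit:
Obviously, to guard against corrupt records, the variables should be reset to some dummy value between writes. Also, as the code stands, the last record will not be written if there is no dashed line after it.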
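Both points can be sketched roughly as follows. This is not a drop-in replacement: `parse_report`, the `'?'` dummy value and the in-memory `sample` list are all hypothetical, and the result line is assumed to separate the organism from the antibiotic with two or more spaces (an assumption about the real file layout).

```python
import re

FIELDS = ('date', 'time', 'name', 'account', 'spec', 'source', 'anti')
SEPARATOR = '-' * 15


def parse_report(lines, dummy='?'):
    """Yield one pipe-delimited row per patient record."""
    record = dict.fromkeys(FIELDS, dummy)
    seen_patient = False
    for line in lines:
        if SEPARATOR in line:
            if seen_patient:
                yield '|'.join(record[f] for f in FIELDS)
            # Reset so a corrupt record cannot inherit stale values
            record = dict.fromkeys(FIELDS, dummy)
            seen_patient = False
            continue
        header = re.match(r'\s*(HH\d+)\s+(\w+,\w+)', line)
        if header:
            record['account'], record['name'] = header.groups()
            seen_patient = True
            continue
        coll = re.search(r'Coll:\s*(\d{2}/\d{2}/\d{2})-(\d{4})', line)
        if coll:
            record['date'], record['time'] = coll.groups()
            continue
        source = re.search(r'Source:\s*(.+)', line)
        if source:
            record['source'] = source.group(1).strip()
            continue
        # Assumption: columns on the result line are 2+ spaces apart
        columns = re.split(r'\s{2,}', line.strip())
        if seen_patient and len(columns) >= 2:
            record['spec'], record['anti'] = columns[0], columns[1]
    # Flush the last record even without a trailing dashed line
    if seen_patient:
        yield '|'.join(record[f] for f in FIELDS)


sample = [
    '-' * 40,
    'HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD',
    '  15:M0000001R COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD',
    '    Source: SPUTUM',
    '    PSEUDOMONAS FLUORESCENS      LEVOFLOXACIN  >=8  R',
    '-' * 40,
    # Deliberately corrupt record: no Coll line, no result line, no closing dashes
    'HH0000000002 FAIRY,GREAT 25/F <DIS IN 01/06> (UJ00000002) A002-01 0A ZELDA,PRINCESS MD',
    '    Source: URINE- STRAIGHT CATH',
]

rows = list(parse_report(sample))
for row in rows:
    print(row)
# 01/02/15|0800|LINK,DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
# ?|?|FAIRY,GREAT|HH0000000002|?|URINE- STRAIGHT CATH|?
```

Keeping the fields in a dict makes the reset a single line, and the `'?'` placeholders make an incomplete record visible in the output instead of silently repeating the previous patient's values.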