Python: How to loop through blocks of lines and copy specific text within lines

Time: 2015-07-14 20:54:25

Tags: python text-processing

Input file:

DATE: 07/01/15 @ 0800                 HYRULE HOSPITAL                         PAGE 1
USER: LINK                      Antibiotic Resistance Report
--------------------------------------------------------------------------------------------
Activity Date Range: 01/01/15 - 02/01/15
--------------------------------------------------------------------------------------------
HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD
15:M0000001R    COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD
    Source: SPUTUM                                  
       PSEUDOMONAS FLUORESCENS            LEVOFLOXACIN   >=8   R                            
--------------------------------------------------------------------------------------------
HH0000000002 FAIRY,GREAT   25/F <DIS IN 01/06> (UJ00000002) A002-01 0A ZELDA,PRINCESS MD    
15:M0000002R    COMP, Coll: 01/03/15-2025 Recd: 01/03/15-2035 (R#00000002) ZELDA,PRINCESS MD
    Source: URINE- STRAIGHT CATH                    
   PROTEUS MIRABILIS                  CEFTRIAXONE-other      R                          
--------------------------------------------------------------------------------------------
HH0000000003 MAN,OLD   85/M <DIS IN 01/07> (UJ00000003) A003-01 0A ZELDA,PRINCESS MD 
15:M0000003R    COMP, Coll: 01/04/15-1800 Recd: 01/04/15-1800 (R#00000003) ZELDA,PRINCESS MD
    Source: URINE-CLEAN VOIDED SPEC                 
   ESCHERICHIA COLI                   LEVOFLOXACIN   >=8   R                            
--------------------------------------------------------------------------------------------

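I'm brand new to programming/scripting and to Python. How would you suggest looping through this sample input to grab specific text from the fields?

Each patient has a unique identifier (e.g. HH0000000001). I want to grab specific text from each line.

The output should look like this: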

Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK, DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY, GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other

EDIT: My current code looks like this:

(Disclaimer: I'm fumbling around in the dark, so the code isn't going to be pretty at all.)

input = open('report.txt')
output = open('abx.txt', 'w')

date = ''  # Defining global variables outside of the loop
time = ''
name = ''
name_last = ''
name_first = ''
account = ''
specimen = ''
source = ''

output.write('Date|Time|Name|Account|Specimen|Source\n')
lines = input.readlines()

for index, line in enumerate(lines):
    print index, line

    if last_line_location:
        new_patient = True
        if not first_time_through:
            output.write("{}|{}|{}, {}|{}|{}|{}\n".format(
                'Date', # temporary placeholder
                'Time', # temporary placeholder
                name_last.capitalize(),
                name_first.capitalize(),
                account,
                'Specimen', # temporary placeholder
                'Source' # temporary placeholder
                ) )
        last_line_location = False
        first_time_through = False

    for each in lines:
        if line.startswith('HH'):  # Extract account and name
            account = line.split()[0]
            name = line.split()[1]
            name_last = name.split(',')[0]
            name_first = name.split(',')[1]
            last_line_location = True

input.close()
output.close()

Currently, the output skips the first patient and only shows information for the second and third patients. The output looks like this:

Date|Time|Name|Account|Specimen|Source
Date|Time|Fairy, Great|HH0000000002|Specimen|Source
Date|Time|Man, Old|HH0000000003|Specimen|Source

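Please feel free to suggest how to improve any aspect of this, including the output style or the overall strategy.

1 Answer:

Answer 0: (score: 1)

Your code actually works if you add...

last_line_location = True
first_time_through = True

...before your for loop.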

You also asked for pointers...

As was suggested in the comments, you could look at the re module.
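For a quick feel of what re can do on a single line from the report, here is a tiny sketch, separate from the fuller script further down. It uses re.search (rather than re.findall) purely for illustration, and the sample line is copied from the input in the question.

```python
import re

line = ('15:M0000001R    COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 '
        '(R#00000001) ZELDA,PRINCESS MD')

# The lookbehind (?<=...) and lookahead (?=...) anchor the match between
# "Coll: " and " Recd:" without including either marker in the result.
match = re.search(r'(?<=Coll: ).+(?= Recd:)', line)
if match:
    date, time = match.group(0).split('-')
    print(date, time)  # prints: 01/02/15 0800
```

The same idea (find a fixed marker, then take what follows it) carries over to the Source: line.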

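I've knocked something together that shows this in action. It may not suit all of your data, because three records is a very small sample and I've made some assumptions. The last item is also rather contrived, because there is no fixed marker to search for (such as Coll: or Source:). For example, it will fail if there are no spaces at the start of the final line.

This code is merely a suggestion of another way of doing things: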

import re

startflag = False
with open('report.txt', 'r') as infile:
    with open('abx.txt', 'w') as outfile:
        outfile.write('Date|Time|Name|Account|Specimen|Source|Antibiotic\n')
        for line in infile:
            # A dashed line ends a block; write the record once past the header.
            if '---------------' in line:
                if startflag:
                    outfile.write('|'.join((date, time, name, account, spec, source, anti)) + '\n')
                else:
                    startflag = True
                continue
            # The "Activity Date Range" line sits between two dashed header lines.
            if 'Activity' in line:
                startflag = False

            # Account number and name, e.g. "HH0000000001 LINK,DARK"
            acc_name = re.findall(r'HH\d+ \w+,\w+', line)
            if acc_name:
                account, name = acc_name[0].split(' ')

            # Collection date and time: the text between "Coll: " and " Recd:"
            date_time = re.findall(r'(?<=Coll: ).+(?= Recd:)', line)
            if date_time:
                date, time = date_time[0].split('-')

            # Everything after "Source: "
            source_re = re.findall(r'(?<=Source: ).+', line)
            if source_re:
                source = source_re[0].strip()

            # Indented line that is not the Source line: two-word specimen, then antibiotic
            anti_spec = re.findall(r'^ +(?!Source)\w+ *\w+ + \S+', line)
            if anti_spec:
                stripped_list = anti_spec[0].strip().split()
                anti = stripped_list[-1]
                spec = ' '.join(stripped_list[:-1])

Output:

Date|Time|Name|Account|Specimen|Source|Antibiotic
01/02/15|0800|LINK,DARK|HH0000000001|PSEUDOMONAS FLUORESCENS|SPUTUM|LEVOFLOXACIN
01/03/15|2025|FAIRY,GREAT|HH0000000002|PROTEUS MIRABILIS|URINE- STRAIGHT CATH|CEFTRIAXONE-other
01/04/15|1800|MAN,OLD|HH0000000003|ESCHERICHIA COLI|URINE-CLEAN VOIDED SPEC|LEVOFLOXACIN
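On the specimen/antibiotic pattern, which is the contrived one: an alternative worth trying is to split that line on runs of spaces instead of matching it word by word. This is only a sketch, and it assumes the columns on that line are always separated by at least two spaces while words inside a column are separated by one. The helper name split_columns is made up for illustration.

```python
import re


def split_columns(line):
    """Split a report line into columns separated by 2+ spaces (assumed layout)."""
    return [col for col in re.split(r' {2,}', line.strip()) if col]


row = '       PSEUDOMONAS FLUORESCENS            LEVOFLOXACIN   >=8   R      '
columns = split_columns(row)
# columns == ['PSEUDOMONAS FLUORESCENS', 'LEVOFLOXACIN', '>=8', 'R']
spec, anti = columns[0], columns[1]
print(spec, anti, sep='|')  # prints: PSEUDOMONAS FLUORESCENS|LEVOFLOXACIN
```

Under that assumption, this also copes with specimen names longer than two words, which the regex in the script above does not. You would still need some way to recognise which line to split.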

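EDIT:
Obviously, to cope with a corrupt record, the variables should be reset to some dummy value between writes. Also, as it stands, the last record will not be written if there is no line of dashes after it.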
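A rough sketch of those two fixes, assuming the same report layout as the sample: the record is reset to empty strings at every dashed line, and whatever is left over is flushed once the input ends. The names FIELDS and parse_report are made up for illustration, and the second sample record deliberately has no Source line and no trailing dashes.

```python
import re

FIELDS = ('date', 'time', 'name', 'account', 'spec', 'source', 'anti')


def parse_report(lines):
    """Yield one dict per patient block; missing fields stay as ''."""
    record = dict.fromkeys(FIELDS, '')
    for line in lines:
        if line.startswith('-----'):
            # A dashed line closes the block: emit it only if it holds an
            # account number, then reset so nothing leaks into the next one.
            if record['account']:
                yield record
            record = dict.fromkeys(FIELDS, '')
            continue
        m = re.search(r'(HH\d+) (\w+,\w+)', line)
        if m:
            record['account'], record['name'] = m.groups()
        m = re.search(r'Coll: (\S+)-(\d{4}) Recd:', line)
        if m:
            record['date'], record['time'] = m.groups()
        m = re.search(r'Source: (.+)', line)
        if m:
            record['source'] = m.group(1).strip()
        m = re.search(r'^ +(?!Source)\w+ *\w+ + \S+', line)
        if m:
            parts = m.group(0).split()
            record['anti'] = parts[-1]
            record['spec'] = ' '.join(parts[:-1])
    # Flush a final block that has no trailing dashed line.
    if record['account']:
        yield record


sample = [
    'HH0000000001 LINK,DARK 30/M <DIS IN 01/05> (UJ00000001) A001-01 0A ZELDA,PRINCESS MD',
    '15:M0000001R    COMP, Coll: 01/02/15-0800 Recd: 01/02/15-0850 (R#00000001) ZELDA,PRINCESS MD',
    '    Source: SPUTUM',
    '       PSEUDOMONAS FLUORESCENS            LEVOFLOXACIN   >=8   R',
    '-' * 92,
    'HH0000000003 MAN,OLD   85/M <DIS IN 01/07> (UJ00000003) A003-01 0A ZELDA,PRINCESS MD',
    '15:M0000003R    COMP, Coll: 01/04/15-1800 Recd: 01/04/15-1800 (R#00000003) ZELDA,PRINCESS MD',
    '   ESCHERICHIA COLI                   LEVOFLOXACIN   >=8   R',
]

records = list(parse_report(sample))
for rec in records:
    print('|'.join(rec[f] for f in FIELDS))
```

With this sample, the second record comes out with an empty Source field rather than inheriting SPUTUM from the first one, and it is still written even though no dashed line follows it.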