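Using regex to read mailing addresses of varying length from a text file

Time: 2018-07-12 06:33:39

Tags: regex python-3.x text-files generator

I am trying to read a text file and collect the addresses from it. Here is an example of one of the entries in the text file: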

Electrical Vendor                                                                                    Contact:        John Smith                                                              Phone #:        123-456-7890
              Address:              1234 ADDRESS ROAD                                                           Ship To:
                                    Suite 123                                                                                    ,
                                    Nowhere, CA United States 12345
              Phone:                234-567-8901                                       E-Mail:                  john.smith@gmail.com
              Fax:                  345-678-9012                                       Web Address:             www.electricalvendor.com
              Acct. No:             123456                                                   Monthly Due Date:                                Days Until Due
              Tax ID:                                                                                                   Fed 1099 Exempt                                 Discount On Assets Only
              G/L Liab. Override:
              G/L Default Exp:
              Comments:
                                   APPROVED FOR ELECTRICAL THINGS

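I can't wrap my head around how to search for and store each entry's address when the number of lines in the address varies. Currently, I have a generator that reads each line of the file. The get_addrs() method then looks for markers in the line, namely the keywords Address: and Ship, to tell when an address needs to be stored. Next, I use a regex to search for a ZIP code in the line that follows the one containing the Address: keyword. I think I've figured out how to save the second line successfully for all addresses with this approach. However, some addresses contain a suite number or other information, which makes the address three lines long instead of two. I wasn't sure how to handle that, so I tried extending the save_previous() method to three lines, but I couldn't get it quite right. Here is the code that successfully saves all of the two-line addresses: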

import re


class GetAddress():
    def __init__(self):
        self.line1 = []
        self.line2 = []
        self.s_line1 = []
        self.addr_index = 0
        self.ship_index = 0
        self.no_ship = False
        self.addr_here = False
        self.prev_line = []
        self.us_zip = ''

    # Check if there is a shipping address.
    def set_no_ship(self, line):
        try:
            self.no_ship = line.index(',') == len(line) - 1
        except ValueError:
            pass

    # Save two lines at a time to see whether or not the previous 
    # line contains 'Address:' and 'Ship'.
    def save_previous(self, line):
        self.prev_line += [line]

        if len(self.prev_line) > 2:
            del self.prev_line[0]

    def get_addrs(self, line):
        self.addr_here = 'Address:' in line and 'Ship' in line
        self.po_box = False
        self.no_ship = False
        self.addr_index = 0
        self.ship_index = 0
        self.zip1_index = 0

        self.set_no_ship(line)
        self.save_previous(line)

        # Check if 'Address:' and 'Ship' are in the previous line.
        self.prev_addr = (
            'Address:' in self.prev_line[0]
            and 'Ship' in self.prev_line[0])

        if self.addr_here:
            self.po_box = 'Box' in line or 'BOX' in line
            self.addr_index = line.index('Address:') + 1
            self.ship_index = line.index('Ship')

            # Get the contents of the line between 'Address:' and
            # 'Ship' if both words are present in this line.
            # Compare the indexes by value; 'is' tests object identity and
            # only happens to work for small integers in CPython.
            if self.addr_index != self.ship_index:
                self.line1 += [' '.join(line[self.addr_index:self.ship_index])]

            else:
                self.line1 += ['']

        if len(self.prev_line) > 1 and self.prev_addr:
            self.po_box = 'Box' in line or 'BOX' in line
            self.us_zip = re.search(r'(\d{5}(\-\d{4})?)', ' '.join(line))
            if self.us_zip and not self.po_box:
                self.zip1_index = line.index(self.us_zip.group(1))

            if self.no_ship:
                self.line2 += [' '.join(line[:line.index(',')])]

            elif self.zip1_index and not self.no_ship:
                self.line2 += [' '.join(line[:self.zip1_index + 1])]

            elif len(self.line1) > 0 and not self.line1[-1]:
                self.line2 += ['']


# Create a generator to read each line of the file.
def read_gen(infile):
    with open(infile, 'r') as file:
        for line in file:
            yield line.split()


infile = 'Vendor List.txt'
info = GetAddress()

for i, line in enumerate(read_gen(infile)):
    info.get_addrs(line)

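I'm still a beginner with Python, so I'm sure a lot of this code is redundant or unnecessary. I'd appreciate some feedback on how to simplify it while capturing both two-line and three-line addresses.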
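For reference, the rule described above (start at the line that contains both Address: and Ship, then keep collecting lines until one contains a ZIP code) can be sketched without tracking token indexes. This is only a rough sketch under assumptions, not a tested solution for the real file: the names collect_addresses, ZIP_RE, and GAP_RE are made up for illustration, the address is assumed to be the leftmost text on each continuation line, columns are assumed to be separated by two or more spaces, and a Phone: line is assumed to end the block if no ZIP code was seen.

```python
import re

# US ZIP code, optionally in ZIP+4 form.
ZIP_RE = re.compile(r'\b\d{5}(?:-\d{4})?\b')
# Assumption: columns on a line are separated by runs of 2+ spaces.
GAP_RE = re.compile(r'\s{2,}')


def collect_addresses(lines):
    """Return one list of address lines per 'Address: ... Ship' entry."""
    addresses = []
    current = None  # None means we are not inside an address block
    for raw in lines:
        if 'Address:' in raw and 'Ship' in raw:
            # The text between the two markers is the first address line.
            first = raw.split('Address:', 1)[1].split('Ship', 1)[0].strip()
            current = [first]
            continue
        if current is None:
            continue
        text = raw.strip()
        if text.startswith('Phone:'):
            # Safety net: the block ended without a ZIP code.
            addresses.append(current)
            current = None
            continue
        # Keep only the leftmost column; the rest belongs to 'Ship To'.
        left = GAP_RE.split(text)[0] if text else ''
        if left and left != ',':
            current.append(left)
        if ZIP_RE.search(left):
            # A ZIP code marks the last line of the address.
            addresses.append(current)
            current = None
    if current is not None:
        addresses.append(current)
    return addresses
```

Unlike read_gen() above, this expects the raw line strings rather than pre-split token lists, so two-line and three-line addresses go through the same path. A continuation line that happens to contain a five-digit number (for example a long suite number) would end the block early, which is a known weakness of matching on the ZIP code alone.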

1 Answer:

Answer 0 (score: 0)

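I also posted this question to Reddit, where u/Binary101010 pointed out that the text file is fixed-width, so each line could be sliced in a way that selects only the necessary address information. Using that insight, I added some functionality to the generator expression and was able to produce the desired result with the following code: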

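The code from the answer is not reproduced here. As a stand-in, the fixed-width idea can be sketched as follows. This is a hypothetical illustration, not the answer's actual code: the column offsets ADDR_START and ADDR_END and the function names address_field and read_addresses are assumptions, and the offsets would need to be measured against the real file.

```python
# Hypothetical column offsets; measure the real ones in your own file.
ADDR_START = 36   # first column of the address text
ADDR_END = 112    # column where the 'Ship To:' field begins


def address_field(line):
    """Return the address column of one fixed-width line, stripped."""
    return line[ADDR_START:ADDR_END].strip()


def read_addresses(lines):
    """Group the address column into one list per vendor entry."""
    addresses = []
    current = None  # None means we are not inside an address block
    for line in lines:
        # Everything left of the address column is the label column.
        label = line[:ADDR_START].strip()
        if label == 'Address:':
            # Start a new address with the text next to the label.
            current = [address_field(line)]
            addresses.append(current)
        elif current is not None and not label:
            # An empty label column means the address continues.
            field = address_field(line)
            if field:
                current.append(field)
        else:
            # Any other label (for example 'Phone:') ends the block.
            current = None
    return addresses
```

Because the address is selected by column position, a suite line needs no special handling: every line with an empty label column after Address: is treated as part of the same address. With a real file, the lines could be supplied by a generator expression such as (line.rstrip('\n') for line in file) inside a with open(...) block.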