I am trying to read a text file and collect addresses from it. Here is an example of one of the entries in the text file:
Electrical Vendor Contact: John Smith Phone #: 123-456-7890
Address: 1234 ADDRESS ROAD Ship To:
Suite 123 ,
Nowhere, CA United States 12345
Phone: 234-567-8901 E-Mail: john.smith@gmail.com
Fax: 345-678-9012 Web Address: www.electricalvendor.com
Acct. No: 123456 Monthly Due Date: Days Until Due
Tax ID: Fed 1099 Exempt Discount On Assets Only
G/L Liab. Override:
G/L Default Exp:
Comments:
APPROVED FOR ELECTRICAL THINGS
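I can't wrap my head around how to search for and store each entry's address when the number of lines in the address varies. Currently, I have a generator that reads each line of the file. The get_addrs() method then tries to catch markers in the file, such as the Address: and Ship keywords, that signal when an address needs to be stored. I then use a regular expression to search for a ZIP code in the line that follows the line containing the Address: keyword. I think I have figured out how to save the second line of every address this way. However, a few of the addresses contain a suite number or other information, which makes the address three lines long instead of two. I wasn't sure how to handle this, so I tried extending the save_previous() method to three lines, but I couldn't get it quite right. Here is the code that successfully saves all of the two-line addresses: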
import re


class GetAddress():
    def __init__(self):
        self.line1 = []
        self.line2 = []
        self.s_line1 = []
        self.addr_index = 0
        self.ship_index = 0
        self.no_ship = False
        self.addr_here = False
        self.prev_line = []
        self.us_zip = ''

    # Check if there is a shipping address.
    def set_no_ship(self, line):
        try:
            self.no_ship = line.index(',') == len(line) - 1
        except ValueError:
            pass

    # Save two lines at a time to see whether or not the previous
    # line contains 'Address:' and 'Ship'.
    def save_previous(self, line):
        self.prev_line += [line]
        if len(self.prev_line) > 2:
            del self.prev_line[0]

    def get_addrs(self, line):
        self.addr_here = 'Address:' in line and 'Ship' in line
        self.po_box = False
        self.no_ship = False
        self.addr_index = 0
        self.ship_index = 0
        self.zip1_index = 0
        self.set_no_ship(line)
        self.save_previous(line)
        # Check if 'Address:' and 'Ship' are in the previous line.
        self.prev_addr = (
            'Address:' in self.prev_line[0]
            and 'Ship' in self.prev_line[0])
        if self.addr_here:
            self.po_box = 'Box' in line or 'BOX' in line
            self.addr_index = line.index('Address:') + 1
            self.ship_index = line.index('Ship')
            # Get the contents of the line between 'Address:' and
            # 'Ship' if both words are present in this line.
            # Compare the indices by value (!=, ==), not by identity (is).
            if self.addr_index != self.ship_index:
                self.line1 += [' '.join(line[self.addr_index:self.ship_index])]
            elif self.addr_index == self.ship_index:
                self.line1 += ['']
        if len(self.prev_line) > 1 and self.prev_addr:
            self.po_box = 'Box' in line or 'BOX' in line
            self.us_zip = re.search(r'(\d{5}(\-\d{4})?)', ' '.join(line))
            if self.us_zip and not self.po_box:
                self.zip1_index = line.index(self.us_zip.group(1))
            if self.no_ship:
                self.line2 += [' '.join(line[:line.index(',')])]
            elif self.zip1_index and not self.no_ship:
                self.line2 += [' '.join(line[:self.zip1_index + 1])]
            elif len(self.line1) > 0 and not self.line1[-1]:
                self.line2 += ['']


# Create a generator to read each line of the file.
def read_gen(infile):
    with open(infile, 'r') as file:
        for line in file:
            yield line.split()


infile = 'Vendor List.txt'
info = GetAddress()
for i, line in enumerate(read_gen(infile)):
    info.get_addrs(line)
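I am still a beginner with Python, so I'm sure a lot of this code is redundant or unnecessary. I would appreciate some feedback on how I could simplify and shorten it while capturing both two-line and three-line addresses.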
Answer 0 (score: 0)
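I also posted this question to Reddit, and u/Binary101010 pointed out that the text file is fixed-width, so it should be possible to slice each line in a way that selects only the necessary address information. Using that insight, I added some functionality to the generator and was able to produce the desired result.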
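A minimal sketch of the fixed-width slicing idea, not the original poster's code. It assumes the vendor address sits in a left-hand column that is 50 characters wide (LEFT_COL is a hypothetical boundary; measure the real file and adjust), that an address block starts on the line beginning with Address:, and that it ends on the line whose left-hand column ends in a ZIP code or, failing that, at the next Phone: line. The names collect_addresses, LEFT_COL, STOP_LABEL and the sample lines are all made up for illustration.

```python
import re

# Hypothetical layout: the left-hand (vendor address) column is assumed to
# occupy characters 0-49, with the "Ship To:" column starting at 50.
LEFT_COL = slice(0, 50)
START_LABEL = "Address:"
STOP_LABEL = "Phone:"  # safety stop if no ZIP code is ever found
# Five-digit ZIP with an optional +4 suffix at the end of the sliced text.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?$")


def collect_addresses(lines):
    """Yield each vendor address as a list of its lines (two, three, ...)."""
    current = None  # None while we are outside an address block
    for raw in lines:
        # Keep only the left-hand column, so "Ship To:" text is never seen.
        left = raw.rstrip("\n")[LEFT_COL].strip()
        if left.startswith(START_LABEL):
            current = [left[len(START_LABEL):].strip()]
            continue
        if current is None:
            continue
        if left.startswith(STOP_LABEL):
            yield current
            current = None
            continue
        # Drop the dangling comma on lines such as "Suite 123 ,".
        text = left.rstrip(" ,")
        if text:
            current.append(text)
        if ZIP_RE.search(text):
            yield current
            current = None


# Made-up sample padded to the assumed column width.
sample = [
    "Address: 1234 ADDRESS ROAD".ljust(50) + "Ship To:",
    "         Suite 123 ,",
    "         Nowhere, CA United States 12345",
    "Phone: 234-567-8901".ljust(50) + "E-Mail: john.smith@gmail.com",
    "Address: 5 MAIN ST".ljust(50) + "Ship To:",
    "         Reno, NV United States 89501-1234",
    "Phone: 234-567-8901".ljust(50) + "E-Mail: someone@example.com",
]

for address in collect_addresses(sample):
    print(address)
```

Because the block keeps collecting until it sees a ZIP code, it does not need to know in advance whether an address has two or three lines. In practice the generator could be fed the open file object directly instead of the sample list. PO boxes and entries with a shipping address are not treated specially in this sketch.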