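I have a list of addresses that I want to put into a dataframe, where each row is a new address and the columns are the units of the address (title, street, city).
However, the list is structured so that some addresses are longer than others. For example: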
address = ['123 Some Street, City','45 Another Place, PO Box 123, City']
I have a pandas dataframe with the following columns:
Index  Court    Address                            Zipcode  Phone
0      Court 1  123 Court Dr, Springfield          12345    11111
1      Court 2  45 Court Pl, PO Box 45, Pawnee     54321    11111
2      Court 3  1725 Slough Ave, Scranton          18503    11111
3      Court 4  101 Court Ter, Unit 321, Eagleton  54322    11111
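I want to split the "Address" column into up to three columns, depending on how many comma separators the address contains, with NaN filling the positions where a value would be missing.
For example, I would like the data to look like this: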
Index  Court    Address          Address2   City         Zip  Phone
0      Court 1  123 Court Dr     NaN        Springfield  ...  ...
1      Court 2  45 Court Pl      PO Box 45  Pawnee       ...  ...
2      Court 3  1725 Slough Ave  NaN        Scranton     ...  ...
3      Court 4  101 Court Ter    Unit 321   Eagleton     ...  ...
I have researched this thoroughly and tried many different solutions from StackOverflow, to no avail. The closest I have gotten is this code:
df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)
But this returns a dataframe with the following three columns appended:
...  0             1            2
...  123 Court Dr  Springfield  None
...  45 Court Pl   PO Box 45    Pawnee
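This is close, but as you can see, for the shorter entries the city ends up in the same column as the second address line of the longer entries.
Ideally, column 2 should contain the city in every row, and column 1 should alternate between None and the second address line.
I hope this makes sense; it is hard to put into words. Thanks!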
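Answer 0 (score: 0)
Addresses, especially ones entered by hand, can be tricky. However, if your addresses only come in these two formats, you can use the following:
Note: if there are other formats you need to account for, the offending addresses will be printed out.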
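def split_address(df):
    for index, row in df.iterrows():
        full_address = row['Address']
        # Strip the whitespace left around each comma-separated piece
        split = [part.strip() for part in full_address.split(',')]
        # Two commas: street, second address line, city
        if full_address.count(',') == 2:
            # Write through df.at; assigning to `row` would not modify df
            df.at[index, 'address_1'] = split[0]
            df.at[index, 'address_2'] = split[1]
            df.at[index, 'city'] = split[2]
        # One comma: street, city
        elif full_address.count(',') == 1:
            df.at[index, 'address_1'] = split[0]
            df.at[index, 'city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))
    return df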
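Basically, the two things that should help you are the str.count() method, which tells you the number of commas in a string, and str.split(), which you have already discovered and which splits the input into a list. You can index into this list to assign each piece to the correct column.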
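To make the count-and-split idea concrete, here is a minimal sketch on plain strings, with no dataframe involved. The helper name `split_parts` is hypothetical, and raising `ValueError` for unknown formats is my own choice here rather than the printing used above.

```python
def split_parts(address):
    """Return (address_1, address_2, city); address_2 is None when absent."""
    # Split on commas and drop the whitespace around each piece
    pieces = [piece.strip() for piece in address.split(",")]
    commas = address.count(",")
    if commas == 2:
        # street, second address line, city
        return pieces[0], pieces[1], pieces[2]
    if commas == 1:
        # street, city
        return pieces[0], None, pieces[1]
    raise ValueError("address does not fit known formats: {0}".format(address))


print(split_parts("123 Court Dr, Springfield"))
print(split_parts("45 Court Pl, PO Box 45, Pawnee"))
```

Returning a fixed-length tuple keeps the city in the same position whether or not a second address line is present.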
Answer 1 (score: 0)
You can do the following:
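# First piece is always the street address
df['Address1'] = df['Address'].str.split(',').str[0].str.strip()
# Text between the first and last comma; NaN when there is only one comma
df['Address2'] = df['Address'].str.extract(r',(.*),', expand=False).str.strip()
# Last piece is always the city
df['City'] = df['Address'].str.split(',').str[-1].str.strip()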
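As a rough illustration of the same idea on two rows from the question, the sketch below splits once and reuses the result. It is a variation, not the exact code above: it assumes every address has either two or three comma-separated parts, and it uses `Series.where` in place of the regular expression to leave the second address line as NaN for two-part addresses.

```python
import pandas as pd

# Two sample rows copied from the question
df = pd.DataFrame({
    "Court": ["Court 1", "Court 2"],
    "Address": ["123 Court Dr, Springfield", "45 Court Pl, PO Box 45, Pawnee"],
})

# Split each address into a list of pieces
parts = df["Address"].str.split(",")

# The first piece is the street and the last piece is the city
df["Address1"] = parts.str[0].str.strip()
df["City"] = parts.str[-1].str.strip()

# Keep the middle piece only when there are exactly three pieces; NaN otherwise
df["Address2"] = parts.str[1].str.strip().where(parts.str.len() == 3)

print(df[["Court", "Address1", "Address2", "City"]])
```

Indexing the city from the end (`-1`) is what keeps it in one column regardless of how many pieces precede it.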
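Answer 2 (score: 0)
You might consider building functions around the usaddress package. It was very helpful to me when I needed to split addresses into their parts: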
import pandas as pd
import usaddress

df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])
Then create the functions that split the data:
def Address1(x):
    # Street line: number + street name + street type; returns None if any part is missing
    try:
        data = usaddress.tag(x)
        if 'AddressNumber' in data[0].keys() and 'StreetName' in data[0].keys() and 'StreetNamePostType' in data[0].keys():
            return data[0]['AddressNumber'] + ' ' + data[0]['StreetName'] + ' ' + data[0]['StreetNamePostType']
    except Exception:
        # Addresses that cannot be tagged yield None
        pass

def Address2(x):
    # Second address line: either a unit/suite or a PO box
    try:
        data = usaddress.tag(x)
        if 'OccupancyType' in data[0].keys() and 'OccupancyIdentifier' in data[0].keys():
            return data[0]['OccupancyType'] + ' ' + data[0]['OccupancyIdentifier']
        elif 'USPSBoxType' in data[0].keys() and 'USPSBoxID' in data[0].keys():
            return data[0]['USPSBoxType'] + ' ' + data[0]['USPSBoxID']
    except Exception:
        pass

def PlaceName(x):
    # City name
    try:
        data = usaddress.tag(x)
        if 'PlaceName' in data[0].keys():
            return data[0]['PlaceName']
    except Exception:
        pass

df['Address1'] = df.apply(lambda x: Address1(x['Address']), axis=1)
df['Address2'] = df.apply(lambda x: Address2(x['Address']), axis=1)
df['City'] = df.apply(lambda x: PlaceName(x['Address']), axis=1)
Output:
   Address                              Address1      Address2    City
0  123 Main St. Suite 100 Chicago, IL   123 Main St.  Suite 100   Chicago
1  123 Main St. PO Box 100 Chicago, IL  123 Main St.  PO Box 100  Chicago