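I'm having trouble matching a list against a DataFrame column and extracting the specific matched value from each row of that column.
Dataset: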
address
0 58 Chatham Street, Chatham, New Jersey, 07928
1 3420 W. MacArthur Blvd. Ste. C, Santa Ana, California
2 2016 Chalk Rd, Wake Forest, North Carolina, 27587
I have a list of state names:
state = ['New York','New Jersey','California',...]
Expected result:
address State
0 58 Chatham Street, Chatham, New Jersey, 07928 New Jersey
1 3420 W. MacArthur Blvd. Ste. C, Santa Ana, California California
2 2016 Chalk Rd, Wake Forest, North Carolina, 27587 North Carolina
Code I tried:
for i in state:
    ship_add['state'] = ship_add['address'].str.strip(i)
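For context on why the attempt above does not produce the expected column: `str.strip` treats its argument as a set of characters to trim from both ends of each string, not as a substring to find, and the loop overwrites the column on every iteration. A minimal sketch, using two of the sample addresses (the variable names are only illustrative):

```python
import pandas as pd

s = pd.Series([
    "58 Chatham Street, Chatham, New Jersey, 07928",
    "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California",
])

# The argument is a character set: characters from it are trimmed from both ends.
# Neither address starts or ends with a character from "New Jersey", so nothing changes.
stripped_nj = s.str.strip("New Jersey")

# The second address ends in characters that all belong to "California",
# so that trailing word is removed rather than extracted.
stripped_ca = s.str.strip("California")

print(stripped_nj.tolist())
print(stripped_ca.tolist())
```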
Answer 0 (score: 1)
Use:
state = ['New York','New Jersey','California','North Carolina']
# match whole words only, using word boundaries
pat = '|'.join(r"\b{}\b".format(x) for x in state)
# if word boundaries are not needed:
#pat = '|'.join(state)
df['State'] = df['address'].str.extract('('+ pat + ')', expand=False)
print (df)
address State
0 58 Chatham Street, Chatham, New Jersey, 07928 New Jersey
1 3420 W. MacArthur Blvd. Ste. C, Santa Ana, Cal... California
2 2016 Chalk Rd, Wake Forest, North Carolina, 27587 North Carolina
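A self-contained sketch of the same regex approach, rebuilding the sample frame so it can be run as is. The only change from the answer is `re.escape`, added here as a precaution in case a name in the list ever contains a regex metacharacter; that is an assumption about the data, not something the question states.

```python
import re
import pandas as pd

df = pd.DataFrame({"address": [
    "58 Chatham Street, Chatham, New Jersey, 07928",
    "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California",
    "2016 Chalk Rd, Wake Forest, North Carolina, 27587",
]})
state = ["New York", "New Jersey", "California", "North Carolina"]

# Build one alternation pattern; escape each name so it is matched literally
pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in state)

# The capture group returns the first state name found in each address (NaN if none)
df["State"] = df["address"].str.extract("(" + pat + ")", expand=False)
print(df["State"].tolist())
# ['New Jersey', 'California', 'North Carolina']
```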
If you want to match the comma-separated parts of the address exactly instead:
state = ['New York','New Jersey','California','North Carolina']
df1 = df['address'].str.split(', ', expand=True)
df['State'] = df1.where(df1.isin(state)).ffill(axis=1).iloc[:, -1]
print (df)
address State
0 58 Chatham Street, Chatham, New Jersey, 07928 New Jersey
1 3420 W. MacArthur Blvd. Ste. C, Santa Ana, Cal... California
2 2016 Chalk Rd, Wake Forest, North Carolina, 27587 North Carolina
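The two approaches can disagree when a state name also appears inside another field. The sketch below uses a made-up address (not from the question) whose street name contains "New York": the regex returns the first name found anywhere in the string, while the split version only accepts a part that equals a state name exactly.

```python
import re
import pandas as pd

state = ["New York", "New Jersey", "California", "North Carolina"]

# Hypothetical address: the street name contains another state's name
df = pd.DataFrame({"address": ["12 New York Ave, Newark, New Jersey, 07102"]})

# Regex approach: first match anywhere in the string
pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in state)
by_regex = df["address"].str.extract("(" + pat + ")", expand=False)

# Split approach: keep only parts that are exactly a state name,
# then forward-fill across columns so the last column holds the last match
parts = df["address"].str.split(", ", expand=True)
by_split = parts.where(parts.isin(state)).ffill(axis=1).iloc[:, -1]

print(by_regex.tolist())  # ['New York']
print(by_split.tolist())  # ['New Jersey']
```

Which behaviour is preferable depends on how consistently the addresses are comma-delimited.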
Answer 1 (score: 1)
Try:
state = ['New York','New Jersey','California','North Carolina']
def search_states(df):
    # with axis=1, `df` here is a single row, so df['address'] is a plain string
    for i in state:
        if i in df['address']:
            df['states'] = i
            break
        else:
            continue
    return df
df = df.apply(search_states, axis = 1)
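This approach is said to be faster on larger data as well, although row-wise `apply` is usually slower than vectorized string methods such as `str.extract`, so it is worth timing on your own data.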
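A compact variant of the same substring check, shown only as a sketch: it maps a helper over the `address` column instead of applying a function to whole rows. The helper name `first_state` is made up for illustration, and no speed claim is implied.

```python
import pandas as pd

df = pd.DataFrame({"address": [
    "58 Chatham Street, Chatham, New Jersey, 07928",
    "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California",
    "2016 Chalk Rd, Wake Forest, North Carolina, 27587",
]})
state = ["New York", "New Jersey", "California", "North Carolina"]

def first_state(address, names=state):
    """Return the first name in `names` that occurs in `address`, else None."""
    return next((s for s in names if s in address), None)

# Only the address column is visited, one string at a time
df["states"] = df["address"].map(first_state)
print(df["states"].tolist())
# ['New Jersey', 'California', 'North Carolina']
```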