I am working with raw data and trying to extract the city and state from an address column.
Address
xxx Richardson, TX
yyy Plano, TX
xxyy Wylie, TX WO-65758
zzz Waxahachie, TX WO-999786
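I took the last two elements of the split column, but how do I handle rows like the 3rd and 4th, which have extra text after the state, in a large dataset of 30k records?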
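Answer 0 (score: 0)
Could it be as simple as splitting the string on the comma, then taking the last token of the first part and the first token of the second part?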
addresses = ["xxx Richardson, TX", "xxyy Wylie, TX WO-65758"]
for a in addresses:
    asplit = a.split(",")
    city = asplit[0].split()[-1]   # last token before the comma
    state = asplit[1].split()[0]   # first token after the comma
    print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX
Example
If you have the following DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        'Address': [
            'xxx Richardson, TX',
            'yyy Plano, TX',
            'xxyy Wylie, TX WO-65758',
            'zzz Waxahachie, TX WO-999786'
        ]
    }
)
You can define a splitting function:
def extract_city_state(a):
    asplit = a.split(",")
    city = asplit[0].split()[-1]
    state = asplit[1].split()[0]
    return city, state
Then apply() it to the Address column, which returns two new columns, and join() them back onto the original DataFrame:
df.join(
    df['Address'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
)
#                        Address        City State
#0            xxx Richardson, TX  Richardson    TX
#1                 yyy Plano, TX       Plano    TX
#2       xxyy Wylie, TX WO-65758       Wylie    TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie    TX
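With 30k records, some rows may lack a comma or be empty, in which case the function above raises an IndexError. Below is a minimal sketch of a more defensive variant; the name `extract_city_state_safe` and the choice to return `(None, None)` for unparseable rows are assumptions for illustration, not part of the original answer.

```python
def extract_city_state_safe(address):
    """Return (city, state), or (None, None) when the row cannot be parsed."""
    # Missing values (e.g. NaN) are not strings, so flag them instead of failing.
    if not isinstance(address, str):
        return None, None
    # Split on the first comma only: the city precedes it, the state follows it.
    before, sep, after = address.partition(",")
    left_tokens = before.split()
    right_tokens = after.split()
    # No comma, or nothing usable on one side: flag the row instead of raising.
    if not sep or not left_tokens or not right_tokens:
        return None, None
    return left_tokens[-1], right_tokens[0]


# The last two rows are hypothetical malformed inputs.
for row in ["xxyy Wylie, TX WO-65758", "no comma here", ""]:
    print(repr(row), "->", extract_city_state_safe(row))
```

It can be passed to apply() in the same way as extract_city_state, and the rows that come back as None can then be inspected by hand.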
If that doesn't work, how about matching with a regular expression?
This should work:
import re

pattern = r"[A-Z][a-z]+,\s[A-Z]{2}"
for a in addresses:
    matches = re.finditer(pattern, a, re.MULTILINE)
    for match in matches:
        city, state = match.group().replace(",", "").split()
        print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX
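The pattern matches:
[A-Z]: one uppercase letter
[a-z]+: one or more lowercase letters
,\s: a comma followed by a whitespace character
[A-Z]{2}: two uppercase letters

Example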
df.join(
    df['Address'].str.extract(
        r"((?P<City>[A-Z][a-z]+),\s(?P<State>[A-Z]{2}))",
        expand=False
    )[["City", "State"]]
)
#                        Address        City State
#0            xxx Richardson, TX  Richardson    TX
#1                 yyy Plano, TX       Plano    TX
#2       xxyy Wylie, TX WO-65758       Wylie    TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie    TX
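One limitation worth checking before running this on real data: `[A-Z][a-z]+` matches a single capitalized word, so a multi-word city is captured only by its last word. The sketch below uses the same pattern with plain `re` to show this. The helper name `find_city_state` and the last two sample rows are hypothetical.

```python
import re

# Same pattern as above, with named groups so each part can be read by name.
CITY_STATE = re.compile(r"(?P<city>[A-Z][a-z]+),\s(?P<state>[A-Z]{2})")


def find_city_state(address):
    """Return (city, state) from the first match, or None if nothing matches."""
    match = CITY_STATE.search(address)
    if match is None:
        return None
    return match.group("city"), match.group("state")


samples = [
    "zzz Waxahachie, TX WO-999786",  # trailing work-order text is ignored
    "aaa San Antonio, TX",           # hypothetical row: only "Antonio" is captured
    "bbb no state here",             # hypothetical row: no match at all
]
for sample in samples:
    print(sample, "->", find_city_state(sample))
```

Whether a multi-word city can be recovered at all depends on how the street part of the address is formatted, which the sample data does not show.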
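Answer 1 (score: 0)
I don't quite understand what you are trying to get. Do you just want to split the column and take the last two elements as the city and state? Maybe the code below can help.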
df["Address"].apply(lambda x: "".join(x.split()[1:]))
Update: (I changed the data in the second row so that the city contains a space)
df2 = df["Address"].apply(lambda x: x.split(","))
city = df2.apply(lambda x: " ".join(x[0].split()[1:]))
state = df2.apply(lambda x: x[1].split()[0])
result = pd.DataFrame(zip(city, state), columns=["city", "state"])
Result:
Out[13]:
         city state
0  Richardson    TX
1   Pla Plano    TX
2       Wylie    TX
3  Waxahachie    TX
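Answer 2 (score: 0)
Rather than reinventing the wheel, I would consider using an existing address-parsing library. There is more than one, so you may need to compare a few. https://github.com/datamade/usaddress is one I have used in the past.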