Question

I have a DataFrame with several thousand rows and two columns, like this:

                                          string       state
0      the best new york cheesecake rochester ny          ny
1      the best dallas bbq houston tx random str          tx
2   la jolla fish shop of san diego san diego ca          ca
3                                   nothing here          dc

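For each state, I have a regex of all its city names (lowercase), structured like (city1|city2|city3|...), where the order of the cities is arbitrary (but can be changed if needed). For example, the regex for the state of New York contains 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).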

I want to know which city is mentioned last in each string (for observations 1, 2, 3, 4, I would want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).

I started with str.extract and tried to think of things like reversing the string, but I have reached an impasse.

Thanks very much for your help!

Answer 1

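import re

# Alternation of city names; the rest of the list is elided here
cities = r"new york|dallas|..."

def last_match(s):
    # findall returns all non-overlapping matches in left-to-right order
    found = re.findall(cities, s)
    # take the last match, or an empty string when nothing matched
    return found[-1] if found else ""

df['string'].apply(last_match)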
#0    rochester
#1      houston
#2    san diego
#3
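
Since the question mentions a separate regex per state, here is a minimal sketch of how the same findall idea might be extended to pick the pattern by state. The STATE_PATTERNS dict and the last_city helper are hypothetical names introduced for illustration, and the city lists are only the examples given in the question, not the asker's real data.

```python
import re

# Hypothetical per-state patterns, built from the cities named in the question.
STATE_PATTERNS = {
    "ny": re.compile(r"new york|rochester"),
    "tx": re.compile(r"dallas|houston"),
    "ca": re.compile(r"san diego|la jolla"),
}

def last_city(text, state):
    """Return the last city of `state` found in `text`, or None if there is none."""
    pattern = STATE_PATTERNS.get(state)
    if pattern is None:
        return None  # no city list known for this state
    found = pattern.findall(text)  # matches in left-to-right order
    return found[-1] if found else None

# Sample rows copied from the question.
rows = [
    ("the best new york cheesecake rochester ny", "ny"),
    ("the best dallas bbq houston tx random str", "tx"),
    ("la jolla fish shop of san diego san diego ca", "ca"),
    ("nothing here", "dc"),
]
results = [last_city(text, state) for text, state in rows]
print(results)
```

On the DataFrame itself this could presumably be applied row-wise with df.apply(lambda r: last_city(r['string'], r['state']), axis=1).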

Answer 2

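You can use str.findall, but it returns an empty list when there is no match, so an apply is needed to substitute a placeholder. Finally, select the last item of each list with .str[-1]: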

cities = r"new york|dallas|rochester|houston|san diego"

print (df['string'].str.findall(cities)
                   .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                   .str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object

(Corrected > 1 to >= 1, so that rows with exactly one match keep their value.)
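
As a possible simplification of the findall approach above: .str[-1] should give NaN for an empty list, so the apply step could likely be replaced by fillna. This is a sketch under that assumption, using the sample data from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego"

# findall gives a list per row; .str[-1] picks the last item and is
# expected to yield NaN for an empty list, which fillna then replaces.
last = df["string"].str.findall(cities).str[-1].fillna("no match val")
print(last)
```

Dropping the fillna call would leave NaN for rows without a match, which is what the question asks for.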

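Another solution is a bit of a hack: use radd to prepend the no-match string to the start of every string, and add the same string to the cities pattern, so that findall always returns at least one item: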

a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a

print (df['string'].radd(a).str.findall(cities).str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object
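
Since the question started from str.extract, a further option worth sketching is a greedy prefix: putting .* in front of the capture group makes the regex engine consume as much of the string as it can, so the group ends up on the rightmost city. This is not from either answer above, only a sketch of a standard regex idiom on the question's sample data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego"

# Greedy ".*" backtracks from the end of the string, so the capture
# group matches the last city; rows without any city come back as NaN.
last = df["string"].str.extract(r".*(" + cities + r")", expand=False)
print(last)
```

Note that . does not match newlines by default, so multi-line strings would need the re.DOTALL flag passed through the flags argument.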

str.extract starting from the back in pandas DataFrame
