Question

I have two columns like this:

                                       string                    s
0    the best new york cheesecake new york ny             new york
1               houston public school houston              houston

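I want to remove the last occurrence of s from string. For context, my DataFrame has hundreds of thousands of rows. I know about str.replace and str.rfind, but nothing that combines the two the way I need, and I'm drawing a blank trying to improvise a solution.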

Thanks in advance for any help!

Answer 1

You can use rsplit and join:

df.apply(lambda x: ''.join(x['string'].rsplit(x['s'],1)),axis=1)

Output:

0    the best new york cheesecake  ny
1              houston public school 
dtype: object
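
To see why this works, here is a minimal sketch of the same rsplit/join trick on plain strings. The helper name remove_last is hypothetical and used only for illustration; the sample strings are taken from the question.

```python
def remove_last(text, sub):
    """Hypothetical helper: remove the last occurrence of `sub` from `text`."""
    # rsplit with maxsplit=1 splits only at the right-most match,
    # so joining the two pieces with '' drops exactly that match.
    return ''.join(text.rsplit(sub, 1))

# repr() makes the leftover whitespace visible.
print(repr(remove_last('the best new york cheesecake new york ny', 'new york')))
print(repr(remove_last('houston public school houston', 'houston')))
```

Note that the spaces surrounding the removed match are left behind, and a string that does not contain the substring comes back unchanged.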

Edit:

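# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['string'] = df.apply(lambda x: ''.join(x['string'].rsplit(x['s'],1)),axis=1).str.replace(r'\s\s', ' ', regex=True)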

print(df)

Output:

                            string         s  third
0  the best new york cheesecake ny  new york      1
1           houston public school    houston      1
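
Since the question mentions str.rfind, the same result can also be sketched with rfind and slicing, looping over the two columns with zip rather than apply. This is only an illustration under assumptions: drop_last is a hypothetical helper, and the two lists stand in for df['string'] and df['s'].

```python
def drop_last(text, sub):
    """Hypothetical helper: remove the right-most occurrence of `sub` from `text`."""
    pos = text.rfind(sub)      # index of the last match, or -1 if absent
    if pos == -1 or not sub:
        return text            # nothing to remove
    return text[:pos] + text[pos + len(sub):]

# Sample values standing in for df['string'] and df['s'].
strings = ['the best new york cheesecake new york ny',
           'houston public school houston']
subs = ['new york', 'houston']

# split() followed by ' '.join(...) collapses repeated spaces and trims the ends.
cleaned = [' '.join(drop_last(t, s).split()) for t, s in zip(strings, subs)]
print(cleaned)
```

With a DataFrame, the list comprehension would iterate over zip(df['string'], df['s']) and the result could be assigned back to df['string']. Whether this beats apply on hundreds of thousands of rows is worth timing on your own data.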

Answer 2

Option 1
Vectorized rsplit with a comprehension

from numpy.core.defchararray import rsplit

v = df.string.values.astype(str)
s = df.s.values.astype(str)

df.assign(string=[' '.join([x.strip() for x in y]) for y in rsplit(v, s, 1)])

                            string         s
0  the best new york cheesecake ny  new york
1           houston public school    houston

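Option 2
Using re.sub
The regex here looks for an occurrence of the value in s that is not followed by another occurrence of the same value.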

import re

v = df.string.values.astype(str)
s = df.s.values.astype(str)
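# re.escape keeps regex metacharacters in s (such as '.' or '+') from being interpreted as pattern syntax
f = lambda i, j: re.sub(r' *{0} *(?!.*{0}.*)'.format(re.escape(i)), ' ', j).strip()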

df.assign(string=[f(i, j) for i, j in zip(s, v)])

                            string         s
0  the best new york cheesecake ny  new york
1            houston public school   houston
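
The negative lookahead can be tried on plain strings without a DataFrame. The following is a small sketch, not a definitive implementation: remove_last_regex is a hypothetical name, and the re.escape call is an assumption added here so that values containing regex metacharacters are matched literally.

```python
import re

def remove_last_regex(sub, text):
    """Hypothetical helper: drop the last occurrence of `sub` via a negative lookahead."""
    # (?!.*sub) rejects any match that still has another `sub` somewhere after it,
    # so only the final occurrence (plus adjacent spaces) gets replaced.
    pattern = r' *{0} *(?!.*{0})'.format(re.escape(sub))
    return re.sub(pattern, ' ', text).strip()

print(remove_last_regex('new york', 'the best new york cheesecake new york ny'))
print(remove_last_regex('houston', 'houston public school houston'))
```

Replacing the match with a single space and then stripping is what keeps the words on either side separated without leaving a doubled space.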

str.replace starting from the back in a pandas DataFrame
