Question

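Is it possible to replace strings in one column with the corresponding strings from another column of a pandas DataFrame using only pandas.Series.str methods? "No" is an acceptable answer, as long as it comes with the pandas version and the relevant section of the documentation.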

Here is an example:

import pandas as pd
# version >= 0.19.2
df = pd.DataFrame(
    {
        'names': ['alice', 'bob', 'catherine', 'slagathor'],
        'hobbies': [
            'alice likes to knit',
            'bob likes to bowl',
            'plays with her cats',
            'slagathor burniates peasants for fun'
        ]
    }
)

def clean(df: pd.DataFrame) -> pd.DataFrame: ... # do the substitutions

assert all(
    clean(df).hobbies == pd.Series([
        'likes to knit',
        'likes to bowl',
        'plays with her cats',
        'burniates peasants for fun'
    ])
)

In this case, I would like to strip the strings in the names column out of the hobbies column, using something like

df.hobbies.str.replace('(' + df.names + r'\s*)?', '')  # doesn't work

So far, I have had to resort to

import re
df['replaced'] = pd.Series(
    re.sub(f'^{df.names[i]} ?', '', df.hobbies[i]) for i in df.index
)

as in the answers to "Replace values from one column with another column Pandas DataFrame".

Answer 1

str.replace is a Series method, so it can be applied to each element of a column, but it cannot refer to any other column.

So you have to import re and use re.sub inside a function applied to each row (so that the function can refer to the other columns of the current row).

Your task can be done in a single instruction:

df['replaced'] = df.apply(lambda row: re.sub(
    '^' + row.names + r'\s*', '', row.hobbies), axis=1)
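
For reference, here is a self-contained sketch of the same idea wrapped in the clean function from the question. The call to re.escape is my addition, not part of the one-liner above; it assumes the names should be matched literally even if they contain regex metacharacters.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        'names': ['alice', 'bob', 'catherine', 'slagathor'],
        'hobbies': [
            'alice likes to knit',
            'bob likes to bowl',
            'plays with her cats',
            'slagathor burniates peasants for fun',
        ],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a leading name removed from each hobby."""
    out = df.copy()
    out['hobbies'] = out.apply(
        # re.escape is an assumption: treat each name as literal text.
        lambda row: re.sub(
            '^' + re.escape(row['names']) + r'\s*', '', row['hobbies']),
        axis=1,
    )
    return out


cleaned = clean(df)
```

Rows whose hobby does not start with the name (the catherine row here) are left unchanged, because the pattern is anchored with ^.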

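This solution runs faster than building a Series with a for loop and assigning it to a column, because apply takes care of iterating over the DataFrame, so the applied function is responsible only for generating the value to put in the current row.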

Another important factor for execution speed is that you avoid locating the current row by index on every iteration of the loop.
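
If you want to avoid index lookups without going through apply, one alternative (not the method from this answer, just a sketch) is to iterate over the two columns in parallel with zip. The re.escape call is again an assumption that names are literal text. I have not timed it, so treat any speed difference as something to measure on your own data.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        'names': ['alice', 'bob', 'catherine', 'slagathor'],
        'hobbies': [
            'alice likes to knit',
            'bob likes to bowl',
            'plays with her cats',
            'slagathor burniates peasants for fun',
        ],
    }
)

# zip walks both columns positionally, so no per-row index lookup is needed.
df['replaced'] = [
    re.sub('^' + re.escape(name) + r'\s*', '', hobby)
    for name, hobby in zip(df['names'], df['hobbies'])
]
```

Assigning a plain list is positional, so this does not depend on what the DataFrame index looks like.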

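Your code will also fail if the index is anything other than consecutive numbers starting from 0. Try, for example, creating your DataFrame with the index=np.arange(1, 5) parameter.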
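
A small sketch of that experiment follows. As far as I can tell, "fail" here means misaligned values rather than an exception: the Series built in the loop gets a default 0-based index, and the column assignment aligns on index labels. The column names loop and apply are made up for the comparison.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'names': ['alice', 'bob', 'catherine', 'slagathor'],
        'hobbies': [
            'alice likes to knit',
            'bob likes to bowl',
            'plays with her cats',
            'slagathor burniates peasants for fun',
        ],
    },
    index=np.arange(1, 5),  # labels 1..4 instead of the default 0..3
)

# Loop approach from the question: the new Series is indexed 0..3.
loop_result = pd.Series(
    [re.sub(f'^{df.names[i]} ?', '', df.hobbies[i]) for i in df.index]
)
df['loop'] = loop_result  # aligned on labels, so values land on the wrong rows

# apply keeps the DataFrame's own index, so the rows stay aligned.
df['apply'] = df.apply(
    lambda row: re.sub('^' + row['names'] + r'\s*', '', row['hobbies']),
    axis=1,
)
```

With this index, the loop column ends up shifted by one row with a missing value in the last row, while the apply column matches the expected output.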

Replace strings in one column with strings from another column
