Question

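I have a dataframe with multiple string columns. I want to apply a string method that is valid for a Series to several columns of the dataframe. I was hoping for something like this: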

df = pd.DataFrame({'A': ['123f', '456f'], 'B': ['789f', '901f']})
df

Out[15]: 
      A     B
0  123f  789f
1  456f  901f

df = df.str.rstrip('f')
df
Out[16]: 
     A    B
0  123  789
1  456  901

Obviously this doesn't work, because str operations are only valid on pandas Series objects. What is the appropriate / most pandas-y way to do this?

Answer 1

The function rstrip works on a Series, so you can use apply:

df = df.apply(lambda x: x.str.rstrip('f'))

Or create a Series with stack and unstack at the end:

df = df.stack().str.rstrip('f').unstack()

Or use applymap:

df = df.applymap(lambda x: x.rstrip('f'))

Finally, if you need to apply the function to only some columns:

# add the column names to a list
cols = ['A']
# any one of the following lines is enough
df[cols] = df[cols].apply(lambda x: x.str.rstrip('f'))
df[cols] = df[cols].stack().str.rstrip('f').unstack()
df[cols] = df[cols].applymap(lambda x: x.rstrip('f'))
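
For reference, here is a small self-contained sketch that runs the three approaches on the sample frame from the question. The name elementwise is my own, not part of pandas. It is there only on the assumption that your pandas version may offer DataFrame.map (the newer name for applymap, as of pandas 2.1), with a fallback to applymap when it does not.

```python
import pandas as pd

df = pd.DataFrame({'A': ['123f', '456f'], 'B': ['789f', '901f']})

# Hypothetical helper name: use DataFrame.map when it exists (pandas >= 2.1),
# otherwise fall back to the older DataFrame.applymap.
elementwise = getattr(pd.DataFrame, 'map', None) or pd.DataFrame.applymap

via_apply = df.apply(lambda col: col.str.rstrip('f'))        # column by column
via_stack = df.stack().str.rstrip('f').unstack()             # one long Series, then back
via_elementwise = elementwise(df, lambda x: x.rstrip('f'))   # cell by cell

# Only some columns: assign the result back to the same column subset.
cols = ['A']
partial = df.copy()
partial[cols] = partial[cols].apply(lambda col: col.str.rstrip('f'))
```

The cell-by-cell version calls plain Python str.rstrip, so it will raise on non-string cells such as NaN, while the .str accessor passes missing values through.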

Answer 2

You can use replace with regex=True to mimic the behavior of rstrip, and it can be applied to the entire DataFrame:

df.replace(r'f$', '', regex=True)

     A    B
0  123  789
1  456  901

Since rstrip takes a set of characters to strip, you can easily extend this by using a character class in the pattern.
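
A minimal sketch of what such an extension could look like. The sample data and the patterns other than r'f$' are my own illustration, not taken from the answer. Note that r'f$' removes only one trailing 'f', so a '+' quantifier is needed to match what rstrip does with repeated characters.

```python
import pandas as pd

# Illustrative data with repeated and mixed trailing characters
df = pd.DataFrame({'A': ['123ff', '456g'], 'B': ['789fg', '901']})

one_f = df.replace(r'f$', '', regex=True)        # removes a single trailing 'f'
all_f = df.replace(r'f+$', '', regex=True)       # behaves like rstrip('f')
f_or_g = df.replace(r'[fg]+$', '', regex=True)   # behaves like rstrip('fg')
```

If the characters to strip include regex metacharacters, they would need escaping (for example with re.escape) before going into the character class.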

Answer 3

You can use a dictionary comprehension and feed it to the pd.DataFrame constructor:

res = pd.DataFrame({col: [x.rstrip('f') for x in df[col]] for col in df})

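Currently, the pandas str methods are inefficient. Regex is even less efficient, but more easily extended. As always, you should test with your own data.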

# Benchmarking on Python 3.6.0, Pandas 0.19.2

def jez1(df):
    return df.apply(lambda x: x.str.rstrip('f'))

def jez2(df):
    return df.applymap(lambda x: x.rstrip('f'))

def jpp(df):
    return pd.DataFrame({col: [x.rstrip('f') for x in df[col]] for col in df})

def user3483203(df):
    return df.replace(r'f$', '', regex=True)

df = pd.concat([df]*10000)

%timeit jez1(df)         # 33.1 ms per loop
%timeit jez2(df)         # 29.9 ms per loop
%timeit jpp(df)          # 13.2 ms per loop
%timeit user3483203(df)  # 42.9 ms per loop
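
The timings above come from IPython's %timeit on old versions, so treat them as indicative only. If you want to rerun the comparison outside IPython, something along these lines should work. It is only a sketch: the candidates dict and its labels are my own naming, and I pass ignore_index=True so that all results share the same index and can be compared.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'A': ['123f', '456f'], 'B': ['789f', '901f']})
# ignore_index=True gives a clean 0..n-1 index (an addition for comparability)
big = pd.concat([base] * 10000, ignore_index=True)

# Hypothetical labels mapped to the approaches from the answers
candidates = {
    'apply + str.rstrip': lambda d: d.apply(lambda col: col.str.rstrip('f')),
    'dict comprehension': lambda d: pd.DataFrame(
        {col: [x.rstrip('f') for x in d[col]] for col in d}),
    'replace with regex': lambda d: d.replace(r'f$', '', regex=True),
}

results = {}
timings = {}
for name, func in candidates.items():
    results[name] = func(big)  # keep the output to check the approaches agree
    # best of 3 single runs, in seconds
    timings[name] = min(timeit.repeat(lambda: func(big), number=1, repeat=3))
    print(f'{name}: {timings[name] * 1000:.1f} ms')
```

The relative ordering may well differ from the numbers above depending on your pandas version and data, which is the point of measuring it yourself.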

How to apply string methods to multiple columns of a dataframe

3 answers: