Question

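I am trying to replace values in multiple rows of a pandas DataFrame with values from another DataFrame.

Suppose df1 has 10,000 rows of customer_id, and I want to replace some of those customer_id values with the 3,000 values in df2.

For illustration, let's generate the DataFrames (below).

Say the 10 rows in df1 stand for the 10,000 rows, and the 3 rows in df2 stand for the 3,000 values.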

import numpy as np
import pandas as pd
np.random.seed(42)

# Create df1 with unique values
arr1 = np.arange(100,200,10)
np.random.shuffle(arr1)
df1 = pd.DataFrame(data=arr1, 
                   columns=['customer_id'])

# Create df2 for new unique_values
df2 = pd.DataFrame(data = [1800, 1100, 1500],
                   index = [180, 110, 150], # this is customer_id column on df1
                   columns = ['customer_id_new'])

I want to replace 180 with 1800, 110 with 1100, and 150 with 1500.

I know we can do the following...

# Replace multiple values
replace_values = {180: 1800, 110: 1100, 150: 1500}
df1_replaced = df1.replace({'customer_id': replace_values})

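This works fine if I only have a few rows...

But if I have thousands of rows to replace, how can I do it without typing in the values to change one at a time?

Edit: To clarify, I don't need to use replace. Anything that replaces those values in df1 with the values from df2 in the fastest, most efficient way is fine.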

Answer 1

df1['customer_id'] = df1['customer_id'].replace(df2['customer_id_new'])

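Alternatively, you can do it in place. Note that this is an in-place call on a selected column (chained assignment), so it may leave df1 unchanged when pandas Copy-on-Write is enabled; the assignment form above is the safer choice.

df1['customer_id'].replace(df2['customer_id_new'], inplace=True)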
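As a quick check of why passing df2['customer_id_new'] directly works, here is a minimal sketch that rebuilds the question's frames (with the row order written out explicitly rather than shuffled, which is an assumption made for reproducibility) and compares the result with the hand-written dictionary from the question.

```python
import pandas as pd

# The question's example data, with df1's row order written out explicitly
df1 = pd.DataFrame({'customer_id': [180, 110, 150, 100, 170,
                                    120, 190, 140, 130, 160]})
df2 = pd.DataFrame(data=[1800, 1100, 1500],
                   index=[180, 110, 150],  # old customer_id values
                   columns=['customer_id_new'])

# When a Series is passed to replace(), its index supplies the old values
# and its data supplies the new ones, much like a dictionary.
result = df1['customer_id'].replace(df2['customer_id_new'])

# Cross-check against the manual mapping from the question
expected = df1['customer_id'].replace({180: 1800, 110: 1100, 150: 1500})
print(result.equals(expected))
```

Because the result is assigned back rather than modified in place, this form does not depend on whether the column selection is a view or a copy.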

Answer 2

You can use map with a pd.Series:

df1['customer_id'] = df1['customer_id'].map(df2.squeeze()).fillna(df1['customer_id'])

Or

df1['customer_id'] = df1['customer_id'].map(df2['customer_id_new']).fillna(df1['customer_id'])

Output:

   customer_id
0       1800.0
1       1100.0
2       1500.0
3        100.0
4        170.0
5        120.0
6        190.0
7        140.0
8        130.0
9        160.0
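The values in the output above are floats because map yields NaN for every ID that has no entry in df2, which forces a float column before fillna runs. If integer IDs matter, one possible follow-up (a sketch, assuming every ID is a whole number) is to cast back to the original dtype afterwards. The row order in df1 is written out explicitly here for reproducibility.

```python
import pandas as pd

df1 = pd.DataFrame({'customer_id': [180, 110, 150, 100, 170,
                                    120, 190, 140, 130, 160]})
df2 = pd.DataFrame({'customer_id_new': [1800, 1100, 1500]},
                   index=[180, 110, 150])

original_dtype = df1['customer_id'].dtype

# IDs missing from df2's index come back as NaN, so the column turns float
mapped = df1['customer_id'].map(df2['customer_id_new'])

# Put the untouched IDs back, then restore the integer dtype
df1['customer_id'] = mapped.fillna(df1['customer_id']).astype(original_dtype)
print(df1['customer_id'].dtype)
```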

Answer 3

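Sticking with your original replace approach, you can simplify it with to_dict, which builds the mapping dictionary so you don't have to write it by hand: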

df1["customer_id"] = df1["customer_id"].replace(df2["customer_id_new"].to_dict())

>>> df1
   customer_id
0         1800
1         1100
2         1500
3          100
4          170
5          120
6          190
7          140
8          130
9          160
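Since the question asks for the fastest option, it may be worth timing the dictionary-based replace against map on data of a similar size. The sketch below uses synthetic IDs (the ranges and the +1,000,000 offset are assumptions chosen only so that old and new values never overlap); the timings depend on the pandas version and the machine, so no particular outcome is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data at the scale described in the question (assumed values):
# 10,000 unique IDs, of which 3,000 receive a new value.
ids = np.arange(100_000, 110_000)
s = pd.Series(ids, name='customer_id')
df2 = pd.DataFrame({'customer_id_new': ids[:3000] + 1_000_000},
                   index=ids[:3000])

mapping = df2['customer_id_new'].to_dict()

def with_replace():
    return s.replace(mapping)

def with_map():
    # Unmatched IDs become NaN, so fill them back in and restore the dtype
    return s.map(mapping).fillna(s).astype(s.dtype)

# Both approaches should agree before their timings are worth comparing
assert with_replace().equals(with_map())

print('replace:', timeit.timeit(with_replace, number=3))
print('map:    ', timeit.timeit(with_map, number=3))
```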

Answer 4

In addition to trying the helpful answers above, if you have a multi-core processor, you could try parallelizing the DataFrame.

For example:

import numpy as np
import pandas as pd
import seaborn as sns
from multiprocessing import Pool

num_partitions = 10  # number of partitions to split the DataFrame into
num_cores = 4  # number of cores on your machine

# Example data; any DataFrame can be passed to the function below
iris = sns.load_dataset('iris')

def parallelize_dataframe(df, func):
    # Split the DataFrame into chunks, apply func to each chunk in a
    # worker process, then concatenate the results.
    df_split = np.array_split(df, num_partitions)
    with Pool(num_cores) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

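You can pass a function that applies the replace method as the 'func' argument. Let me know if this helps. If there are any mistakes, please leave a comment.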
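To make the 'func' argument concrete, here is one possible shape for it. The helper name replace_ids and the two-way split are hypothetical, chosen for illustration. The sketch applies the function chunk by chunk in a plain loop rather than through a Pool, which is the same work each worker process would do on its chunk.

```python
from functools import partial

import pandas as pd

def replace_ids(chunk, mapping):
    # Hypothetical helper: apply the replacement to one chunk of the DataFrame
    chunk = chunk.copy()
    chunk['customer_id'] = chunk['customer_id'].replace(mapping)
    return chunk

df1 = pd.DataFrame({'customer_id': [180, 110, 150, 100, 170,
                                    120, 190, 140, 130, 160]})
df2 = pd.DataFrame({'customer_id_new': [1800, 1100, 1500]},
                   index=[180, 110, 150])
mapping = df2['customer_id_new'].to_dict()

# partial() binds the mapping, leaving a function of a single chunk,
# which is the shape Pool.map expects for func.
func = partial(replace_ids, mapping=mapping)

# Sequential stand-in for the pool: process each chunk, then concatenate
chunks = [df1.iloc[:5], df1.iloc[5:]]
result = pd.concat([func(c) for c in chunks])
print(result['customer_id'].tolist())
```

Defining the helper at module level matters if it is later handed to a process pool, since worker processes need to be able to import it.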

Thanks!

Fastest way to replace multiple values in a pandas DataFrame with values from another DataFrame
