I want to map two DataFrames onto each other using a common column. Say
my first DataFrame is:
>>> df
Task Emp
0 1 aa
1 1 bb
2 2 cc
My second DataFrame:
>>> df1
Task Days
0 1 12
1 2 23
What I need is:
>>> Result
Emp Days
0 aa 12
1 bb 12
2 cc 23
I have not been able to do this mapping between DataFrames in pandas. What is the best approach for a large number of records?
Answer 0 (score: 3)
Use map:
df.rename(columns={'Task':'Days'}, inplace=True)
df['Days'] = df['Days'].map(df1.set_index('Task')['Days'])
df = df[['Emp','Days']]
print (df)
Emp Days
0 aa 12
1 bb 12
2 cc 23
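The same idea can be written without renaming columns in place. This is a minimal sketch using the frames from the question; the names `lookup` and `result` are illustrative and not part of the original answer. It assumes each Task appears only once in df1.

```python
import pandas as pd

df = pd.DataFrame({'Task': [1, 1, 2], 'Emp': ['aa', 'bb', 'cc']})
df1 = pd.DataFrame({'Task': [1, 2], 'Days': [12, 23]})

# Lookup table: index is Task, values are Days (Task must be unique in df1).
lookup = df1.set_index('Task')['Days']

# map() replaces each Task value with the matching Days value;
# a Task that is missing from df1 would become NaN.
result = df[['Emp']].assign(Days=df['Task'].map(lookup))
print(result)
```

Because `assign` returns a new frame, the original df keeps its Task column.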
Answer 1 (score: 2)
Try:
pd.concat([d.set_index('Task') for d in [df, df1]], axis=1).reset_index(drop=True)
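As @Borja pointed out:
@piRSquared by the way, your approach using concat will fail when there are duplicate values. For example: df: Task Emp 5 cc 4 cc 8 cc 3 aa 2 aa 6 aa 4 bb 6 cc df1: Task Days 1 5 7 3 0 6 6 7 8 1 9 7 5 9 9 3 3 8
This should not be the answer.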
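One way to act on that caveat is to look for duplicate keys before aligning on the index. The sketch below is a conservative check under stated assumptions: the toy values are illustrative, not the ones from the comment, and it deliberately does not call concat, since which duplicates actually break it may depend on the pandas version.

```python
import pandas as pd

# Toy frames; the values are illustrative, not taken from the comment above.
df = pd.DataFrame({'Task': [5, 4, 4, 6], 'Emp': ['cc', 'cc', 'bb', 'cc']})
df1 = pd.DataFrame({'Task': [4, 5, 6, 6], 'Days': [7, 3, 0, 9]})

# duplicated() marks every repeat of a Task value after its first occurrence.
dup_in_df = df['Task'].duplicated().any()
dup_in_df1 = df1['Task'].duplicated().any()

# Aligning on the index is only unambiguous when the keys are unique.
safe_to_align = not (dup_in_df or dup_in_df1)
print(dup_in_df, dup_in_df1, safe_to_align)
```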
Answer 2 (score: 2)
I think what you are looking for is merge:
pd.merge(df, df1, on='Task')
Output:
Emp Days
0 aa 12
1 bb 12
2 cc 23
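If your DataFrames are large (especially if 'Task' has duplicate values in both of them), you will run into memory problems. This is not specific to the merge function: it comes from the fact that 'Emp' and 'Days' are joined for every matching pair of rows that share a 'Task' value.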
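The row multiplication is easy to see on a toy example. The sketch below uses made-up frames named `left` and `right` (not from the question) in which Task 1 is duplicated on both sides. It also shows merge's `validate` argument as one possible guard. Note that merge keeps the 'Task' column unless you select columns afterwards.

```python
import pandas as pd

# Task 1 appears three times on the left and twice on the right.
left = pd.DataFrame({'Task': [1, 1, 1, 2], 'Emp': ['aa', 'bb', 'cc', 'dd']})
right = pd.DataFrame({'Task': [1, 1, 2], 'Days': [12, 13, 23]})

merged = pd.merge(left, right, on='Task')
# Each left row with Task 1 pairs with each right row with Task 1:
# 3 * 2 rows for Task 1, plus 1 * 1 row for Task 2.
print(len(merged))  # 7

# validate='many_to_one' makes merge raise when the right keys are not unique,
# instead of silently multiplying rows.
try:
    pd.merge(left, right, on='Task', validate='many_to_one')
    guarded = False
except pd.errors.MergeError:
    guarded = True
print(guarded)
```

With a unique 'Task' column in the right-hand frame the result has exactly as many rows as the left-hand frame, which is the situation in the question.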
Answer 3 (score: 1)
In addition to the answers already given, I ran a small performance test on the answers by piRSquared, Borja and jezrael:
import timeit
import numpy as np
setup = """
import pandas as pd
import numpy as np
import string
# number of unique tasks
numTasks = %s
# number of rows in df
numRows = %s
## creating df
# columns for df
col1 = np.random.choice( range(numTasks), numRows )
col2 = np.random.choice( list(string.letters), numRows )
df = pd.DataFrame( { 'Task': col1,
'Emp':col2} )
df = df.sort_values( "Task" ).reset_index( drop=True )
# creating df1
tasks = df.Task.unique()
nTasks = len(tasks)
df1 = pd.DataFrame( { 'Task': tasks,
'Days': np.random.permutation( range(nTasks) ) } )
"""
solutionPiRSquared = """
pd.concat([d.set_index('Task') for d in [df, df1]], axis=1).reset_index(drop=True)
"""
solutionBorja = """
pd.merge(df, df1, on='Task')
"""
solutionJezrael = """
df.rename(columns={'Task':'Days'}, inplace=True)
df['Days'] = df['Days'].map(df1.set_index('Task')['Days'])
df = df[['Emp','Days']]
"""
numRepetitions = int( 100 )
solutions = [ { 'by': 'piRSquared',
'code': solutionPiRSquared,
'min': None,
'max': None,
'mean': None,
'std': None },
{ 'by': 'borja',
'code': solutionBorja,
'min': None,
'max': None,
'mean': None,
'std': None },
{ 'by': 'jezrael',
'code': solutionJezrael,
'min': None,
'max': None,
'mean': None,
'std': None } ]
# test several settings for number of tasks and number of rows
# for each setup each solution is executed <numRepetition> times
# and execution time is measured. min, max, mean, and standard
# deviation is calculated.
for (NUM_TASKS,NUM_ROWS) in [ (10,1000),
                              (100,10000),
                              (1000,10000),
                              (1000,100000),
                              (1000,1000000),
                              (10000,1000000),
                              (100000,1000000)]:
    print "-----------------------------------------"
    print "number of rows:",NUM_ROWS
    print "number of tasks:",NUM_TASKS
    print
    for solution in solutions:
        #print "solution by",solution['by']
        result = np.array( timeit.repeat( solution["code"], setup=setup % (NUM_TASKS,NUM_ROWS), number=1, repeat=numRepetitions ) )
        solution['min'] = result.min()
        solution['max'] = result.max()
        solution['mean'] = result.mean()
        solution['std'] = result.std()
    # sort solutions by their mean value
    solutions.sort( key=lambda s: s['mean'] )
    best = solutions[0]['mean']
    # print sorted results along with the increase in execution
    # time relative to the fastest solution (for the current setup)
    for idx,solution in enumerate(solutions):
        d = { 'idx': idx+1,
              'rel': "[rel to best: +{:.2f}]".format(100*(solution['mean']-best)/best) if idx>0 else '[best]',
              'by': solution["by"],
              'min': solution["min"],
              'max': solution["max"],
              'mean': solution["mean"],
              'std': solution["std"] }
        print "{idx}. {rel}: solution by {by}".format( **d )
        print " min: {min:.4f}, mean: {mean:.4f}, std: {std:.4f}, max: {max:.4f})".format( **d )
    print "-----------------------------------------"
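The script tests several settings for numTasks, i.e. the number of unique tasks in df, and numRows, i.e. the number of rows in df, and computes statistics of the execution times. On my machine (Intel® Core™2 Duo CPU P8700 @ 2.53GHz × 2) with python2.7 this gives: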
-----------------------------------------
number of rows: 1000
number of tasks: 10
1. [best]: solution by borja
min: 0.0020, mean: 0.0021, std: 0.0001, max: 0.0026)
2. [rel to best: +3.12]: solution by piRSquared
min: 0.0021, mean: 0.0022, std: 0.0002, max: 0.0030)
3. [rel to best: +14.46]: solution by jezrael
min: 0.0023, mean: 0.0024, std: 0.0002, max: 0.0032)
-----------------------------------------
-----------------------------------------
number of rows: 10000
number of tasks: 100
1. [best]: solution by piRSquared
min: 0.0026, mean: 0.0028, std: 0.0002, max: 0.0040)
2. [rel to best: +13.39]: solution by borja
min: 0.0028, mean: 0.0031, std: 0.0009, max: 0.0119)
3. [rel to best: +23.38]: solution by jezrael
min: 0.0033, mean: 0.0034, std: 0.0002, max: 0.0043)
-----------------------------------------
-----------------------------------------
number of rows: 10000
number of tasks: 1000
1. [best]: solution by piRSquared
min: 0.0027, mean: 0.0030, std: 0.0003, max: 0.0044)
2. [rel to best: +5.63]: solution by borja
min: 0.0030, mean: 0.0031, std: 0.0002, max: 0.0040)
3. [rel to best: +22.01]: solution by jezrael
min: 0.0034, mean: 0.0036, std: 0.0002, max: 0.0046)
-----------------------------------------
-----------------------------------------
number of rows: 100000
number of tasks: 1000
1. [best]: solution by piRSquared
min: 0.0092, mean: 0.0099, std: 0.0008, max: 0.0141)
2. [rel to best: +39.06]: solution by borja
min: 0.0130, mean: 0.0137, std: 0.0009, max: 0.0170)
3. [rel to best: +71.95]: solution by jezrael
min: 0.0163, mean: 0.0170, std: 0.0006, max: 0.0192)
-----------------------------------------
-----------------------------------------
number of rows: 1000000
number of tasks: 1000
1. [best]: solution by piRSquared
min: 0.0882, mean: 0.0915, std: 0.0025, max: 0.1013)
2. [rel to best: +50.27]: solution by borja
min: 0.1256, mean: 0.1375, std: 0.0104, max: 0.1828)
3. [rel to best: +75.97]: solution by jezrael
min: 0.1557, mean: 0.1610, std: 0.0047, max: 0.1862)
-----------------------------------------
-----------------------------------------
number of rows: 1000000
number of tasks: 10000
1. [best]: solution by piRSquared
min: 0.0887, mean: 0.0949, std: 0.0059, max: 0.1282)
2. [rel to best: +41.71]: solution by borja
min: 0.1247, mean: 0.1345, std: 0.0055, max: 0.1621)
3. [rel to best: +84.01]: solution by jezrael
min: 0.1668, mean: 0.1746, std: 0.0072, max: 0.2146)
-----------------------------------------
-----------------------------------------
number of rows: 1000000
number of tasks: 100000
1. [best]: solution by piRSquared
min: 0.0959, mean: 0.1006, std: 0.0036, max: 0.1177)
2. [rel to best: +51.91]: solution by borja
min: 0.1473, mean: 0.1528, std: 0.0047, max: 0.1800)
3. [rel to best: +77.68]: solution by jezrael
min: 0.1730, mean: 0.1787, std: 0.0059, max: 0.2087)
-----------------------------------------
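The script above is written for Python 2.7 (print statements, `string.letters`). For readers on Python 3, the following is a reduced sketch of the same measurement idea, not a port of the original benchmark: the helper names `make_frames`, `use_merge`, `use_map` and `bench` are made up for illustration, the sizes are small, and the concat variant is left out because its behaviour with duplicate keys may differ between pandas versions. Timings from it are not comparable with the numbers listed above.

```python
import string
import timeit

import numpy as np
import pandas as pd


def make_frames(num_tasks, num_rows, seed=0):
    """Build df (Task, Emp) and df1 (Task, Days) in the spirit of the setup string."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'Task': rng.integers(0, num_tasks, num_rows),
        'Emp': rng.choice(list(string.ascii_letters), num_rows),
    }).sort_values('Task').reset_index(drop=True)
    tasks = df['Task'].unique()
    df1 = pd.DataFrame({'Task': tasks, 'Days': rng.permutation(len(tasks))})
    return df, df1


def use_merge(df, df1):
    return pd.merge(df, df1, on='Task')[['Emp', 'Days']]


def use_map(df, df1):
    return df[['Emp']].assign(Days=df['Task'].map(df1.set_index('Task')['Days']))


def bench(func, df, df1, repeat=5):
    """Return (min, mean, std, max) over `repeat` single runs, in seconds."""
    times = np.array(timeit.repeat(lambda: func(df, df1), number=1, repeat=repeat))
    return times.min(), times.mean(), times.std(), times.max()


df, df1 = make_frames(num_tasks=100, num_rows=10_000)
for name, func in [('merge', use_merge), ('map', use_map)]:
    lo, mean, std, hi = bench(func, df, df1)
    print(f"{name}: min {lo:.4f}, mean {mean:.4f}, std {std:.4f}, max {hi:.4f}")
```

Passing a callable to `timeit.repeat` avoids the string-based setup, at the cost of a small function-call overhead in each measurement.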
In this context,