Mapping in pandas

Time: 2016-08-10 10:28:05

Tags: python pandas dataframe mapping

I want to map two DataFrames on a common column. Let's say,

My first DataFrame:

>>> df
   Task  Emp
0  1     aa
1  1     bb
2  2     cc

My second DataFrame:

>>> df1
   Task  Days
0  1      12
1  2      23

My requirement is:

>>> Result
   Emp  Days
0  aa   12
1  bb   12
2  cc   23

I am not able to map the DataFrames in pandas. What is the best way to do this for a large number of records?

4 Answers:

Answer 0: (score: 3)

Use map:

df.rename(columns={'Task':'Days'}, inplace=True)
df['Days'] = df['Days'].map(df1.set_index('Task')['Days'])
df = df[['Emp','Days']]
print (df)
  Emp  Days
0  aa    12
1  bb    12
2  cc    23
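
For comparison, the same map-based lookup can be written without renaming a column in place. This is only an equivalent sketch of the idea above (not part of the original answer), and it assumes that 'Task' is unique in df1:

import pandas as pd

df = pd.DataFrame({'Task': [1, 1, 2], 'Emp': ['aa', 'bb', 'cc']})
df1 = pd.DataFrame({'Task': [1, 2], 'Days': [12, 23]})

# build a Task -> Days lookup Series and map it over df['Task']
lookup = df1.set_index('Task')['Days']
result = df[['Emp']].copy()
result['Days'] = df['Task'].map(lookup)
print(result)

This leaves df untouched, whereas the rename-based version above modifies it in place.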

Answer 1: (score: 2)

Try:

pd.concat([d.set_index('Task') for d in [df, df1]], axis=1).reset_index(drop=True)


As @Borja pointed out:

  @piRSquared by the way, your approach using concat would fail with duplicate values. For example: df: Task Emp 5 cc 4 cc 8 cc 3 aa 2 aa 6 aa 4 bb 6 cc; df1: Task Days 5 7 3 0 6 6 7 8 1 9 7 5 9 9 3 3 8

This should not be the answer.
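
To make the failure mode from this comment concrete: after set_index('Task'), duplicate task values become duplicate index labels, and concat along axis=1 cannot align two frames whose indexes are both non-unique (and not identical). A minimal sketch with made-up data; the exact exception depends on the pandas version:

import pandas as pd

df = pd.DataFrame({'Task': [1, 1, 2], 'Emp': ['aa', 'bb', 'cc']})   # duplicate Task values
df1 = pd.DataFrame({'Task': [1, 1], 'Days': [12, 13]})              # duplicate Task values as well

try:
    pd.concat([d.set_index('Task') for d in [df, df1]], axis=1)
except Exception as exc:
    # e.g. "cannot reindex from a duplicate axis" or an InvalidIndexError,
    # depending on the pandas version
    print(type(exc).__name__, exc)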

Answer 2: (score: 2)

I think what you are looking for is merge:

pd.merge(df, df1, on='Task')

Output:

    Emp Days
0   aa  12
1   bb  12
2   cc  23

If your DataFrames are large (and especially if you have duplicate values of 'Task' in both of them), you may run into memory problems. This is not specific to the merge function; it comes from the fact that 'Emp' and 'Days' are joined on every common value of 'Task'.
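
To illustrate the memory point with a tiny made-up example: merge emits one output row for every matching (left, right) pair, so duplicate 'Task' values on both sides multiply the number of rows in the result:

import pandas as pd

df = pd.DataFrame({'Task': [1, 1, 1], 'Emp': ['aa', 'bb', 'cc']})
df1 = pd.DataFrame({'Task': [1, 1], 'Days': [12, 13]})

merged = pd.merge(df, df1, on='Task')
# 3 matching rows on the left x 2 on the right -> 6 rows in the result
print(len(df), len(df1), len(merged))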

Answer 3: (score: 1)

In addition to the answers already given, I ran a small performance test on the solutions by piRSquared, Borja, and jezrael:

import timeit
import numpy as np

setup = """
import pandas as pd
import numpy as np
import string

# number of unique tasks
numTasks = %s
# number of rows in df
numRows = %s

## creating df
# columns for df
col1 = np.random.choice( range(numTasks), numRows )
col2 = np.random.choice( list(string.letters), numRows )

df = pd.DataFrame( { 'Task': col1,
                     'Emp':col2} )
df = df.sort_values( "Task" ).reset_index( drop=True )

# creating df1
tasks = df.Task.unique()
nTasks = len(tasks)

df1 = pd.DataFrame( { 'Task': tasks,
                      'Days': np.random.permutation( range(nTasks) ) } )

"""

solutionPiRSquared = """
pd.concat([d.set_index('Task') for d in [df, df1]], axis=1).reset_index(drop=True)
"""

solutionBorja = """
pd.merge(df, df1, on='Task')
"""

solutionJezrael = """
df.rename(columns={'Task':'Days'}, inplace=True)
df['Days'] = df['Days'].map(df1.set_index('Task')['Days'])
df = df[['Emp','Days']]
"""

numRepetitions = int( 100 )

solutions = [ { 'by': 'piRSquared',
                'code': solutionPiRSquared,
                'min': None,
                'max': None,
                'mean': None,
                'std': None },
              { 'by': 'borja',
                'code': solutionBorja,
                'min': None,
                'max': None,
                'mean': None,
                'std': None },
              { 'by': 'jezrael',
                'code':  solutionJezrael,
                'min': None,
                'max': None,
                'mean': None,
                'std': None  } ]

# test several settings for number of tasks and number of rows
# for each setup each solution is executed <numRepetition> times
# and execution time is measured. min, max, mean, and standard 
# deviation is calculated.
for (NUM_TASKS,NUM_ROWS) in [ (10,1000),
                              (100,10000),
                              (1000,10000),
                              (1000,100000),
                              (1000,1000000),
                              (10000,1000000),
                              (100000,1000000)]:
    print "-----------------------------------------"
    print "number of rows:",NUM_ROWS
    print "number of tasks:",NUM_TASKS
    print

    for solution in solutions:
        #print "solution by",solution['by']
        result = np.array( timeit.repeat( solution["code"], setup=setup % (NUM_TASKS,NUM_ROWS), number=1, repeat=numRepetitions ) )

        solution['min'] = result.min()
        solution['max'] = result.max()
        solution['mean'] = result.mean()
        solution['std'] = result.std()

    # sort solutions regarding the their mean value
    solutions.sort( key=lambda s: s['mean'] )

    best = solutions[0]['mean']

    # print sorted results along with relative increase of
    # execution time relative to the fastest solution (for current
    # setup
    for idx,solution in enumerate(solutions):
        d = { 'idx': idx+1,
              'rel': "[rel to best: +{:.2f}]".format(100*(solution['mean']-best)/best) if idx>0 else '[best]',
              'by': solution["by"],
              'min': solution["min"],
              'max': solution["max"],
              'mean': solution["mean"],
              'std': solution["std"] }

        print "{idx}. {rel}: solution by {by}".format( **d )
        print "    min: {min:.4f}, mean: {mean:.4f}, std: {std:.4f}, max: {max:.4f})".format( **d )

    print "-----------------------------------------"

The loop tests several settings for numTasks, i.e. the number of unique tasks in df, and numRows, i.e. the number of rows in df, and computes statistics of the execution time. This gives me, on my machine (Intel® Core™2 Duo CPU P8700 @ 2.53 GHz × 2) with python2.7:

-----------------------------------------
number of rows: 1000
number of tasks: 10

1. [best]: solution by borja
    min: 0.0020, mean: 0.0021, std: 0.0001, max: 0.0026)
2. [rel to best: +3.12]: solution by piRSquared
    min: 0.0021, mean: 0.0022, std: 0.0002, max: 0.0030)
3. [rel to best: +14.46]: solution by jezrael
    min: 0.0023, mean: 0.0024, std: 0.0002, max: 0.0032)
-----------------------------------------
-----------------------------------------
number of rows: 10000
number of tasks: 100

1. [best]: solution by piRSquared
    min: 0.0026, mean: 0.0028, std: 0.0002, max: 0.0040)
2. [rel to best: +13.39]: solution by borja
    min: 0.0028, mean: 0.0031, std: 0.0009, max: 0.0119)
3. [rel to best: +23.38]: solution by jezrael
    min: 0.0033, mean: 0.0034, std: 0.0002, max: 0.0043)
-----------------------------------------
-----------------------------------------
number of rows: 10000
number of tasks: 1000

1. [best]: solution by piRSquared
    min: 0.0027, mean: 0.0030, std: 0.0003, max: 0.0044)
2. [rel to best: +5.63]: solution by borja
    min: 0.0030, mean: 0.0031, std: 0.0002, max: 0.0040)
3. [rel to best: +22.01]: solution by jezrael
    min: 0.0034, mean: 0.0036, std: 0.0002, max: 0.0046)
-----------------------------------------
-----------------------------------------
number of rows: 100000
number of tasks: 1000

1. [best]: solution by piRSquared
    min: 0.0092, mean: 0.0099, std: 0.0008, max: 0.0141)
2. [rel to best: +39.06]: solution by borja
    min: 0.0130, mean: 0.0137, std: 0.0009, max: 0.0170)
3. [rel to best: +71.95]: solution by jezrael
    min: 0.0163, mean: 0.0170, std: 0.0006, max: 0.0192)
-----------------------------------------
-----------------------------------------
number of rows: 1000000
number of tasks: 1000

1. [best]: solution by piRSquared
    min: 0.0882, mean: 0.0915, std: 0.0025, max: 0.1013)
2. [rel to best: +50.27]: solution by borja
    min: 0.1256, mean: 0.1375, std: 0.0104, max: 0.1828)
3. [rel to best: +75.97]: solution by jezrael
    min: 0.1557, mean: 0.1610, std: 0.0047, max: 0.1862)
-----------------------------------------
-----------------------------------------
number of rows: 1000000
number of tasks: 10000

1. [best]: solution by piRSquared
    min: 0.0887, mean: 0.0949, std: 0.0059, max: 0.1282)
2. [rel to best: +41.71]: solution by borja
    min: 0.1247, mean: 0.1345, std: 0.0055, max: 0.1621)
3. [rel to best: +84.01]: solution by jezrael
    min: 0.1668, mean: 0.1746, std: 0.0072, max: 0.2146)
-----------------------------------------
-----------------------------------------
number of rows: 1000000
number of tasks: 100000

1. [best]: solution by piRSquared
    min: 0.0959, mean: 0.1006, std: 0.0036, max: 0.1177)
2. [rel to best: +51.91]: solution by borja
    min: 0.1473, mean: 0.1528, std: 0.0047, max: 0.1800)
3. [rel to best: +77.68]: solution by jezrael
    min: 0.1730, mean: 0.1787, std: 0.0059, max: 0.2087)
-----------------------------------------
In this context, concat outperforms merge, which in turn outperforms map.
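
A practical note beyond the original answers: the map and concat approaches implicitly assume that 'Task' is unique in df1 (map in particular fails when the lookup Series has a non-unique index), so it can be worth verifying that before picking the fastest option. A minimal sketch:

import pandas as pd

df1 = pd.DataFrame({'Task': [1, 2], 'Days': [12, 23]})

# Series.is_unique reports whether the column contains duplicate values.
if df1['Task'].is_unique:
    lookup = df1.set_index('Task')['Days']   # safe to use with df['Task'].map(lookup)
    print('unique keys: map/concat are fine')
else:
    print('duplicate keys: prefer pd.merge (at the cost of extra memory)')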