Question

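I have two DataFrames, df1 and df2, each with many columns:

df1 - [2756003 rows x 44 columns]

df2 - [22035 rows x 11 columns]

I need to add a new column to df2 that holds the mean of a target column in df1, computed per group over the columns that df1 and df2 have in common.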

t1 = df1.groupby(['category', 'manufacturer'])
t2 = t1[c1].mean()

Returns:

category  manufacturer
1         2                0.000000
          4                8.796840
          10               2.312407
          19               1.135094
          24               4.355000

t2 stores a multi-index Series, for example:

In [302]: t2[1, 2]
Out[302]: 0.0

If I use an existing index, I get the expected result.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

But if I call t2[410, 332], where 332 is a manufacturer ID that is present in df2 but not in df1, I get

IndexError: arrays used as indices must be of integer (or boolean) type

I would expect to get NaN, as we do from

df2['manufacturer'].map(t2)

if there were only one column.

Answer 1

Use pd.merge to merge df2 and t2:

df2 = pd.merge(df2, t2.reset_index(), on=['category','manufacturer'], how='left')

By default, pd.merge joins on all shared columns. If 'category' and 'manufacturer' are the only columns that df2 and t2.reset_index() have in common, the line above can be simplified to

df2 = pd.merge(df2, t2.reset_index(), how='left')
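
As a quick check of that claim, here is a minimal sketch on made-up toy data. The names left, means and col_mean are hypothetical and do not come from the question. It compares the explicit and implicit forms of the merge:

```python
import pandas as pd

# Hypothetical toy data: the only shared columns are 'category' and 'manufacturer'
left = pd.DataFrame({'category': [1, 1, 2],
                     'manufacturer': [2, 4, 9],
                     'col2': [10, 20, 30]})
means = pd.Series(
    [0.0, 8.5],
    index=pd.MultiIndex.from_tuples([(1, 2), (1, 4)],
                                    names=['category', 'manufacturer']),
    name='col_mean',
)

# Explicit join keys
explicit = pd.merge(left, means.reset_index(),
                    on=['category', 'manufacturer'], how='left')
# No 'on': merge falls back to the columns present in both frames
implicit = pd.merge(left, means.reset_index(), how='left')

print(explicit.equals(implicit))
```

If the two frames shared any other column name, the implicit form would join on that column as well, so spelling out `on` is the safer choice when the column sets may change.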

import numpy as np
import pandas as pd
np.random.seed(2017)

df1 = pd.DataFrame(np.random.randint(4, size=(100,3)), columns=['category', 'manufacturer', 'col'])

df2 = pd.DataFrame(np.random.randint(1, 5, size=(100,3)), columns=['category', 'manufacturer', 'col2'])

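# Mean of the target column for each (category, manufacturer) pair in df1
t1 = df1.groupby(['category', 'manufacturer'])
c1 = 'col'
t2 = t1[c1].mean()
# Name the Series; the name becomes the new column's label after the merge
col = ['foo', 'bar']
str1='_'.join(col)
t2.name = c1+'_'+str1+'_mean'
# Left join: every row of df2 is kept, unmatched keys get NaN
df2 = pd.merge(df2, t2.reset_index(), on=['category','manufacturer'], how='left')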
print(df2.head())

prints

   category  manufacturer  col2  col_foo_bar_mean
0         1             1     2          1.333333
1         3             4     3               NaN
2         4             4     2               NaN
3         3             3     1          1.000000
4         3             2     1          1.777778

Since this is a "left join", rows in df2 that have no corresponding row in t2 are assigned NaN in the columns whose values are missing.
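
For comparison, two other ways to get NaN for a missing key can be sketched as follows. This is not part of the approach above, only a possible alternative on made-up toy data. The names df2_demo, t2_demo and col_mean are hypothetical. Series.get takes a default for a single lookup, and DataFrame.join can match key columns against a MultiIndex:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df2 and t2
df2_demo = pd.DataFrame({'category': [1, 1, 410],
                         'manufacturer': [2, 4, 332],
                         'col2': [10, 20, 30]})
t2_demo = pd.Series(
    [0.0, 8.5],
    index=pd.MultiIndex.from_tuples([(1, 2), (1, 4)],
                                    names=['category', 'manufacturer']),
    name='col_mean',
)

# Single lookup: get() returns the default instead of raising for a missing key
missing = t2_demo.get((410, 332), np.nan)
present = t2_demo.get((1, 2), np.nan)

# Whole column: match df2's key columns against t2's MultiIndex (left join by default)
joined = df2_demo.join(t2_demo, on=['category', 'manufacturer'])
print(joined)
```

The join form needs the Series to have a name, because that name becomes the new column's label.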

How do I return NaN if an index is not found in a multi-index Series?
