Question

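I have two dataframes, df1:

Column  Item  Result
------  ----  ------
a       x     3
b       y     4
b       z     5
c       x     6

and df2:

a  b  c
-  -  -
x  x  x
n  y  z
n  n  n

How do I add 'Result' to df2 based on the 'Item' in 'Column'? The expected dataframe df2 is:

a  b  c  Result
-  -  -  ------
x  x  x  3
n  y  z  4
n  n  n  null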

How is the above question a duplicate of 3 questions, 2 of which are marked with 'or', @smci?

Answer 1

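This is a lot more complicated than it first appears. df1 is in long form; it has two entries for 'b'. So first it needs to be stacked/unstacked/pivoted into a 3x3 table of 'Result', where 'Column' becomes the index and the 'Item' values 'x'/'y'/'z' are expanded into a full 3x3 matrix, with NaN for missing values: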

>>> df1_full = df1.pivot(index='Column', columns='Item', values='Result')
>>> df1_full
Item      x    y    z
Column
a       3.0  NaN  NaN
b       NaN  4.0  5.0
c       6.0  NaN  NaN
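The pivot step can be reproduced in isolation. The sketch below assumes a df1 whose rows are reconstructed from the pivot output above (the row order is a guess), and shows that a single cell can then be read by its (Column, Item) labels.

```python
import pandas as pd

# Assumed input: rows reconstructed from the pivot output shown above.
df1 = pd.DataFrame({
    "Column": ["a", "b", "b", "c"],
    "Item":   ["x", "y", "z", "x"],
    "Result": [3, 4, 5, 6],
})

# Long form -> wide form: one row per Column, one column per Item.
df1_full = df1.pivot(index="Column", columns="Item", values="Result")
print(df1_full)

# Any (Column, Item) pair can now be looked up by label.
b_y = df1_full.loc["b", "y"]
print(b_y)
```

Pairs that never occur in df1, such as ('a', 'y'), come out as NaN in the wide table.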

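(Note the unwanted type conversion to float. This happens because numpy has no NaN for integers; see Issue 17013 in pre-pandas-0.22.0 versions. No problem, we'll just cast back to int at the end.)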
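As an aside, and not something the answer itself relies on: newer pandas versions ship a nullable integer dtype, "Int64" (capital I), which can hold integers alongside missing values. A minimal sketch on a made-up Series, assuming a pandas version that supports this dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical float column of the kind the pivot produces.
result_float = pd.Series([3.0, 4.0, np.nan], name="Result")

# Cast to the nullable integer dtype; NaN becomes <NA> instead of forcing float.
result_int = result_float.astype("Int64")
print(result_int)
```

This only works when every non-missing value is a whole number, which is the case here.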

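Now we want to do df1_full.merge(df2, left_index=True, right_on=??)

But first we need another trick/intermediate column to find the leftmost valid value in df2 that corresponds to a valid column name from df1. The value n is invalid, so perhaps we replace it with NaN to make life easier: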

>>> df2_nan = df2.replace('n', np.nan)
>>> df2_nan
     a    b    c
0    x    x    x
1  NaN    y    z
2  NaN  NaN  NaN

>>> df2_nan.columns = [0,1,2]
>>> df2_nan
     0    1    2
0    x    x    x
1  NaN    y    z
2  NaN  NaN  NaN
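Once 'n' has been replaced with NaN, the leftmost valid value in each row can be pulled out by back-filling along the row and taking the first column. This is one possible way to do it, not necessarily the one the answer intends; df2 is rebuilt here from the output above.

```python
import numpy as np
import pandas as pd

# df2 as shown in the output above.
df2 = pd.DataFrame({
    "a": ["x", "n", "n"],
    "b": ["x", "y", "n"],
    "c": ["x", "z", "n"],
})

# Treat 'n' as missing.
df2_nan = df2.replace("n", np.nan)

# Back-fill each row from right to left, so the first column ends up
# holding the leftmost non-missing value of that row.
leftmost = df2_nan.bfill(axis=1).iloc[:, 0]
print(leftmost)
```

A row that contains only 'n' stays NaN, which is what later produces the null result.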

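We want to test the values of df2's columns successively from left to right for whether they are in df1_full.columns, similar to "Computing the first non-missing value from each column in a DataFrame", except that we test successive columns (axis=1). Then we store that intermediate column name in a new column 'join_col':

>>> df2['join_col'] = df2.replace('n', np.nan).apply(pd.Series.first_valid_index, axis=1)
>>> df2
   a  b  c join_col
0  x  x  x        a
1  n  y  z        b
2  n  n  n     None

Actually we want to index into df1's column names, but this blows up on NaN:

>>> df1.columns[ df2_nan.apply(pd.Series.first_valid_index, axis=1) ]

(Well, that doesn't exactly work, but you get the idea.)

Finally we do the merge df1_full.merge(df2, left_index=True, right_on='join_col'). Then we can take the desired column slice ['a','b','c','Result'], and cast Result back to int, or map 'NaN' -> 'NULL'.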
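Since the answer stops at "you get the idea", here is one runnable way to reach the expected df2. It is a variation on the answer's approach, not the answer's exact code: instead of merging the pivoted table on join_col, it left-merges on the (Column, Item) pair. Both input frames are reconstructed from the outputs shown in this thread, and the names keys, looked_up and out are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed inputs, reconstructed from the outputs shown in this thread.
df1 = pd.DataFrame({
    "Column": ["a", "b", "b", "c"],
    "Item":   ["x", "y", "z", "x"],
    "Result": [3, 4, 5, 6],
})
df2 = pd.DataFrame({
    "a": ["x", "n", "n"],
    "b": ["x", "y", "n"],
    "c": ["x", "z", "n"],
})

# Treat 'n' as missing.
df2_nan = df2.replace("n", np.nan)

# For each row: the leftmost column with a valid value, and that value.
keys = pd.DataFrame({
    "Column": df2_nan.apply(pd.Series.first_valid_index, axis=1),
    "Item": df2_nan.bfill(axis=1).iloc[:, 0],
})

# A left merge keeps every df2 row, in order; unmatched rows get NaN.
looked_up = keys.merge(df1, on=["Column", "Item"], how="left")

# Attach the result, cast back to a nullable integer column.
out = df2.copy()
out["Result"] = looked_up["Result"].astype("Int64").values
print(out)
```

The all-'n' row finds no match in df1, so its Result is missing, corresponding to the null in the expected output.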

Map pandas columns using column names