Question

我有一个如下所示的DataFrame（df1）

    Hair  Feathers  Legs  Type  Count
 R1  1       NaN     0     1      1
 R2  1        0      Nan   1      32
 R3  1        0      2     1      4
 R4  1       Nan     4     1      27

我想根据每列中值的不同组合来合并行，并且还想为每个合并行添加计数值。结果数据帧（df2）将如下所示：

    Hair  Feathers  Legs  Type  Count
 R1   1      0        0     1     33
 R2   1      0        2     1     36
 R3   1      0        4     1     59

合并的方式是将任何Nan值与0或1合并。在df2中，R1是通过将Feathers（df1，R1）的Nan值与Feathers（df1，R2）的0值。类似地，分支（df1，R1）中的0值与分支（df1，R2）的Nan值合并。然后，将R1（1）和R2（32）的计数相加。以相同的方式合并R2和R3，因为R2（df1）中的Feathers值与R3（df1）类似，并且Nan的Legs值与R3（df1）中的2和R2的计数（32）合并和R3（4）已添加。

我希望这种解释有意义。任何帮助将不胜感激

Answer 1

一种可能的方法是复制包含NaN的每一行，并用该列的值填充它们。

首先，我们需要获取每列可能不为空的唯一值：

unique_values = df.iloc[:, :-1].apply(
       lambda x: x.dropna().unique().tolist(), axis=0).to_dict()   
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}

然后遍历数据帧的每一行，并用每一列的可能值替换每个NaN。我们可以使用pandas.DataFrame.iterrows：

mask = df.iloc[:, :-1].isnull().any(axis=1)

# Keep the rows that do not contain `Nan`
# and then added modified rows

list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows(): 

    for c in row[row.isnull()].index: 

        # For each column of the row, replace 
        # Nan by possible values for the column

        for v in unique_values[c]: 

            list_of_df.append(row.copy().fillna({c:v})) 

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T

结果是一个数据框，其中所有NaN均已填充了该列的可能值：

> df_res

   Hair  Feathers  Legs  Type  Count
0   1.0       0.0   2.0   1.0    4.0
1   1.0       0.0   0.0   1.0    1.0
2   1.0       0.0   0.0   1.0   32.0
3   1.0       0.0   2.0   1.0   32.0
4   1.0       0.0   4.0   1.0   32.0
5   1.0       0.0   4.0   1.0   27.0

要通过Count的可能组合获得['Hair', 'Feathers', 'Legs', 'Type']分组的最终结果，我们只需要做以下事情：

> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()  

   Hair  Feathers  Legs  Type  Count
0   1.0       0.0   0.0   1.0   33.0
1   1.0       0.0   2.0   1.0   36.0
2   1.0       0.0   4.0   1.0   59.0

希望它有用

更新

如果该行中的一个或多个元素丢失，则该过程将同时为丢失的值查找所有可能的组合。让我们添加一个缺少两个元素的新行：

> df

   Hair  Feathers  Legs  Type  Count
0   1.0       NaN   0.0   1.0    1.0
1   1.0       0.0   NaN   1.0   32.0
2   1.0       0.0   2.0   1.0    4.0
3   1.0       NaN   4.0   1.0   27.0
4   1.0       NaN   NaN   1.0   32.0

我们将以类似的方式进行，但是将使用itertools.product获得替换组合：

 import itertools 

 unique_values = df.iloc[:, :-1].apply(
       lambda x: x.dropna().unique().tolist(), axis=0).to_dict()

 mask = df.iloc[:, :-1].isnull().any(axis=1) 

 list_of_df = [r for i, r in df[~mask].iterrows()] 

 for row_index, row in df[mask].iterrows():  

     cols = row[row.isnull()].index.tolist() 

     for p in itertools.product(*[unique_values[c] for c in cols]): 

         list_of_df.append(row.copy().fillna({c:v for c, v in zip(cols, p)}))

 df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T       


> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)

Hair  Feathers  Legs  Type  Count
1   1.0       0.0   0.0   1.0    1.0
2   1.0       0.0   0.0   1.0   32.0
6   1.0       0.0   0.0   1.0   32.0
0   1.0       0.0   2.0   1.0    4.0
3   1.0       0.0   2.0   1.0   32.0
7   1.0       0.0   2.0   1.0   32.0
4   1.0       0.0   4.0   1.0   32.0
5   1.0       0.0   4.0   1.0   27.0
8   1.0       0.0   4.0   1.0   32.0

> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()

   Hair  Feathers  Legs  Type  Count
0   1.0       0.0   0.0   1.0   65.0
1   1.0       0.0   2.0   1.0   68.0
2   1.0       0.0   4.0   1.0   91.0

如何在DataFrame中合并具有值组合的行

1 个答案: