How do I keep all unique value combinations when merging columns?

Time: 2017-10-01 18:11:15

Tags: python pandas dataframe merge duplicates

Given the table

| A | B | C | C | C | D | D |
  1   0   x   y   z   8   9
  2   4   x   b               

what are the best ways to return

| A | B | C | D |
  1   0   x   8
  1   0   y   8
  1   0   z   8
  1   0   x   9
  1   0   y   9
  1   0   z   9
  2   4   x
  2   4   b

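I'm using pandas read_csv to pull this in from a CSV... I'm not sure whether I can handle it there, or whether I should use SQL or Python dicts.

I searched hard and couldn't find an answer.

(I'm a novice, so I may be missing something basic...)

Edit: needs to accommodate n

2 answers:

Answer 0: (score: 2)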

import pandas as pd

df = pd.DataFrame([[1,0,'x','y','z',8,9]], columns=list('ABCCCDD'))

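# Group the transposed frame by column label, so each group holds the values
# that share a label. .tolist() passes plain lists, so from_product does not
# pick up the same Series name for every level.
result = pd.MultiIndex.from_product(
             [grp.tolist() for key, grp in df.T.groupby(level=0)[0]]).to_frame(index=False)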
print(result)

yields

   0  1  2  3
0  1  0  x  8
1  1  0  x  9
2  1  0  y  8
3  1  0  y  9
4  1  0  z  8
5  1  0  z  9
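
To see what from_product is doing here, a minimal sketch with the grouped values written out by hand may help. It builds the Cartesian product of the lists it is given, one level per list. Passing names=list('ABCD') is an addition for illustration (not part of the code above); it labels the result columns A to D instead of 0 to 3.

```python
import pandas as pd

# One list per unique column label, written out by hand for the first row.
groups = [[1], [0], ['x', 'y', 'z'], [8, 9]]

# Cartesian product of the four lists: 1 * 1 * 3 * 2 = 6 combinations.
# names= is an illustrative addition that labels the resulting columns.
labeled = pd.MultiIndex.from_product(groups, names=list('ABCD')).to_frame(index=False)
print(labeled)
```

The rightmost list varies fastest, which is why the rows come out as x 8, x 9, y 8, and so on.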

If your DataFrame has multiple rows:

import numpy as np
import pandas as pd

def row_to_arrays(row, idx):
    """
    Split a row into a list of component arrays.
    idx specifies the indices at which we want to split the row
    """
    # Use row[1:] because the first item in each row is the index 
    # (which we want to ignore)
    result = np.split(row[1:], idx)
    # Filter out empty strings
    result = [arr[arr != ''] for arr in result]
    # Filter out empty arrays
    result = [arr for arr in result if len(arr)]
    return result

def arrays_to_dataframe(arrays):
    """
    Convert list of arrays to product DataFrame
    """
    return pd.MultiIndex.from_product(arrays).to_frame(index=False) 

def df_to_row_product(df):
    # find the indices at which to cut each row
    idx = pd.DataFrame(df.columns).groupby(0)[0].agg(lambda x: x.index[0])[1:]
    data = [arrays_to_dataframe(row_to_arrays(row, idx))
            for row in df.itertuples()]
    result = pd.concat(data, ignore_index=True).fillna('')
    return result

df = pd.DataFrame([[1,0,'x','y','z',8,9],
                   [2,4,'x','b','','','']], columns=list('ABCCCDD'))

print(df_to_row_product(df))

yields

   0  1  2  3
0  1  0  x  8
1  1  0  x  9
2  1  0  y  8
3  1  0  y  9
4  1  0  z  8
5  1  0  z  9
6  2  4  x   
7  2  4  b   
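
Since the question loads the data with read_csv, one thing to watch for is that read_csv renames repeated headers (C, C.1, C.2, ...), while df_to_row_product above relies on the labels being identical. The following is only a sketch of one way to restore them. The CSV text is a hypothetical stand-in for the real file, and stripping the numeric suffix assumes that no genuine column name ends in a dot followed by digits.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file.
csv_text = "A,B,C,C,C,D,D\n1,0,x,y,z,8,9\n2,4,x,b,,,\n"

# dtype=str with keep_default_na=False keeps empty cells as '' rather than NaN,
# which matches the empty-string filter used in row_to_arrays.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

# read_csv has de-duplicated the headers at this point.
mangled = list(df.columns)

# Assumption: no real column name ends in ".<digits>", so the suffix can be
# stripped to recover the original repeated labels.
df.columns = df.columns.str.replace(r'\.\d+$', '', regex=True)
print(mangled)
print(list(df.columns))
```

Note that with dtype=str every value is a string ('8' rather than 8), so convert afterwards if numeric columns are needed.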

Answer 1: (score: 1)

I can think of one possible solution, using a little preprocessing and itertools.product:

from itertools import product 

prod = list(product(*df.groupby(df.columns, axis=1)\
                  .apply(lambda x: x.values.reshape(-1, )).tolist()))
prod
[(1, 0, 'x', 8),
 (1, 0, 'x', 9),
 (1, 0, 'y', 8),
 (1, 0, 'y', 9),
 (1, 0, 'z', 8),
 (1, 0, 'z', 9)]

df = pd.DataFrame(prod, columns=list('ABCD'))\
                 .sort_values('D').reset_index(drop=True)
df
   A  B  C  D
0  1  0  x  8
1  1  0  y  8
2  1  0  z  8
3  1  0  x  9
4  1  0  y  9
5  1  0  z  9
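
The snippet above handles a single row. As a rough sketch of how the same itertools.product idea could be applied row by row, with any number of repeated columns, something like the following might work. The helper names is_blank and expand_row_products are hypothetical, and treating '' and NaN as "no value" is an assumption about how the blanks are stored.

```python
from itertools import product

import pandas as pd


def is_blank(value):
    """Treat empty strings and NaN/None as missing (an assumption)."""
    return value == '' if isinstance(value, str) else pd.isna(value)


def expand_row_products(df):
    # Unique column labels in first-seen order, e.g. ['A', 'B', 'C', 'D'].
    labels = list(dict.fromkeys(df.columns))
    records = []
    # name=None yields plain tuples, which is safe with duplicate labels.
    for values in df.itertuples(index=False, name=None):
        groups = {label: [] for label in labels}
        for label, value in zip(df.columns, values):
            if not is_blank(value):
                groups[label].append(value)
        # A group with no values keeps a single '' so the row is not dropped.
        pools = [groups[label] or [''] for label in labels]
        records.extend(product(*pools))
    return pd.DataFrame(records, columns=labels)


df = pd.DataFrame([[1, 0, 'x', 'y', 'z', 8, 9],
                   [2, 4, 'x', 'b', '', '', '']], columns=list('ABCCCDD'))
result = expand_row_products(df)
print(result)
```

Rows come out in product order (the last column varies fastest); sort afterwards if a different order is wanted.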