Pandas: repeat rows based on column values

Time: 2019-06-13 15:38:58

Tags: python python-3.x pandas dataframe transformation

Given the following DataFrame:

import pandas as pd

# One value per column name below (7 per row), matching the table that follows
data = [[1, 'Yes','A','No','Yes','No','No'],
        [2, 'Yes','A','No','No','Yes','No'],
        [3, 'Yes','B','No','No','Yes','No'],
        [4, 'No','','','','',''],
        [5, 'No','','','','',''],
        [6, 'Yes','C','No','No','Yes','Yes'],
        [7, 'Yes','A','No','Yes','No','No'],
        [8, 'Yes','A','No','No','Yes','No'],
        [9, 'No','','','','',''],
        [10, 'Yes','B','Yes','Yes','No','No']]
df = pd.DataFrame(data,columns=['Cust_ID','OrderMade','OrderType','OrderCategoryA','OrderCategoryB','OrderCategoryC','OrderCategoryD'])


+----+-----------+-------------+-------------+------------------+------------------+------------------+------------------+
|    |   Cust_ID | OrderMade   | OrderType   | OrderCategoryA   | OrderCategoryB   | OrderCategoryC   | OrderCategoryD   |
|----+-----------+-------------+-------------+------------------+------------------+------------------+------------------|
|  0 |         1 | Yes         | A           | No               | Yes              | No               | No               |
|  1 |         2 | Yes         | A           | No               | No               | Yes              | No               |
|  2 |         3 | Yes         | B           | No               | No               | Yes              | No               |
|  3 |         4 | No          |             |                  |                  |                  |                  |
|  4 |         5 | No          |             |                  |                  |                  |                  |
|  5 |         6 | Yes         | C           | No               | No               | Yes              | Yes              |
|  6 |         7 | Yes         | A           | No               | Yes              | No               | No               |
|  7 |         8 | Yes         | A           | No               | No               | Yes              | No               |
|  8 |         9 | No          |             |                  |                  |                  |                  |
|  9 |        10 | Yes         | B           | Yes              | Yes              | No               | No               |
+----+-----------+-------------+-------------+------------------+------------------+------------------+------------------+

How can I transform it into rows based on OrderCategory?

+--------+-----------+----------+----------------+
|Cust_ID | OrderMade |OrderType | OrderCategory  |
|--------+-----------+----------+----------------|
|1       |   Yes     |    A     | OrderCategoryB |
|2       |   Yes     |    A     | OrderCategoryC |
|3       |   Yes     |    B     | OrderCategoryC |
|4       |   No      |          |                |
|5       |   No      |          |                |
|6       |   Yes     |    C     | OrderCategoryC |
|6       |   Yes     |    C     | OrderCategoryD |
|7       |   Yes     |    A     | OrderCategoryB |
|8       |   Yes     |    A     | OrderCategoryC |
|9       |   No      |          |                |
|10      |   Yes     |    B     | OrderCategoryA |
|10      |   Yes     |    B     | OrderCategoryB |
+--------+-----------+----------+----------------+

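I tried starting with a single OrderCategory using crosstab and planned to repeat that for each category, but this seems inefficient, and I'm not sure how to continue from there to get the desired result.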

imgCROSS = pd.crosstab(df["Cust_ID"], df["OrderCategoryA"])

Returns...

OrderCategoryA     No  Yes
Cust_ID                   
1               0   1    0
2               0   1    0
3               0   1    0
4               1   0    0
5               1   0    0
6               0   1    0
7               0   1    0
8               0   1    0
9               1   0    0
10              0   0    1

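I also thought I could add a new empty column called Category and iterate over each row, filling in the appropriate category based on the Yes/No values, but that would not work for rows that have multiple categories. In addition, the following implementation of this idea returns an empty column.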

imgRaw["Category"] = ""
for index, row in df.iterrows():
    catA = row["OrderCategoryA"]
    catB = row["OrderCategoryB"]
    catC = row["OrderCategoryC"]
    catD = row["OrderCategoryD"]

    if catA == "Yes":
        row["Category"] = "OrderCategoryA"
    elif catB == "Yes":
        row["Category"] = "OrderCategoryB"
    elif catC == "Yes":
        row["Category"] = "OrderCategoryC"
    elif catD == "Yes":
        row["Category"] = "OrderCategoryD"

I know I need to transform the DataFrame, possibly several times, to get the desired result. I'm just stuck on how to proceed.

4 Answers:

Answer 0: (score: 3)

Let's use pandas in four steps:

df_1 = df.set_index(['Cust_ID', 'OrderMade', 'OrderType'])

df_2 = df_1.where((df_1 == "Yes") | (df_1 == "")).rename_axis('OrderCategory', axis=1).stack().reset_index()

df_2['OrderCategory'] = df_2['OrderCategory'].mask(df_2['OrderMade'] == 'No','')

df_2.drop_duplicates().drop(0, axis=1)

Output:

    Cust_ID OrderMade OrderType   OrderCategory
0         1       Yes         A  OrderCategoryB
1         2       Yes         A  OrderCategoryC
2         3       Yes         B  OrderCategoryC
3         4        No                          
8         5        No                          
13        6       Yes         C  OrderCategoryC
14        6       Yes         C  OrderCategoryD
15        7       Yes         A  OrderCategoryB
16        8       Yes         A  OrderCategoryC
17        9        No                          
22       10       Yes         B  OrderCategoryA
23       10       Yes         B  OrderCategoryB
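
To see what each of the four steps contributes, here is a minimal sketch on a three-customer subset of the question's data. The subset, the names `kept` and `result`, and the explicit `dropna()` are assumptions added for illustration; the `dropna()` is there because the way `stack` treats NaN has differed between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

# Hypothetical three-customer subset of the question's data (4 category columns).
df = pd.DataFrame(
    [[4, 'No', '', '', '', '', ''],
     [6, 'Yes', 'C', 'No', 'No', 'Yes', 'Yes'],
     [10, 'Yes', 'B', 'Yes', 'Yes', 'No', 'No']],
    columns=['Cust_ID', 'OrderMade', 'OrderType', 'OrderCategoryA',
             'OrderCategoryB', 'OrderCategoryC', 'OrderCategoryD'])

# Step 1: move the identifying columns into the index.
df_1 = df.set_index(['Cust_ID', 'OrderMade', 'OrderType'])

# Step 2: keep only 'Yes' and '' cells ('No' becomes NaN), then stack the
# category columns into rows and drop the NaN cells explicitly.
kept = df_1.where((df_1 == 'Yes') | (df_1 == ''))
df_2 = (kept.rename_axis('OrderCategory', axis=1)
            .stack()
            .dropna()
            .reset_index())

# Step 3: customers without an order get a blank category.
df_2['OrderCategory'] = df_2['OrderCategory'].mask(df_2['OrderMade'] == 'No', '')

# Step 4: collapse the repeated blank rows and drop the leftover value column,
# which reset_index names 0.
result = df_2.drop_duplicates().drop(columns=0).reset_index(drop=True)
print(result)
```

Customer 4 ends up with a single blank row, while customers 6 and 10 each get one row per 'Yes' category.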

Answer 1: (score: 1)

Here is one way to do it (I had to modify your original DataFrame so that it has only one OrderCategoryD instead of two... hopefully that was a typo):

import pandas as pd

keep_cols = ['Cust_ID','OrderMade','OrderType']
parts = []

for col in df.columns:
    if 'OrderCategory' in col:
        cat = col[-1:]                                     # Get the category letter
        temp = df.loc[df[col] == 'Yes', keep_cols].copy()  # Get all the rows with a 'Yes' in this column
        temp['OrderCategory'] = cat                        # Add a column with the correct letter
        parts.append(temp)                                 # Collect that piece

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
build = pd.concat(parts)

# Once that's done, bring back all the rows that have a 'No' in the OrderMade column
final = pd.merge(build, df[keep_cols], how='right').sort_values('Cust_ID')
final = final.reset_index(drop=True)

Answer 2: (score: 1)

Add another category column that represents 'No' in 'OrderMade'

This generalizes the problem and lets us use a more uniform approach.

import numpy as np

d = df.assign(**{'': df.OrderMade.map({'Yes': 'No', 'No': 'Yes'})})
ids, cat = d.iloc[:, :3], d.iloc[:, 3:]  # split between 3rd and 4th columns
i, j = np.where(cat.eq('Yes'))           # row and column positions of every 'Yes'

ids.iloc[i].assign(OrderCategory=cat.columns[j])

  Cust_ID OrderMade OrderType   OrderCategory
0       1       Yes         A  OrderCategoryB
1       2       Yes         A  OrderCategoryC
2       3       Yes         B  OrderCategoryC
3       4        No                          
4       5        No                          
5       6       Yes         C  OrderCategoryC
5       6       Yes         C  OrderCategoryD
6       7       Yes         A  OrderCategoryB
7       8       Yes         A  OrderCategoryC
8       9        No                          
9      10       Yes         B  OrderCategoryA
9      10       Yes         B  OrderCategoryB
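
The key step is `np.where` on a boolean table. As a minimal sketch, assuming a hypothetical two-row `cat` frame that is not part of the question's data, this shows the row and column positions it returns and how they map back to column names:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row category table, just to show what np.where returns.
cat = pd.DataFrame({'OrderCategoryA': ['No', 'Yes'],
                    'OrderCategoryB': ['Yes', 'Yes']})

# For a 2-D boolean input, np.where yields the row positions and the
# column positions of every True cell, in row-major order.
i, j = np.where(cat.eq('Yes').to_numpy())

# Pair each row position with the name of the matching column.
pairs = [(int(r), cat.columns[c]) for r, c in zip(i, j)]
print(pairs)
# [(0, 'OrderCategoryB'), (1, 'OrderCategoryA'), (1, 'OrderCategoryB')]
```

Row 1 appears twice because it has two 'Yes' cells, which is exactly what repeats a customer's row once per category.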

melt

Adding the column also simplifies the melt

d = df.assign(**{'': df.OrderMade.map({'Yes': 'No', 'No': 'Yes'})})
d.melt(['Cust_ID', 'OrderMade', 'OrderType'], var_name='OrderCategory') \
 .query('value == "Yes"').drop(columns='value').sort_values('Cust_ID')

    Cust_ID OrderMade OrderType   OrderCategory
10        1       Yes         A  OrderCategoryB
21        2       Yes         A  OrderCategoryC
22        3       Yes         B  OrderCategoryC
53        4        No                          
54        5        No                          
25        6       Yes         C  OrderCategoryC
35        6       Yes         C  OrderCategoryD
16        7       Yes         A  OrderCategoryB
27        8       Yes         A  OrderCategoryC
58        9        No                          
9        10       Yes         B  OrderCategoryA
19       10       Yes         B  OrderCategoryB
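
For readers who want to see the intermediate long format, here is a small sketch on a hypothetical two-customer frame that already contains the extra '' column. The frame and the names `long` and `out` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-customer frame with the extra '' column already added.
d = pd.DataFrame({'Cust_ID': [4, 6],
                  'OrderMade': ['No', 'Yes'],
                  'OrderType': ['', 'C'],
                  'OrderCategoryC': ['', 'Yes'],
                  'OrderCategoryD': ['', 'Yes'],
                  '': ['Yes', 'No']})

# melt turns every non-id column into (OrderCategory, value) rows.
long = d.melt(['Cust_ID', 'OrderMade', 'OrderType'], var_name='OrderCategory')
print(len(long))  # 2 customers x 3 melted columns = 6 rows

# Keeping only the 'Yes' cells leaves one row per category, plus one row with
# a blank category for the customer without an order.
out = long.query('value == "Yes"').drop(columns='value').sort_values('Cust_ID')
print(out)
```

Because the '' column holds 'Yes' exactly for the customers with no order, those customers survive the filter with a blank OrderCategory instead of disappearing.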

Answer 3: (score: 0)

As the other answer suggests, you want melt with some extra cleanup, followed by a merge:

id_cols = ['Cust_ID','OrderMade','OrderType']
new_df = df[df.OrderMade.eq('Yes')].melt(id_vars=id_cols, var_name='OrderCategory')


# The outer parentheses let the method chain span several lines
(new_df[new_df['value'].ne('No')]
        .merge(df.loc[df.OrderMade.eq('No'),
                      ['Cust_ID','OrderMade','OrderType']],
               how='outer')
        .drop('value',axis=1))

Output:

    Cust_ID OrderMade OrderType   OrderCategory
0        10       Yes         B  OrderCategoryA
1        10       Yes         B  OrderCategoryB
2         1       Yes         A  OrderCategoryB
3         7       Yes         A  OrderCategoryB
4         2       Yes         A  OrderCategoryC
5         3       Yes         B  OrderCategoryC
6         6       Yes         C  OrderCategoryC
7         6       Yes         C  OrderCategoryD
8         8       Yes         A  OrderCategoryC
9         4        No                       NaN
10        5        No                       NaN
11        9        No                       NaN
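
This output differs from the layout requested in the question in two ways: the 'No' customers show NaN rather than a blank, and the rows are not in Cust_ID order. A possible follow-up, sketched here on a hypothetical stand-in frame called `merged` (the name and the data are assumptions, not the real merge result), is to fill the NaN values and sort:

```python
import pandas as pd

# Hypothetical stand-in for the merged result above: the 'No' customer has a
# missing OrderCategory because it had no matching row on the melted side.
merged = pd.DataFrame({'Cust_ID': [10, 10, 1, 4],
                       'OrderMade': ['Yes', 'Yes', 'Yes', 'No'],
                       'OrderType': ['B', 'B', 'A', ''],
                       'OrderCategory': ['OrderCategoryA', 'OrderCategoryB',
                                         'OrderCategoryB', None]})

# Replace the missing categories with blanks, then order by customer.
tidy = (merged.fillna({'OrderCategory': ''})
              .sort_values(['Cust_ID', 'OrderCategory'])
              .reset_index(drop=True))
print(tidy)
```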