Given the following DataFrame:
import pandas as pd

# Each row carries one value per column name (7 columns, 7 values)
data = [[1, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [2, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [3, 'Yes', 'B', 'No', 'No', 'Yes', 'No'],
        [4, 'No', '', '', '', '', ''],
        [5, 'No', '', '', '', '', ''],
        [6, 'Yes', 'C', 'No', 'No', 'Yes', 'Yes'],
        [7, 'Yes', 'A', 'No', 'Yes', 'No', 'No'],
        [8, 'Yes', 'A', 'No', 'No', 'Yes', 'No'],
        [9, 'No', '', '', '', '', ''],
        [10, 'Yes', 'B', 'Yes', 'Yes', 'No', 'No']]
df = pd.DataFrame(data, columns=['Cust_ID', 'OrderMade', 'OrderType', 'OrderCategoryA', 'OrderCategoryB', 'OrderCategoryC', 'OrderCategoryD'])
+----+---------+-----------+-----------+----------------+----------------+----------------+----------------+
|    | Cust_ID | OrderMade | OrderType | OrderCategoryA | OrderCategoryB | OrderCategoryC | OrderCategoryD |
|----+---------+-----------+-----------+----------------+----------------+----------------+----------------|
|  0 |       1 | Yes       | A         | No             | Yes            | No             | No             |
|  1 |       2 | Yes       | A         | No             | No             | Yes            | No             |
|  2 |       3 | Yes       | B         | No             | No             | Yes            | No             |
|  3 |       4 | No        |           |                |                |                |                |
|  4 |       5 | No        |           |                |                |                |                |
|  5 |       6 | Yes       | C         | No             | No             | Yes            | Yes            |
|  6 |       7 | Yes       | A         | No             | Yes            | No             | No             |
|  7 |       8 | Yes       | A         | No             | No             | Yes            | No             |
|  8 |       9 | No        |           |                |                |                |                |
|  9 |      10 | Yes       | B         | Yes            | Yes            | No             | No             |
+----+---------+-----------+-----------+----------------+----------------+----------------+----------------+
How can I reshape it into rows based on OrderCategory, with one row per customer per category marked 'Yes'?
+---------+-----------+-----------+----------------+
| Cust_ID | OrderMade | OrderType | OrderCategory  |
|---------+-----------+-----------+----------------|
| 1       | Yes       | A         | OrderCategoryB |
| 2       | Yes       | A         | OrderCategoryC |
| 3       | Yes       | B         | OrderCategoryC |
| 4       | No        |           |                |
| 5       | No        |           |                |
| 6       | Yes       | C         | OrderCategoryC |
| 6       | Yes       | C         | OrderCategoryD |
| 7       | Yes       | A         | OrderCategoryB |
| 8       | Yes       | A         | OrderCategoryC |
| 9       | No        |           |                |
| 10      | Yes       | B         | OrderCategoryA |
| 10      | Yes       | B         | OrderCategoryB |
+---------+-----------+-----------+----------------+
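I tried using crosstab, starting with a single OrderCategory column and planning to repeat it for each category, but that seems inefficient, and I'm not sure how to continue from there to get the desired result.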
imgCROSS = pd.crosstab(df["Cust_ID"], df["OrderCategoryA"])
which returns:
OrderCategoryA     No  Yes
Cust_ID
1               0   1    0
2               0   1    0
3               0   1    0
4               1   0    0
5               1   0    0
6               0   1    0
7               0   1    0
8               0   1    0
9               1   0    0
10              0   0    1
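I also thought I could add a new empty column called Category, iterate over each row, and fill in the appropriate category based on the Yes/No values, but that doesn't work for rows that belong to more than one category. Also, the following implementation of this idea returns an empty column.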
imgRaw["Category"] = ""
for index, row in df.iterrows():
    catA = row["OrderCategoryA"]
    catB = row["OrderCategoryB"]
    catC = row["OrderCategoryC"]
    catD = row["OrderCategoryD"]
    if catA == "Yes":
        row["Category"] = "OrderCategoryA"
    elif catB == "Yes":
        row["Category"] = "OrderCategoryB"
    elif catC == "Yes":
        row["Category"] = "OrderCategoryC"
    elif catD == "Yes":
        row["Category"] = "OrderCategoryD"
I know I need to reshape the DataFrame, possibly several times, to get the desired result. I'm just stuck on how to proceed.
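On the empty Category column from the loop in the question: the snippet adds the column to imgRaw but loops over df, and, more fundamentally, iterrows hands back each row as a separate Series. With mixed column dtypes (as here, an integer Cust_ID next to string columns) that Series is a copy, so assigning to row[...] never reaches the frame. The sketch below illustrates this on a made-up three-row frame (the data is hypothetical, not the asker's) and shows that writing through df.at[index, ...] does update the frame. It still records only one category per row, so it does not solve the multi-category case.

```python
import pandas as pd

# Hypothetical miniature of the question's frame (illustrative data only)
df = pd.DataFrame({
    "Cust_ID": [1, 4, 10],
    "OrderCategoryA": ["No", "", "Yes"],
    "OrderCategoryB": ["Yes", "", "Yes"],
})
df["Category"] = ""

# Assigning to `row` changes only a temporary copy, so df is left untouched.
for index, row in df.iterrows():
    if row["OrderCategoryB"] == "Yes":
        row["Category"] = "OrderCategoryB"
unchanged = df["Category"].tolist()

# Writing through df.at[...] updates the frame itself.
for index, row in df.iterrows():
    if row["OrderCategoryA"] == "Yes":
        df.at[index, "Category"] = "OrderCategoryA"
    elif row["OrderCategoryB"] == "Yes":
        df.at[index, "Category"] = "OrderCategoryB"
updated = df["Category"].tolist()
```

Customer 10 ends up with only OrderCategoryA even though both columns say 'Yes', which is the limitation the question describes.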
Answer 0 (score: 3)
Let's use pandas in four steps:
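# 1. Move the identifying columns into the index
df_1 = df.set_index(['Cust_ID', 'OrderMade', 'OrderType'])
# 2. Keep only the 'Yes' and '' cells (everything else becomes NaN), then stack
#    the category columns into rows; this relies on stack discarding the NaN cells
df_2 = df_1.where((df_1 == "Yes") | (df_1 == "")).rename_axis('OrderCategory', axis=1).stack().reset_index()
# 3. Blank out the category for customers who made no order
df_2['OrderCategory'] = df_2['OrderCategory'].mask(df_2['OrderMade'] == 'No','')
# 4. Collapse the repeated 'No' rows and drop the leftover value column (named 0)
df_2.drop_duplicates().drop(0, axis=1)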
Output:
    Cust_ID  OrderMade  OrderType   OrderCategory
 0        1        Yes          A  OrderCategoryB
 1        2        Yes          A  OrderCategoryC
 2        3        Yes          B  OrderCategoryC
 3        4         No
 8        5         No
13        6        Yes          C  OrderCategoryC
14        6        Yes          C  OrderCategoryD
15        7        Yes          A  OrderCategoryB
16        8        Yes          A  OrderCategoryC
17        9         No
22       10        Yes          B  OrderCategoryA
23       10        Yes          B  OrderCategoryB
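Answer 1 (score: 1)
Here is one way to do it (I had to modify your original DataFrame so that it has only one OrderCategoryD column instead of two; hopefully that was a typo):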
keep_cols = ['Cust_ID', 'OrderMade', 'OrderType']
parts = []
for col in df.columns:
    if 'OrderCategory' in col:
        cat = col[-1:]  # Get the category letter
        temp = df.loc[df[col] == 'Yes', keep_cols].copy()  # Get all the rows with a 'Yes' in this column
        temp['OrderCategory'] = cat  # Add a column with the correct letter
        parts.append(temp)  # Collect that piece
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
build = pd.concat(parts)
# Once that's done, get all the rows that have a 'No' in the OrderMade column
final = pd.merge(build, df[keep_cols], how='right').sort_values('Cust_ID')
final = final.reset_index(drop=True)
Answer 2 (score: 1)
Add a column for 'No' in 'OrderMade'
This generalizes the problem and lets us use a more uniform approach.
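import numpy as np

# Helper column named '' that is 'Yes' exactly when OrderMade is 'No'
d = df.assign(**{'': df.OrderMade.map({'Yes': 'No', 'No': 'Yes'})})
# Split between the 3rd and 4th columns (positional slicing instead of np.split,
# which newer pandas versions may not accept for a DataFrame)
ids, cat = d.iloc[:, :3], d.iloc[:, 3:]
# Row and column positions of every 'Yes' cell
i, j = np.where(cat.eq('Yes'))
ids.iloc[i].assign(OrderCategory=cat.columns[j])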
   Cust_ID  OrderMade  OrderType   OrderCategory
0        1        Yes          A  OrderCategoryB
1        2        Yes          A  OrderCategoryC
2        3        Yes          B  OrderCategoryC
3        4         No
4        5         No
5        6        Yes          C  OrderCategoryC
5        6        Yes          C  OrderCategoryD
6        7        Yes          A  OrderCategoryB
7        8        Yes          A  OrderCategoryC
8        9         No
9       10        Yes          B  OrderCategoryA
9       10        Yes          B  OrderCategoryB
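To make the np.where step above concrete, here is a minimal sketch on a hypothetical 3x3 boolean mask (the mask and the three column labels are made up for illustration). Called with a single argument, np.where returns paired row and column positions of the True cells, so a row with two hits appears twice and a row with no hits does not appear at all.

```python
import numpy as np

# Hypothetical mask: rows stand for customers, columns for categories.
mask = np.array([
    [False, True,  False],   # row 0: one hit, in column 1
    [False, False, False],   # row 1: no hits
    [True,  False, True],    # row 2: two hits, in columns 0 and 2
])
columns = np.array(["OrderCategoryA", "OrderCategoryB", "OrderCategoryC"])

# i holds row positions, j holds the matching column positions.
i, j = np.where(mask)
pairs = list(zip(i.tolist(), columns[j].tolist()))
# → [(0, 'OrderCategoryB'), (2, 'OrderCategoryA'), (2, 'OrderCategoryC')]
```

Row 1 vanishes from the pairs, which is why the answer first adds the '' column: it gives every 'No' customer exactly one 'Yes' cell to be found.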
melt
Adding the column also simplifies the melt:
d = df.assign(**{'': df.OrderMade.map({'Yes': 'No', 'No': 'Yes'})})
# drop(columns=...) replaces the positional axis argument removed in pandas 2.0
d.melt(['Cust_ID', 'OrderMade', 'OrderType'], var_name='OrderCategory') \
    .query('value == "Yes"').drop(columns='value').sort_values('Cust_ID')
    Cust_ID  OrderMade  OrderType   OrderCategory
10        1        Yes          A  OrderCategoryB
21        2        Yes          A  OrderCategoryC
22        3        Yes          B  OrderCategoryC
53        4         No
54        5         No
25        6        Yes          C  OrderCategoryC
35        6        Yes          C  OrderCategoryD
16        7        Yes          A  OrderCategoryB
27        8        Yes          A  OrderCategoryC
58        9         No
 9       10        Yes          B  OrderCategoryA
19       10        Yes          B  OrderCategoryB
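For anyone who wants to try the melt idea in isolation, the following is a self-contained sketch on a hypothetical three-customer frame with only two category columns (the data and the name long_df are assumptions for illustration). It uses a boolean filter rather than query, and sorts on both Cust_ID and OrderCategory so the row order is deterministic.

```python
import pandas as pd

# Hypothetical miniature of the question's frame (illustrative data only)
df = pd.DataFrame({
    "Cust_ID": [1, 4, 10],
    "OrderMade": ["Yes", "No", "Yes"],
    "OrderType": ["A", "", "B"],
    "OrderCategoryA": ["No", "", "Yes"],
    "OrderCategoryB": ["Yes", "", "Yes"],
})

# Helper column named '' that is 'Yes' exactly when no order was made.
d = df.assign(**{"": df["OrderMade"].map({"Yes": "No", "No": "Yes"})})

long_df = (
    d.melt(id_vars=["Cust_ID", "OrderMade", "OrderType"], var_name="OrderCategory")
     .loc[lambda m: m["value"].eq("Yes")]      # keep only the 'Yes' cells
     .drop(columns="value")
     .sort_values(["Cust_ID", "OrderCategory"])
     .reset_index(drop=True)
)
```

Customer 4 survives as a single row with an empty OrderCategory, and customer 10 appears once per category.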
Answer 3 (score: 0)
As the other answer suggests, you want melt, plus some extra cleanup, followed by a merge:
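id_cols = ['Cust_ID', 'OrderMade', 'OrderType']
# Melt only the customers who made an order
new_df = df[df.OrderMade.eq('Yes')].melt(id_vars=id_cols, var_name='OrderCategory')
# Keep the cells that are not 'No', then outer-merge the 'No' customers back in
# (the chain is wrapped in parentheses so it parses across lines)
(new_df[new_df['value'].ne('No')]
    .merge(df.loc[df.OrderMade.eq('No'), id_cols], how='outer')
    .drop('value', axis=1))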
Output:
    Cust_ID  OrderMade  OrderType   OrderCategory
 0       10        Yes          B  OrderCategoryA
 1       10        Yes          B  OrderCategoryB
 2        1        Yes          A  OrderCategoryB
 3        7        Yes          A  OrderCategoryB
 4        2        Yes          A  OrderCategoryC
 5        3        Yes          B  OrderCategoryC
 6        6        Yes          C  OrderCategoryC
 7        6        Yes          C  OrderCategoryD
 8        8        Yes          A  OrderCategoryC
 9        4         No                        NaN
10        5         No                        NaN
11        9         No                        NaN
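The output above differs from the table requested in the question in two ways: the 'No' customers carry NaN rather than an empty string, and the rows are not ordered by Cust_ID. A possible follow-up, sketched on a hypothetical four-row slice of that result (the frame merged and the name tidy are assumptions for illustration), is to fill the missing categories and sort:

```python
import pandas as pd

# Hypothetical slice of the merged result (illustrative data only)
merged = pd.DataFrame({
    "Cust_ID": [10, 10, 1, 4],
    "OrderMade": ["Yes", "Yes", "Yes", "No"],
    "OrderType": ["B", "B", "A", ""],
    "OrderCategory": ["OrderCategoryA", "OrderCategoryB", "OrderCategoryB", None],
})

tidy = (
    merged.fillna({"OrderCategory": ""})            # blank instead of NaN
          .sort_values(["Cust_ID", "OrderCategory"])
          .reset_index(drop=True)
)
```

Whether blanks or NaN are preferable depends on what happens downstream; NaN is usually easier to filter on, while the empty string matches the table in the question.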