I have a DataFrame like this:
import pandas as pd
from collections import OrderedDict
have = pd.DataFrame(OrderedDict({'User':['101','101','102','102','103','103','103'],
'Name':['A','A','B','B','C','C','C'],
'Country':['India','UK','US','UK','US','India','UK'],
'product':['Soaps','Brush','Soaps','Brush','Soaps','Brush','Brush'],
'channel':['Retail','Online','Retail','Online','Retail','Online','Online'],
'Country_flag':['Y','Y','N','Y','N','N','Y'],
'product_flag':['N','Y','Y','Y','Y','N','N'],
'channel_flag':['N','N','N','Y','Y','Y','Y']
}))
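I want to create new columns based on the flags. Whenever a record has flags set to Y, I want to concatenate the corresponding values of that record.
For example, the first record has flag Y only on Country, so I want a new column ctry whose value is the concatenation User|Name|Country. In the second record both Country and product have Y, so I want a column ctry_prod whose value is the concatenation User|Name|Country|product, and so on for the other flag combinations.
Desired output (shown as an image in the original post):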
Answer 0 (score: 1)
My take:
# columns of interest
cat_cols = ['Country', 'product', 'channel']
flag_cols = [col+'_flag' for col in cat_cols]
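# select those values marked 'Y'; everything else becomes NaN
# dropna() removes the NaN entries explicitly, since newer pandas versions may keep them when stacking
s = (have[cat_cols].where(have[flag_cols].eq('Y').values)
.stack()
.dropna()
.reset_index(level=1)
)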
# join columns and values by |
s = s.groupby(s.index).agg('|'.join)
# add the 'User' and 'Name'
s[0] = have['User'] + "|" + have['Name'] + "|" + s[0]
# unstack to turn `level_1` to columns
s = s.reset_index().set_index(['index','level_1'])[0].unstack()
# concatenate side by side (axis=1), aligned on the row index
pd.concat((have,s), axis=1)
Output:
+----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------+
| | User | Name | Country | product | channel | Country_flag | product_flag | channel_flag | Country | Country|channel | Country|product | Country|product|channel | channel | product | product|channel |
|----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------|
| 0 | 101 | A | India | Soaps | Retail | Y | N | N | 101|A|India | nan | nan | nan | nan | nan | nan |
| 1 | 101 | A | UK | Brush | Online | Y | Y | N | nan | nan | 101|A|UK|Brush | nan | nan | nan | nan |
| 2 | 102 | B | US | Soaps | Retail | N | Y | N | nan | nan | nan | nan | nan | 102|B|Soaps | nan |
| 3 | 102 | B | UK | Brush | Online | Y | Y | Y | nan | nan | nan | 102|B|UK|Brush|Online | nan | nan | nan |
| 4 | 103 | C | US | Soaps | Retail | N | Y | Y | nan | nan | nan | nan | nan | nan | 103|C|Soaps|Retail |
| 5 | 103 | C | India | Brush | Online | N | N | Y | nan | nan | nan | nan | 103|C|Online | nan | nan |
| 6 | 103 | C | UK | Brush | Online | Y | N | Y | nan | 103|C|UK|Online | nan | nan | nan | nan | nan |
+----+--------+--------+-----------+-----------+-----------+----------------+----------------+----------------+-------------+-------------------+-------------------+---------------------------+--------------+-------------+--------------------+
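The column labels in this output ('Country', 'Country|product', ...) differ from the names requested in the question (ctry, ctry_prod, ...). The following is a minimal sketch of one way to rename them. The `short` dictionary and the `shorten` helper are hypothetical and not part of the answer, the short names are assumed from the question's wording, and a two-row stand-in frame takes the place of the real `s`:

```python
import pandas as pd

# Hypothetical short names, assumed from the wording of the question
short = {'Country': 'ctry', 'product': 'prod', 'channel': 'channel'}

def shorten(col):
    """Turn a label such as 'Country|product' into 'ctry_prod'."""
    return '_'.join(short[part] for part in col.split('|'))

# Stand-in for the unstacked frame `s` built by the answer
s = pd.DataFrame({'Country': ['101|A|India', None],
                  'Country|product': [None, '101|A|UK|Brush']})

# rename accepts a callable that is applied to every column label
s = s.rename(columns=shorten)
print(list(s.columns))
```

Applying the same `rename` call to the real `s` before the final `pd.concat` should give columns named after the flag combinations.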
Answer 1 (score: 0)
This is a tough one.
import numpy as np
# the last three columns hold the flags
s1=have.iloc[:,-3:]
# the matching value columns (Country, product, channel)
s2=have.iloc[:,2:-3]
# keep a value where its flag is 'Y', otherwise replace it with NaN
s2=s2.where((s1=='Y').values,np.nan)
# join User, Name and the remaining values with '|' for each row
s3=pd.concat([have.iloc[:,:2],s2],axis=1).stack().dropna().groupby(level=0).agg('|'.join)
# use dot to build the new column name from the flag columns marked 'Y'
s1=s1.eq('Y').dot(s1.columns+'_').str.strip('_')
# use crosstab to spread the joined strings across the new columns
s=pd.crosstab(values=s3,index=have.index,columns=s1,aggfunc='first').fillna(0)
df=pd.concat([have,s],axis=1)
df
Out[175]:
User Name Country ... channel_flag product_flag product_flag_channel_flag
0 101 A India ... 0 0 0
1 101 A UK ... 0 0 0
2 102 B US ... 0 102|B|Soaps 0
3 102 B UK ... 0 0 0
4 103 C US ... 0 0 103|C|Soaps| Retail
5 103 C India ... 103|C|Online 0 0
6 103 C UK ... 0 0 0
[7 rows x 15 columns]
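The `dot` step is the least obvious part of this answer, so here is a small self-contained sketch of it on a hypothetical two-column flag frame (the `flags` and `labels` names are illustrative only). Multiplying a boolean by a string keeps the string for True and gives an empty string for False, so the row-wise "sum" concatenates the names of the columns flagged 'Y':

```python
import pandas as pd

# Hypothetical flag frame with the same layout as the flag columns of `have`
flags = pd.DataFrame({'Country_flag': ['Y', 'N', 'Y'],
                      'product_flag': ['N', 'Y', 'Y']})

# Boolean frame "times" the column labels: True keeps the label, False gives ''
labels = flags.eq('Y').dot(flags.columns + '_').str.strip('_')
print(labels.tolist())
```

Each entry of `labels` is then used by `pd.crosstab` as the name of the column that receives that row's joined string.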
Answer 2 (score: 0)
Not very elegant, but it works. For clarity, I kept the loop and the if statement on multiple lines:
import numpy as np

# one three-letter pattern per row, e.g. 'YNN' when only Country is flagged
have['Linked_Flags'] = have['Country_flag'] + have['product_flag'] + have['channel_flag']
mapping = OrderedDict([('YNN', 'ctry'), ('NYN', 'prod'), ('NNY', 'chnl'), ('YYY', 'ctry_prod_channel'),('YYN', 'ctry_prod'), ('YNY', 'ctry_channel'), ('NYY', 'prod_channel')])
string_to_add_dict = {0: 'Country', 1: 'product', 2: 'channel'}
for linked_flag in mapping.keys():
    string_to_add = ''
    for position, letter in enumerate(linked_flag):
        if letter == 'Y':
            # prepend the separator so that no trailing '|' is left behind
            string_to_add += '|' + have[string_to_add_dict[position]]
    # fill the new column only on rows whose pattern matches; other rows get ''
    have[mapping[linked_flag]] = np.where(have['Linked_Flags'] == linked_flag, have['User'] + '|' + have['Name'] + string_to_add, '')
del have['Linked_Flags']
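The hard-coded `mapping` could also be generated rather than typed out. This is only a sketch under assumed short names (the `short` list is hypothetical), and it differs from the answer in one detail: the pattern 'NNY' maps to 'channel' here, whereas the answer uses 'chnl'.

```python
from itertools import product

# Hypothetical short names, in the same order as the flag columns
short = ['ctry', 'prod', 'channel']

mapping = {}
for combo in product('YN', repeat=len(short)):
    if 'Y' not in combo:
        continue  # skip 'NNN': no flag set, so no column is needed
    key = ''.join(combo)
    # keep only the names whose flag is 'Y' and join them with '_'
    mapping[key] = '_'.join(name for name, flag in zip(short, combo) if flag == 'Y')

print(mapping)
```

The resulting dictionary has seven entries and can be passed to the loop above in place of the literal `OrderedDict`.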