I have a dataset like this:
category UK US Germany
sales 100000 48000 36000
budget 50000 20000 14000
n_employees 300 123 134
diversified 1 0 1
sustainability_score 22.8 38.9 34.5
e_commerce 37000 7000 11000
budget 25000 10000 10000
n_employees 18 22 7
traffic 150 mil 38 mil 12500
subsidy 33000 26000 23000
budget 14000 6000 6000
own_marketing 0 0 1
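In the dataset, the sales variable corresponds to the sales of the main office. e_commerce is the sales from e-commerce, and the budget row right after e_commerce is actually the budget of the company's e-commerce division. The same applies to subsidy: the subsidy variable corresponds to the sales from the subsidy, and the budget row right after subsidy is the budget of the subsidy. I would like to transform the dataset into something like this (taking the UK as an example):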
UK_main_sales UK_main_budget ... UK_e_commerce_sales UK_e_commerce_budget ...
100000 50000 37000 25000
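and so on. I tried to classify the variables from the different departments by tracking the budget variable, since it always comes right after the row that starts a department, but I did not succeed.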
The full list of UK variables should look like this:
UK_main_sales
UK_main_budget
UK_main_n_employees
UK_main_diversified
UK_main_sustainability_score
UK_e_commerce (we could also add sales but I think it is simpler without sales)
UK_e_commerce_budget
UK_e_commerce_n_employees
UK_e_commerce_traffic
UK_subsidy
UK_subsidy_budget
UK_subsidy_own_marketing
Any ideas?
Answer 0 (score: 2)
I think you need:
# boolean mask marking the rows that start a new department block
mask = df['category'].isin(['subsidy', 'e_commerce'])
# where(mask): keep only the block-start names, NaN everywhere else
# ffill(): carry each block name down; fillna('main'): rows before the first block
# mask(mask).fillna(''): blank the prefix on the block-start rows themselves
# join prefix and original name, then strip the stray leading '_'
df['category'] = (df['category'].where(mask).ffill().fillna('main').mask(mask).fillna('')
                  + '_' + df['category']).str.strip('_')
# reshape to a single row: (country, category) pairs become columns
df = df.set_index('category').unstack().to_frame().T
# flatten the MultiIndex columns into 'country_category' names
df.columns = df.columns.map('_'.join)
print(df)
UK_main_sales UK_main_budget UK_main_n_employees UK_main_diversified \
0 100000 50000 300 1
UK_main_sustainability_score UK_e_commerce UK_e_commerce_budget \
0 22.8 37000 25000
UK_e_commerce_n_employees UK_e_commerce_traffic UK_subsidy \
0 18 150 mil 33000
Germany_main_n_employees \
0 ... 134
Germany_main_diversified Germany_main_sustainability_score \
0 1 34.5
Germany_e_commerce Germany_e_commerce_budget Germany_e_commerce_n_employees \
0 11000 10000 7
Germany_e_commerce_traffic Germany_subsidy Germany_subsidy_budget \
0 12500 23000 6000
Germany_subsidy_own_marketing
0 1
[1 rows x 36 columns]
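For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It assumes the table from the question is loaded with every value as a string (the traffic row mixes numbers with text such as "150 mil"), and it builds the labels in separate steps so the intermediate result is easier to inspect. The names rows, starts, department, labels and wide are only illustrative and do not come from the question.

```python
import pandas as pd

# Sample data copied from the question. Values are kept as strings
# (an assumption) because the traffic row contains text like "150 mil".
rows = [
    ("sales", "100000", "48000", "36000"),
    ("budget", "50000", "20000", "14000"),
    ("n_employees", "300", "123", "134"),
    ("diversified", "1", "0", "1"),
    ("sustainability_score", "22.8", "38.9", "34.5"),
    ("e_commerce", "37000", "7000", "11000"),
    ("budget", "25000", "10000", "10000"),
    ("n_employees", "18", "22", "7"),
    ("traffic", "150 mil", "38 mil", "12500"),
    ("subsidy", "33000", "26000", "23000"),
    ("budget", "14000", "6000", "6000"),
    ("own_marketing", "0", "0", "1"),
]
df = pd.DataFrame(rows, columns=["category", "UK", "US", "Germany"])

# Rows that open a new department block.
starts = df["category"].isin(["e_commerce", "subsidy"])

# Department label per row: the block-start name carried forward,
# with "main" for the rows before the first block starts.
department = df["category"].where(starts).ffill().fillna("main")

# Block-start rows keep just the department name;
# every other row becomes "<department>_<category>".
labels = department.where(starts, department + "_" + df["category"])

# One row, with a column for every (country, label) pair.
wide = df.assign(category=labels).set_index("category").unstack().to_frame().T
wide.columns = wide.columns.map("_".join)

print(wide.columns[:12].tolist())
```

Building department first and then using where(starts, ...) avoids the trailing str.strip('_') step, at the cost of one extra intermediate Series. If you would rather have UK_e_commerce_sales instead of UK_e_commerce, one option is to rename the block-start rows to "sales" before building the labels.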