My question is actually based on:
Is there a faster way to update dataframe column values based on conditions?
So the data looks like this:
import io
import pandas as pd

# id and source are separated by two or more spaces
t = """
AV4MdG6Ihowv-SKBN_nB  DTP,FOOD
AV4Mc2vNhowv-SKBN_Rn  Cash 1,FOOD
AV4MeisikOpWpLdepWy6  DTP,Bar
AV4MeRh6howv-SKBOBOn  Cash 1,FOOD
AV4Mezwchowv-SKBOB_S  DTOT,Bar
AV4MeB7yhowv-SKBOA5b  DTP,Bar
"""
# a regex separator needs a raw string and the python parsing engine
data_vec = pd.read_csv(io.StringIO(t), sep=r'\s{2,}', names=['id', 'source'], engine='python')
data_vec
This is data_vec:
                     id       source
0  AV4MdG6Ihowv-SKBN_nB     DTP,FOOD
1  AV4Mc2vNhowv-SKBN_Rn  Cash 1,FOOD
2  AV4MeisikOpWpLdepWy6      DTP,Bar
3  AV4MeRh6howv-SKBOBOn  Cash 1,FOOD
4  AV4Mezwchowv-SKBOB_S     DTOT,Bar
5  AV4MeB7yhowv-SKBOA5b      DTP,Bar
What if I want the following result? (In other words, how do I vectorize multiple labels or categories?)
                    _id  source_Cash 1  source_DTOT  source_DTP  Food  Bar
0  AV4MdG6Ihowv-SKBN_nB              0            0           1     1    0
1  AV4Mc2vNhowv-SKBN_Rn              1            0           0     1    0
2  AV4MeisikOpWpLdepWy6              0            0           1     0    1
3  AV4MeRh6howv-SKBOBOn              1            0           0     1    0
4  AV4Mezwchowv-SKBOB_S              0            1           0     0    1
5  AV4MeB7yhowv-SKBOA5b              0            0           1     0    1
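If this is a duplicate, please let me know and I'll delete it!
Answer 0 (score: 5)
A little str.split and str.get_dummies magic, inspired by Scott Boston and improved (over the original version) thanks to JohnE.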
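# data_vec is the frame from the question; build one 0/1 column per comma-separated label
df = data_vec.set_index('id').source.str.get_dummies(',')
# keep only the first word of each label, lowercased ('Cash 1' -> 'cash')
df.columns = df.columns.str.split().str[0].str.lower()
df = df.add_prefix('source_').reset_index()
print(df)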
                     id  source_bar  source_cash  source_dtot  source_dtp  source_food
0  AV4MdG6Ihowv-SKBN_nB           0            0            0           1            1
1  AV4Mc2vNhowv-SKBN_Rn           0            1            0           0            1
2  AV4MeisikOpWpLdepWy6           1            0            0           1            0
3  AV4MeRh6howv-SKBOBOn           0            1            0           0            1
4  AV4Mezwchowv-SKBOB_S           1            0            1           0            0
5  AV4MeB7yhowv-SKBOA5b           1            0            0           1            0
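Note that taking only the first word of each label turns 'Cash 1' into 'cash', so two labels that share a first word would end up with the same column name. If the original labels should be kept, as in the output the question asks for, the renaming step can simply be skipped. Here is a minimal sketch of that variant; the two-row `sample` frame and the `encoded` name are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the layout of the question's data.
sample = pd.DataFrame({
    "id": ["a1", "b2"],
    "source": ["DTP,FOOD", "Cash 1,Bar"],
})

# Series.str.get_dummies splits each string on the separator and returns
# one indicator column per distinct label.
dummies = sample.set_index("id")["source"].str.get_dummies(sep=",")

# Leave the labels untouched and only add the prefix.
encoded = dummies.add_prefix("source_").reset_index()
print(encoded)
```

This should give columns such as `source_Cash 1` and `source_FOOD` rather than the lowercased, shortened names.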
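Answer 1 (score: 1)
You can also do it this way: what I'm doing is splitting the 'source' column and creating a new row for each label. Then I call get_dummies on the source column and group by the 'id' column.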
# build one (source, id) row per comma-separated label
data_vec = pd.DataFrame(pd.concat([pd.Series(row['id'], row['source'].split(','))
                                   for _, row in data_vec.iterrows()])).reset_index()
data_vec.columns = ['source', 'id']
This gives:
    source                    id
0      DTP  AV4MdG6Ihowv-SKBN_nB
1     FOOD  AV4MdG6Ihowv-SKBN_nB
2   Cash 1  AV4Mc2vNhowv-SKBN_Rn
3     FOOD  AV4Mc2vNhowv-SKBN_Rn
4      DTP  AV4MeisikOpWpLdepWy6
5      Bar  AV4MeisikOpWpLdepWy6
6   Cash 1  AV4MeRh6howv-SKBOBOn
7     FOOD  AV4MeRh6howv-SKBOBOn
8     DTOT  AV4Mezwchowv-SKBOB_S
9      Bar  AV4Mezwchowv-SKBOB_S
10     DTP  AV4MeB7yhowv-SKBOA5b
11     Bar  AV4MeB7yhowv-SKBOA5b
Then call get_dummies() on the source column:
result = pd.concat([data_vec.get(['id']),
                    pd.get_dummies(data_vec['source'], prefix='source')], axis=1)
result.groupby('id').sum().reset_index()
This gives:
                     id  source_Bar  source_Cash 1  source_DTOT  source_DTP  source_FOOD
0  AV4Mc2vNhowv-SKBN_Rn           0              1            0           0            1
1  AV4MdG6Ihowv-SKBN_nB           0              0            0           1            1
2  AV4MeB7yhowv-SKBOA5b           1              0            0           1            0
3  AV4MeRh6howv-SKBOBOn           0              1            0           0            1
4  AV4MeisikOpWpLdepWy6           1              0            0           1            0
5  AV4Mezwchowv-SKBOB_S           1              0            1           0            0
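The row-by-row iterrows() loop can get slow on larger frames. A possible alternative, not the method used in this answer, is to do the same "split into rows, then count per id" idea with DataFrame.explode and pd.crosstab. This is only a sketch: it assumes a pandas version that has explode (0.25 or later), and the `sample`, `long_form`, `counts`, and `result` names are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row sample mirroring the layout of the question's data.
sample = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "source": ["DTP,FOOD", "Cash 1,FOOD", "DTP,Bar"],
})

# Turn each comma-separated string into a list, then emit one row per label.
long_form = (
    sample.assign(source=sample["source"].str.split(","))
          .explode("source")
          .reset_index(drop=True)
)

# Count how often each label occurs per id; when a label appears at most
# once per row, this is a 0/1 indicator matrix.
counts = pd.crosstab(long_form["id"], long_form["source"])

result = counts.add_prefix("source_").reset_index()
result.columns.name = None  # drop the leftover 'source' axis label
print(result)
```

As with groupby, the rows come back sorted by id rather than in their original order.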