I have two tabular files.
file1.txt:
name1 house1
name2 house1
name3 house1
name4 house2
name5 house2
name6 house2
and file2.txt:
name1 car
name2 bike
name3 skate
name4 car
name5 motorcycle
name6 boat
I want to use the information from both files to create a presence/absence matrix like this:
        car  motorcycle  bike  boat  skate
house1    1           0     1     0      1
house2    1           1     0     1      0
Here is my code:
import pandas as pd

with open('file1.txt', 'r') as file1:
    col_names = ['name', 'house']
    df1 = pd.read_csv(file1, sep='\t', header=None, names=col_names)

with open('file2.txt', 'r') as file2:
    col_names = ['name', 'transport']
    df2 = pd.read_csv(file2, sep='\t', header=None, names=col_names)

# include the values from df1 into the df2 creating a new column
df2['house'] = df2['name'].map(df1.set_index('name')['house'])
g = df2.groupby('house')['transport'].apply(list).reset_index()
g.join(pd.get_dummies(g['transport'].apply(pd.Series).stack()).sum(level=0)).drop('transport', 1)
print(g)
Doing this, I get the following output:
    house                transport
0  house1       [car, bike, skate]
1  house2  [car, motorcycle, boat]
Answer 0 (score: 0)
Here is one way.
Setup
# df1
  individual   house
0      name1  house1
1      name2  house1
2      name3  house1
3      name4  house2
4      name5  house2
5      name6  house2
# df2
  individual   transport
0      name1         car
1      name2        bike
2      name3       skate
3      name4         car
4      name5  motorcycle
5      name6        boat
Solution
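# Step 1: add each individual's house to df2
df2['house'] = df2['individual'].map(df1.set_index('individual')['house'])
# Step 2: collect the transports of each house into a list
g = df2.groupby('house')['transport'].apply(list).reset_index()
# Step 3: expand the lists into dummy columns; join() returns a new frame, so keep it
# (.sum(level=0) and the positional axis in drop() were removed in pandas 2.0)
res = g.join(pd.get_dummies(g['transport'].apply(pd.Series).stack())
             .groupby(level=0).sum()).drop(columns='transport')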
Result
    house  bike  boat  car  motorcycle  skate
0  house1     1     0    1           0      1
1  house2     0     1    1           1      0
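Explanation
There are 3 steps:
1. Add a house column to df2, mapped from df1.
2. Group by house and collect each house's transport values into a list.
3. Split each list into separate entries and use pd.get_dummies to expand the transport column into dummy columns, summed per house.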
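
The three steps can be sketched end to end as follows. This is a minimal, self-contained sketch rather than the answer's exact code: the helper name presence_matrix is hypothetical, the two tables are typed in by hand instead of being read from file1.txt and file2.txt, Series.explode stands in for the .apply(pd.Series).stack() chain, and the final clip to 0/1 is an added assumption so that a transport owned by two people in the same house still counts as a single "present".

```python
import pandas as pd

# Hand-typed copies of the two tables from the Setup section.
df1 = pd.DataFrame({
    'individual': ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
    'house': ['house1', 'house1', 'house1', 'house2', 'house2', 'house2'],
})
df2 = pd.DataFrame({
    'individual': ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
    'transport': ['car', 'bike', 'skate', 'car', 'motorcycle', 'boat'],
})


def presence_matrix(people, vehicles):
    """Hypothetical helper: one row per house, one 0/1 column per transport."""
    vehicles = vehicles.copy()  # leave the caller's frame untouched

    # Step 1: look up each individual's house in `people`.
    vehicles['house'] = vehicles['individual'].map(
        people.set_index('individual')['house'])

    # Step 2: collect the transports of each house into a list.
    g = vehicles.groupby('house')['transport'].apply(list).reset_index()

    # Step 3: one entry per list element (explode keeps the row index),
    # then dummy columns summed back to one row per house.
    dummies = pd.get_dummies(g['transport'].explode()).groupby(level=0).sum()

    # Assumption, not in the original answer: cap counts at 1 so the
    # matrix records presence/absence rather than how many were found.
    dummies = dummies.clip(upper=1).astype(int)

    return g[['house']].join(dummies)


result = presence_matrix(df1, df2)
print(result)
```

With the sample data this should give the same table as shown under Result. Returning a new frame from the helper also sidesteps the pitfall in the question, where the joined frame was computed but never assigned, so printing g still showed the list column.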