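Individuals (indexed from 0 to 5) choose between two locations: A and B. My data is in wide format, containing characteristics that vary by individual (ind_var) and characteristics that vary only by location (location_var).
For example, I have: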
In [281]:
df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})
df_reshape_test
Out[281]:
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14
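The variable 'location' is the location the individual chose. dist_to_A is the distance from the individual's chosen location to location A (and likewise for dist_to_B).
I would like my data to have this form: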
   choice  dist_S  ind_var location  location_var
0       1       0        3        A            10
0       0      50        3        B            14
1       1       0        8        A            10
1       0      50        8        B            14
2       1       0       10        A            10
2       0      50       10        B            14
3       0      50        1        A            10
3       1       0        1        B            14
4       0      50        3        A            10
4       1       0        3        B            14
5       0      50        4        A            10
5       1       0        4        B            14
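where choice == 1 indicates that the individual chose that location, and dist_S is the distance from the chosen location.
I read about the .stack method but couldn't figure out how to apply it in this case. Thanks for your time!
Note: this is just a simple example. The datasets I am working with have a varying number of locations and a varying number of individuals per location, so I am looking for a flexible solution if possible.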
Answer 0 (score: 6)
In fact, pandas has a wide_to_long command that conveniently does what you want.
import numpy as np
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})
df['ind'] = df.index
# `location` and `location_var` correspond to the choices:
# record them as dictionaries and drop them
# (just realized you had a cleaner way, copied from yours).
ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location')['location_var'].agg(lambda x: int(np.mean(x))))
df.drop(['location_var', 'location'], axis=1, inplace=True)
# now reshape; suffix=r'\w+' lets the letter suffixes A/B match
# (the default suffix pattern only matches digits)
df_long = pd.wide_to_long(df, ['dist_to_'], i='ind', j='location', suffix=r'\w+')
# use the dictionaries to get the variables `choice` and `location_var` back
df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x: loc_dict[x[1]])
print(df_long.sort_index())
This gives you the table you asked for:
              ind_var  dist_to_ choice  location_var
ind location
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14
Of course, you can generate a choice variable taking the values 0 and 1 instead, if that is what you want.
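One way to get that 0/1 choice variable is to compare each row's `location` index level with the location the individual actually chose. The following is a minimal sketch under that assumption; the helper names `chosen`, `loc_var`, `wide`, `inds` and `locs` are illustrative and not part of the original answer, and the `dist_to_` column is renamed to `dist_S` to match the layout in the question.

```python
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})
df['ind'] = df.index

# Lookup tables (illustrative names): individual -> chosen location,
# and location -> location_var.
chosen = df['location']
loc_var = df.groupby('location')['location_var'].first()

wide = df.drop(columns=['location', 'location_var'])
# suffix=r'\w+' is needed because the default suffix pattern only matches digits.
df_long = pd.wide_to_long(wide, stubnames='dist_to_', i='ind', j='location',
                          suffix=r'\w+').sort_index()

inds = df_long.index.get_level_values('ind')
locs = df_long.index.get_level_values('location')

# choice is 1 on the row whose location matches the individual's chosen location.
df_long['choice'] = (chosen.loc[inds].to_numpy() == locs.to_numpy()).astype(int)
df_long['location_var'] = loc_var.loc[locs].to_numpy()
df_long = df_long.rename(columns={'dist_to_': 'dist_S'})
print(df_long)
```

Because nothing here hard-codes 'A' or 'B', the same comparison should carry over to more locations, provided every distance column follows the `dist_to_<location>` naming pattern.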
Answer 1 (score: 3)
I'm a bit curious why you want it in this format. There is probably a better way to store your data. But here goes.
In [137]: import numpy as np
In [138]: import pandas as pd
In [139]: df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})
In [140]: print(df_reshape_test)
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14
In [141]: # Get the new axis separately:
In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)
In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)
In [144]: print(df2)
   ind_var location  location_var
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
In [145]: # Swap the location for the second half
In [146]: # replace each 6 with len(df_reshape_test) if you have more rows.
In [147]: df2['choice'] = [1] * 6 + [0] * 6 # may need to play with this.
In [148]: df2.iloc[6:, df2.columns.get_loc('location')] = df2['location'].iloc[6:].map({'A': 'B', 'B': 'A'}).to_numpy()
In [149]: df2 = df2.sort_index(kind='mergesort') # stable sort keeps the chosen row first
In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)
In [151]: print(df2)
   ind_var location  location_var  choice  dist_S
0        3        A            10       1       0
0        3        B            10       0      50
1        8        A            10       1       0
1        8        B            10       0      50
2       10        A            10       1       0
2       10        B            10       0      50
3        1        B            14       1       0
3        1        A            14       0      50
4        3        B            14       1       0
4        3        A            14       0      50
5        4        B            14       1       0
5        4        A            14       0      50
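It won't generalize well, but there are probably other (better) ways to get around the ugly parts, like generating the choice column. Note that location_var is not swapped for the second half, so the non-chosen rows above still carry the chosen location's value.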
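A possible way around the hard-coded parts is `DataFrame.melt`, which stacks the distance columns into rows. After melting, the choice column is just a comparison between two columns. This is only a sketch: it assumes all distance columns are named `dist_to_<location>`, and the names `dist_cols`, `long_df`, `alt` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})

# Every distance column becomes one row per individual.
dist_cols = [c for c in df.columns if c.startswith('dist_to_')]
long_df = (df.rename_axis('ind').reset_index()
             .melt(id_vars=['ind', 'ind_var', 'location'],
                   value_vars=dist_cols,
                   var_name='alt', value_name='dist_S'))

# 'alt' is the candidate location; 'location' is still the chosen one.
long_df['alt'] = long_df['alt'].str.replace('dist_to_', '', regex=False)
long_df['choice'] = (long_df['alt'] == long_df['location']).astype(int)

# Look up location_var for the candidate location, not the chosen one.
loc_var = df.drop_duplicates('location').set_index('location')['location_var']
long_df['location_var'] = long_df['alt'].map(loc_var)

result = (long_df.drop(columns='location')
                 .rename(columns={'alt': 'location'})
                 .sort_values(['ind', 'location'])
                 .set_index('ind'))
print(result)
```

In this version location_var follows the candidate location, so the non-chosen rows get the other location's value rather than a copy of the chosen one.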
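Answer 2 (score: 2)
OK, this took longer than I expected, but here is a more general answer that works with an arbitrary number of choices per individual. I'm sure there are simpler ways, so it would be great if someone could suggest something better for parts of the following code.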
import numpy as np
import pandas as pd
df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})
which gives
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14
Then we do this:
df.index.names = ['ind']
# Add choice var
df['choice'] = 1
# Create dictionaries we'll use later
ind_to_loc = dict(df['location'])
# gives ind_to_loc equal to {0 : 'A', 1 : 'A', 2 : 'A', 3 : 'B', 4 : 'B', 5: 'B'}
ind_dict = dict(df['ind_var'])
#gives { 0: 3, 1 : 8, 2 : 10, 3: 1, 4 : 3, 5: 4}
loc_dict = dict( df.groupby('location').agg(lambda x : int(np.mean(x)) )['location_var'] )
# gives {'A' : 10, 'B' : 14}
Now I create a MultiIndex and reindex to get the long shape:
df = df.set_index( [df.index, df['location']] )
df.index.names = ['ind', 'location']
# re-index to long shape
loc_list = ['A', 'B']
ind_list = [0, 1, 2, 3, 4, 5]
new_shape = [ (ind, loc) for ind in ind_list for loc in loc_list]
idx = pd.Index(new_shape)
df_long = df.reindex(idx, method = None)
df_long.index.names = ['ind', 'loc']
The long shape looks like this:
         dist_to_A  dist_to_B  ind_var location  location_var  choice
ind loc
0   A            0         50        3        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
1   A            0         50        8        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
2   A            0         50       10        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
3   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        1        B            14       1
4   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        3        B            14       1
5   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        4        B            14       1
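The hard-coded `loc_list` and `ind_list` in the reindexing step could also be derived from the data, which might help when the number of locations or individuals varies. A minimal sketch, assuming the index holds the individual ids and the `location` column lists every location at least once; it uses `pd.MultiIndex.from_product` in place of the list comprehension:

```python
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})
df['choice'] = 1

# Derive both lists from the data instead of typing them out.
ind_list = df.index.tolist()
loc_list = sorted(df['location'].unique())

# Every (individual, location) pair, equivalent to the nested list comprehension.
idx = pd.MultiIndex.from_product([ind_list, loc_list], names=['ind', 'loc'])

df = df.set_index([df.index, df['location']])
df_long = df.reindex(idx)
df_long.index.names = ['ind', 'loc']
print(df_long)
```

Rows for non-chosen locations come out as NaN, exactly as in the table above, so the dictionary-based filling step applies unchanged.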
Now fill in the NaN values using the dictionaries:
df_long['ind_var'] = df_long.index.map(lambda x : ind_dict[x[0]] )
df_long['location'] = df_long.index.map(lambda x : ind_to_loc[x[0]] )
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]] )
# Fill in choice
df_long['choice'] = df_long['choice'].fillna(0)
Finally, all that remains is to create dist_S.
I'll cheat here and assume I can create a nested dictionary like this:
nested_loc = {'A' : {'A' : 0, 'B' : 50}, 'B' : {'A' : 50, 'B' : 0}}
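(This reads: if you are at location A, then location A is 0 km away and location B is 50 km away.)
def nested_f(x):
    # x is one row holding the candidate location ('loc') and the chosen one ('location')
    return nested_loc[x['loc']][x['location']]
df_long = df_long.reset_index()
df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)
df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis=1)
df_long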
which gives the desired result:
    ind loc  ind_var  location_var  choice  dist_S
0     0   A        3            10       1       0
1     0   B        3            14       0      50
2     1   A        8            10       1       0
3     1   B        8            14       0      50
4     2   A       10            10       1       0
5     2   B       10            14       0      50
6     3   A        1            10       0      50
7     3   B        1            14       1       0
8     4   A        3            10       0      50
9     4   B        3            14       1       0
10    5   A        4            10       0      50
11    5   B        4            14       1       0
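The nested dictionary does not have to be typed by hand: the original wide frame already holds the distance from each chosen location to every other location. A minimal sketch of building it from the data, assuming the distance columns are all named `dist_to_<location>` and are constant within each chosen location (`dist_cols` and `dist_table` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})

dist_cols = [c for c in df.columns if c.startswith('dist_to_')]

# One row per chosen location, one column per destination.
dist_table = df.groupby('location')[dist_cols].first()
dist_table.columns = [c[len('dist_to_'):] for c in dist_cols]

# {chosen location: {destination: distance}}
nested_loc = dist_table.to_dict(orient='index')
print(nested_loc)
```

A location that nobody chose would be missing from the outer keys, so this only reproduces the full dictionary when every location is chosen by at least one individual.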