I have this data frame:
source target weight
24517 class social 31
24356 class proletariat 29
16189 bourgeoisi class 29
24519 class societi 29
24710 class work 28
15375 bourgeoisi class 26
23724 class condit 24
24314 class polit 24
...
How can I create a new data frame that contains only the following rows:
source target weight
24517 class social 31 # because it's the strongest pair for 'class'
24356 class proletariat 29 # bc it's the strongest for 'proletariat'
16189 bourgeoisi class 29 # bc strongest for 'bourgeoisi'
24519 class societi 29 # bc strongest for 'societi'
24710 class work 28 # bc strongest for 'work'
but not, for example:
15375 bourgeoisi class 26 # bc it is not the strongest pair for either 'bourgeoisi' or 'class'
...
An alternative data frame for testing. The code should drop the third row (index 8):
0 ape dog 3
1 ape hors 3
8 dog hors 2
2 ape la 1
Answer 0 (score: 0)
You can try:
import pandas as pd
data = pd.DataFrame({"source": ["class", "class", "bourgeoisi",
"class", "class", "bourgeoisi",
"class", "class"],
"target": ["social", "proletariat", "class",
"societi", "work", "class", "condit",
"polit"],
"weight": [31, 29, 29, 29, 28, 26, 24, 24]})
grouped = data.groupby(['source', 'target']).max().sort_values('weight', ascending=False).reset_index()
Result:
source target weight
0 class social 31
1 bourgeoisi class 29
2 class proletariat 29
3 class societi 29
4 class work 28
5 class condit 24
6 class polit 24
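Explanation: we group the records by source and target and take the maximum weight in each group, then sort by weight in descending order, and finally reset the index so that source and target become columns again.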
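As a hedged aside, the same per-pair maximum can also be obtained while keeping the original row labels, by sorting and then dropping repeated pairs. The sketch below assumes the index values shown in the question; `pairs_max` is an illustrative name, not part of the answer above.

```python
import pandas as pd

# Rows and index labels copied from the question
data = pd.DataFrame(
    {"source": ["class", "class", "bourgeoisi", "class",
                "class", "bourgeoisi", "class", "class"],
     "target": ["social", "proletariat", "class", "societi",
                "work", "class", "condit", "polit"],
     "weight": [31, 29, 29, 29, 28, 26, 24, 24]},
    index=[24517, 24356, 16189, 24519, 24710, 15375, 23724, 24314],
)

# Sort strongest first, then keep the first row seen for each (source, target) pair
pairs_max = (
    data.sort_values("weight", ascending=False, kind="stable")
        .drop_duplicates(["source", "target"])
)
print(pairs_max)
```

Note that this removes only repeated (source, target) pairs, such as the second bourgeoisi/class row. It does not look for the strongest pair per individual word.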
[Edit] Based on the example, I think you only need to group by 'target':
data2 = pd.DataFrame({"source": ["ape", "ape", "dog", "ape"],
"target": ["dog", "hors", "hors", "la"],
"weight": [3, 3, 2, 1]})
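# idxmax selects whole rows; a plain .max() takes each column's maximum separately
# and can combine values from different rows (e.g. "dog hors 3", which is not in the data)
grouped = data2.loc[data2.groupby('target')['weight'].idxmax()].sort_values('weight', ascending=False, kind='stable').reset_index(drop=True)
print(grouped)
Result:
source target weight
0 ape dog 3
1 ape hors 3
2 ape la 1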
Answer 1 (score: 0)
You can try this alternative:
result = data.sort_values('weight').groupby(['source','target'])['weight'].apply(lambda x: x.iloc[-1]).reset_index()
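What this does:
1. Sorts the rows by 'weight' in ascending order.
2. Groups them by ['source','target'].
3. Selects the 'weight' column and takes the last (that is, the largest) value in each group.
4. Resets the index so that ['source','target'] come back as columns rather than as the index.
5. Go take a coffee break and think about the next task.
I hope this helps.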
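The steps above can be checked against a shorter form. This is a minimal sketch on three rows taken from the question (an assumption made to keep it small); `via_apply` and `via_max` are illustrative names. Taking the last value after an ascending sort should give the same weights as asking for the group maximum directly.

```python
import pandas as pd

# Small subset of the question's rows, including the repeated bourgeoisi/class pair
data = pd.DataFrame({"source": ["class", "bourgeoisi", "bourgeoisi"],
                     "target": ["social", "class", "class"],
                     "weight": [31, 29, 26]})

# The answer's approach: sort ascending, then take the last weight in each group
via_apply = (
    data.sort_values("weight")
        .groupby(["source", "target"])["weight"]
        .apply(lambda x: x.iloc[-1])
        .reset_index()
)

# Shorter form: ask for the maximum weight per group directly
via_max = data.groupby(["source", "target"], as_index=False)["weight"].max()

print(via_apply)
print(via_max)
```

Both variants collapse repeated pairs only, so rows with distinct pairs all survive.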
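Answer 2 (score: 0)
Your data frame is essentially the weighted edge list of a graph, and you want the maximum-weight rows for every node, where the nodes are spread across two separate columns. To use the groupby() syntax in pandas, you need to gather the nodes into a single column, either by reshaping the data frame or by duplicating and concatenating it. Here is a concatenation option: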
idx = (pd.concat([df, df.rename(columns={'source': 'target', 'target': 'source'})])
# switch the source and target columns and concatenate to the original data frame
.groupby('source', group_keys=False)
# now the source column contains all the nodes, you can group by it
.apply(lambda g: g.weight == g.weight.max())[lambda x: x].index.drop_duplicates())
# for each node find out the max weight rows index for subsetting/filtering
df.loc[idx,:]
# source target weight
#16189 bourgeoisi class 29 # max for bourgeoisi
#24517 class social 31 # max for class and social
#23724 class condit 24 # max for condit
#24314 class polit 24 # max for polit
#24356 class proletariat 29 # max for proletariat
#24519 class societi 29 # max for societi
#24710 class work 28 # max for work
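To try the concatenation idea on the small test frame from the question, here is a minimal self-contained sketch. It swaps in `transform("max")` for the `apply` call above and carries the original row label in a helper column; the names `row`, `both`, `node_max` and `keep` are illustrative assumptions, not part of the answer.

```python
import pandas as pd

# Test frame from the question, with its original index labels
df = pd.DataFrame(
    {"source": ["ape", "ape", "dog", "ape"],
     "target": ["dog", "hors", "hors", "la"],
     "weight": [3, 3, 2, 1]},
    index=[0, 1, 8, 2],
)

# Keep the original row label in a column so it survives the concatenation
edges = df.rename_axis("row").reset_index()

# Stack the edge list on a source/target-swapped copy:
# every node now appears in the 'source' column
swapped = edges.rename(columns={"source": "target", "target": "source"})
both = pd.concat([edges, swapped], ignore_index=True)

# Maximum weight per node, broadcast back onto each row
node_max = both.groupby("source")["weight"].transform("max")

# A row is kept if it is the strongest pair for at least one of its two nodes
keep = both.loc[both["weight"] == node_max, "row"].unique()
result = df.loc[df.index.isin(keep)]
print(result)
```

Because the comparison is against the per-node maximum, tied rows are all kept, and the row with index 8 is dropped as the question asks.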
Answer 3 (score: 0)
Just for fun, here is the answer in a single line of code:
pd.concat([df.assign(var_type = lambda x: x['source']),
df.assign(var_type = lambda x: x['target'])])\
.sort_values(['var_type', 'weight'], ascending=[True, False])\
.groupby('var_type')\
.first().reset_index(drop=True)
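Alternatively, you can first add a unique ID to each row, then use melt to put source and target into a single column, and then merge source and target back onto the long table. The maximum for each label is then easy to get from this long table:
import numpy as np
df['id'] = np.arange(len(df))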
df1 = pd.melt(df, id_vars=['id', 'weight'], var_name='var_type', value_name='label')
df2 = df1.merge(df[['id', 'source', 'target']], on='id')
df3 = df2.sort_values(['label', 'weight'], ascending=[True, False])
df3.groupby(['label']).first().reset_index(drop=True)[['source', 'target', 'weight']]
source target weight
0 bourgeoisi class 29
1 class social 31
2 class condit 24
3 class polit 24
4 class proletariat 29
5 class social 31
6 class societi 29
7 class work 28
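The output above lists the class/social row twice (once for each of its two labels) and loses the original index. A hedged sketch of one way to deduplicate through the `id` column and recover the original rows, shown on the question's test frame; `with_id`, `long` and `best_ids` are illustrative names.

```python
import numpy as np
import pandas as pd

# Test frame from the question, with its original index labels
df = pd.DataFrame(
    {"source": ["ape", "ape", "dog", "ape"],
     "target": ["dog", "hors", "hors", "la"],
     "weight": [3, 3, 2, 1]},
    index=[0, 1, 8, 2],
)

# Positional id per row, without modifying df itself
with_id = df.assign(id=np.arange(len(df)))

# Long table: one line per (row, endpoint), endpoints gathered in 'label'
long = with_id.melt(id_vars=["id", "weight"], value_vars=["source", "target"],
                    var_name="var_type", value_name="label")

# For each label, the id of its strongest row
# (first() keeps a single row per label, so a tie keeps only one of the tied rows)
best_ids = (
    long.sort_values(["label", "weight"], ascending=[True, False])
        .groupby("label")["id"]
        .first()
)

# Deduplicate the ids and select the original rows by position
result = df.iloc[sorted(set(best_ids))]
print(result)
```

Selecting by position from the original frame keeps each row once and keeps its original index label.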