pandas: extract rows based on a complex condition

Time: 2016-12-11 20:26:06

Tags: python pandas

I have this DataFrame:

            source         target  weight
24517        class         social      31
24356        class    proletariat      29
16189   bourgeoisi          class      29
24519        class        societi      29
24710        class           work      28
15375   bourgeoisi          class      26
23724        class         condit      24
24314        class          polit      24

...

How can I create a new DataFrame that contains the following:

            source         target  weight
24517        class         social      31 # because it's the strongest pair for 'class'
24356        class    proletariat      29 # bc it's the strongest for 'proletariat'
16189   bourgeoisi          class      29 # bc strongest for 'bourgeoisi'
24519        class        societi      29 # bc strongest for 'societi'
24710        class           work      28 # bc strongest for 'work'

But not, for example:

15375     bourgeoisi          class      26 # bc it is not the strongest pair for either 'bourgeoisi' or 'class'

...

An alternative DataFrame for testing. The code should drop the third row (index 8):

0     ape    dog       3
1     ape   hors       3
8     dog   hors       2
2     ape     la       1

4 answers:

Answer 0 (score: 0)

You can try:

import pandas as pd

data = pd.DataFrame({"source": ["class", "class", "bourgeoisi",
                            "class", "class", "bourgeoisi",
                            "class", "class"],
                 "target": ["social", "proletariat", "class",
                            "societi", "work", "class", "condit",
                            "polit"],
                 "weight": [31, 29, 29, 29, 28, 26, 24, 24]})

grouped = data.groupby(['source', 'target']).max().sort_values('weight', ascending=False).reset_index()

Result:

       source       target  weight
0       class       social      31
1  bourgeoisi        class      29
2       class  proletariat      29
3       class      societi      29
4       class         work      28
5       class       condit      24
6       class        polit      24

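Explanation: we group the records by source and target and take the maximum weight in each group, then sort the values by weight in descending order, and finally reset the index so that source and target become columns again.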

[Edit] Based on the example, I think you only need to group by 'target':

data2 = pd.DataFrame({"source": ["ape", "ape", "dog", "ape"],
                      "target": ["dog", "hors", "hors", "la"],
                      "weight": [3, 3, 2, 1]})

grouped = data2.groupby(['target']).max().sort_values('weight', ascending=False).reset_index()
grouped = grouped[data2.columns.tolist()]  # bring back the column order
print(grouped)

Result:

  source target  weight
0    ape    dog       3
1    dog   hors       3
2    ape     la       1
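
Note that `groupby(['target']).max()` takes the maximum of each column independently, so the `dog / hors / 3` row in the result pairs the largest `source` string with the largest `weight` and is not a row of `data2`. Below is a minimal sketch of a variant that keeps whole rows, assuming it is acceptable to pick rows with `idxmax` on the weight column (on ties it returns the first matching row). The names `best_idx` and `strongest_per_target` are illustrative only.

```python
import pandas as pd

data2 = pd.DataFrame({"source": ["ape", "ape", "dog", "ape"],
                      "target": ["dog", "hors", "hors", "la"],
                      "weight": [3, 3, 2, 1]})

# Index label of the heaviest row within each target group
best_idx = data2.groupby("target")["weight"].idxmax()

# Select those complete rows, so source, target and weight stay together
strongest_per_target = (data2.loc[best_idx]
                        .sort_values("weight", ascending=False, kind="stable")
                        .reset_index(drop=True))
print(strongest_per_target)
```

This still groups by `target` only, so it inherits that limitation of the edit above: it does not look at the strongest pair for each `source`.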

Answer 1 (score: 0)

You can try this alternative:

result = data.sort_values('weight').groupby(['source','target'])['weight'].apply(lambda x: x.iloc[-1]).reset_index()

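What this does:

  1. Sort all rows by 'weight'.
  2. Group the rows by ['source','target'].
  3. In each resulting group, select the last entry, because by default sorting goes from smaller to larger values.
  4. Reset the index to get ['source','target'] back as columns rather than as the index.
  5. Go for a coffee break and think about the next task.

    I hope this helps.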
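
The steps above can be run end to end on the sample data. This is a small sketch, assuming the frame is rebuilt by hand as in Answer 0 (the variable name `data` is borrowed from there); it only illustrates that the approach yields one row per (source, target) pair.

```python
import pandas as pd

data = pd.DataFrame({"source": ["class", "class", "bourgeoisi", "class",
                                "class", "bourgeoisi", "class", "class"],
                     "target": ["social", "proletariat", "class", "societi",
                                "work", "class", "condit", "polit"],
                     "weight": [31, 29, 29, 29, 28, 26, 24, 24]})

result = (data.sort_values("weight")              # ascending: smallest first
              .groupby(["source", "target"])["weight"]
              .apply(lambda x: x.iloc[-1])        # last entry = largest weight in the group
              .reset_index())                     # source/target back as columns

print(len(data), "rows in,", len(result), "rows out")
print(result)
```

The duplicated pair `bourgeoisi / class` (weights 29 and 26) collapses to its heavier row; every other pair is already unique and passes through unchanged.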

Answer 2 (score: 0)

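Your DataFrame essentially represents the weighted edge list of a graph, and you want to find the maximum-weight rows for all nodes, which are spread across two separate columns. To use pandas' groupby() syntax, you need to bring the nodes together into a single column, either by reshaping the DataFrame or by copying and concatenating it. Here is a concatenation option: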

idx = (pd.concat([df, df.rename(columns={'source': 'target', 'target': 'source'})])
       # switch the source and target columns and concatenate to the original data frame

       .groupby('source', group_keys=False)
       # now the source column contains all the nodes, you can group by it

       .apply(lambda g: g.weight == g.weight.max())[lambda x: x].index.drop_duplicates())
       # for each node find out the max weight rows index for subsetting/filtering

df.loc[idx,:]
#           source       target   weight
#16189  bourgeoisi        class       29          # max for bourgeoisi
#24517       class       social       31          # max for class and social
#23724       class       condit       24          # max for condit
#24314       class        polit       24          # max for polit
#24356       class  proletariat       29          # max for proletariat
#24519       class      societi       29          # max for societi
#24710       class         work       28          # max for work
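
The same idea can be checked against the small test frame from the question. This is a sketch under my own assumptions rather than the answer's exact code: it computes the strongest weight per node from a stacked copy and then filters the original rows with a boolean mask. The names `long`, `node_max` and `keep` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"source": ["ape", "ape", "dog", "ape"],
                   "target": ["dog", "hors", "hors", "la"],
                   "weight": [3, 3, 2, 1]},
                  index=[0, 1, 8, 2])

# One row per edge endpoint: every node ends up in a single "node" column
long = pd.concat([
    df[["source", "weight"]].rename(columns={"source": "node"}),
    df[["target", "weight"]].rename(columns={"target": "node"}),
])

# Strongest weight seen for each node, whichever column it came from
node_max = long.groupby("node")["weight"].max()

# Keep an edge if it is the strongest one for at least one of its endpoints
keep = ((df["weight"] == df["source"].map(node_max))
        | (df["weight"] == df["target"].map(node_max)))
result = df[keep]
print(result)
```

Like the `==` comparison in the answer, this keeps every tied row for a node instead of picking one arbitrarily.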

Answer 3 (score: 0)

Just for fun, here is the answer in a single line of code:

pd.concat([df.assign(var_type = lambda x: x['source']),
           df.assign(var_type = lambda x: x['target'])])\
  .sort_values(['var_type', 'weight'], ascending=[True, False])\
  .groupby('var_type')\
  .first().reset_index(drop=True)

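Alternatively, you can first add a unique ID to each row, then use melt to put source and target into a single column, and then merge source and target back onto the long table. From this long table it is now easy to get the maximum: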

import numpy as np

df['id'] = np.arange(len(df))  # unique row ID, used to merge source/target back after melt
df1 = pd.melt(df, id_vars=['id', 'weight'], var_name='var_type', value_name='label')
df2 = df1.merge(df[['id', 'source', 'target']], on='id')
df3 = df2.sort_values(['label', 'weight'], ascending=[True, False])
df3.groupby(['label']).first().reset_index(drop=True)[['source', 'target', 'weight']]

       source       target  weight
0  bourgeoisi        class      29
1       class       social      31
2       class       condit      24
3       class        polit      24
4       class  proletariat      29
5       class       social      31
6       class      societi      29
7       class         work      28
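
In the output above, `class / social / 31` appears twice (rows 1 and 5) because that edge is the strongest one for both 'class' and 'social'. If each edge should appear only once, a `drop_duplicates()` call at the end is one option. The sketch below rebuilds the sample data by hand and uses the concat variant; the column name `label` and the variables `long`, `best` and `deduped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"source": ["class", "class", "bourgeoisi", "class",
                              "class", "bourgeoisi", "class", "class"],
                   "target": ["social", "proletariat", "class", "societi",
                              "work", "class", "condit", "polit"],
                   "weight": [31, 29, 29, 29, 28, 26, 24, 24]})

# Stack the frame twice, labelling each copy by one endpoint of the edge
long = pd.concat([df.assign(label=df["source"]),
                  df.assign(label=df["target"])])

# For each label, the first row after sorting is its heaviest edge
best = (long.sort_values(["label", "weight"], ascending=[True, False])
            .groupby("label")
            .first()
            .reset_index(drop=True))

# An edge that wins for both of its endpoints shows up twice; keep it once
deduped = best.drop_duplicates().reset_index(drop=True)
print(deduped)
```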