Question

I have a DataFrame like this (minimal reproducible example):

 Search_Term  Exit_Pages      Ratio_x Date_x   Ratio_y Date_y
 hello        /store/catalog  .20     8/30/17  .25     7/30/17
 hello        /store/product  .15     8/30/17  .10     7/30/17
 goodbye      /store/search   .35     8/30/17  .20     7/30/17
 goodbye      /store/product  .25     8/30/17  .40     7/30/17

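What I am trying to do is first group by search term and then, for each search term, find the greater of Ratio_x and Ratio_y (while keeping all the remaining columns in the DataFrame). So the output I would like to see is: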

 Search_Term  Exit_Pages      Ratio_x Date_x   Ratio_y Date_y   Highest_Ratio
 hello        /store/catalog  .20     8/30/17  .25     7/30/17  .25
 hello        /store/product  .15     8/30/17  .10     7/30/17
 goodbye      /store/search   .35     8/30/17  .20     7/30/17
 goodbye      /store/product  .25     8/30/17  .40     7/30/17  .40

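What I tried was to group on Search_Term and use apply with the following greater-of-two-columns function (afterwards I would join the result back onto my original data to include the values above, but the error message stops me before I get to that step):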

def Greater(Merge, maximumA, maximumB):
    a = Merge[maximumA]
    b = Merge[maximumB]
    return max(a,b)

Merger.groupby("Search_Term").apply(Greater, "Ratio_x","Ratio_y")

This gives me the error message: ValueError: The truth value of a Series is 
ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Is there a small modification that would make my code work and, if so, what is it? If not, what exactly is the problem and how can I solve it?

Answer 1

Maybe you want groupby + transform?

df['Highest_Ratio'] = df.groupby('Search_Term')\
            [['Ratio_x', 'Ratio_y']].transform('max').max(axis=1)

df['Highest_Ratio']

0    0.25
1    0.25
2    0.40
3    0.40
Name: Highest_Ratio, dtype: float64
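As for the error in the question: inside apply, Merge[maximumA] and Merge[maximumB] are each a whole Series for the group, and Python's built-in max compares them with > and then needs a single True/False, which pandas refuses to give for a Series. A minimal self-contained sketch of the transform approach, with the sample frame rebuilt by hand from the question (the group_max name is only for illustration):

```python
import pandas as pd

# Sample frame rebuilt from the question; dates are kept as plain strings.
df = pd.DataFrame({
    "Search_Term": ["hello", "hello", "goodbye", "goodbye"],
    "Exit_Pages": ["/store/catalog", "/store/product",
                   "/store/search", "/store/product"],
    "Ratio_x": [0.20, 0.15, 0.35, 0.25],
    "Date_x": ["8/30/17"] * 4,
    "Ratio_y": [0.25, 0.10, 0.20, 0.40],
    "Date_y": ["7/30/17"] * 4,
})

# Per-group maximum of each ratio column, broadcast back onto every row.
group_max = df.groupby("Search_Term")[["Ratio_x", "Ratio_y"]].transform("max")

# The larger of the two per-group maxima, taken row by row.
df["Highest_Ratio"] = group_max.max(axis=1)
```

Because transform returns a result aligned with the original index, no join back onto the original frame should be needed.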

You can take one more step with np.where to get your exact output:

import numpy as np

m = df['Highest_Ratio'].eq(df['Ratio_x']) | df['Highest_Ratio'].eq(df['Ratio_y'])
df['Highest_Ratio'] = np.where(m, df['Highest_Ratio'], '')

df

  Search_Term      Exit_Pages  Ratio_x   Date_x  Ratio_y   Date_y  \
0       hello  /store/catalog     0.20  8/30/17     0.25  7/30/17   
1       hello  /store/product     0.15  8/30/17     0.10  7/30/17   
2     goodbye   /store/search     0.35  8/30/17     0.20  7/30/17   
3     goodbye  /store/product     0.25  8/30/17     0.40  7/30/17   

  Highest_Ratio  
0          0.25  
1                
2                
3           0.4

Keep in mind that it is best to skip this step, since mixing strings and floats is not optimal for performance.
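If the blanks are still wanted, one possible alternative (not part of the original answer) is to mask with Series.where, which fills the non-matching rows with NaN and so keeps the column as float64. A sketch under that assumption, again rebuilding only the columns it needs:

```python
import pandas as pd

# Only the columns needed for the calculation, rebuilt from the question.
df = pd.DataFrame({
    "Search_Term": ["hello", "hello", "goodbye", "goodbye"],
    "Ratio_x": [0.20, 0.15, 0.35, 0.25],
    "Ratio_y": [0.25, 0.10, 0.20, 0.40],
})

# Group-wise maximum across both ratio columns, aligned to the original rows.
highest = (
    df.groupby("Search_Term")[["Ratio_x", "Ratio_y"]]
      .transform("max")
      .max(axis=1)
)

# True only on the row that actually holds its group's maximum.
m = highest.eq(df["Ratio_x"]) | highest.eq(df["Ratio_y"])

# Series.where keeps values where m is True and writes NaN elsewhere,
# so the column stays numeric instead of becoming strings.
df["Highest_Ratio"] = highest.where(m)
```

NaN displays as a gap in most outputs and can be replaced with an empty string at presentation time if required.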
