Question

I have a data frame with three variables (document, topic, and gamma):

document    topic   gamma
1            1      0.932581726
1            2      0.015250915
1            3      0.009929329
2            1      0.032864538
2            2      0.012939786
2            3      0.13281681

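I want to create a vector containing each document's topic, based on the highest gamma value: a document belongs to whichever topic has the highest gamma for it.

I tried some code, but I am not sure whether it is correct.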

a2<-function(x){
  i=1
while(i< 110)
  for(j in 1:7)
    x= max(ap_documents$gamma)
  return(j)
  }
a3<-sapply(ap_documents,a2)

Answer 1

Here is one way with dplyr:

library(dplyr)
df %>%
  group_by(document) %>%
  filter(gamma == max(gamma))
#output
# A tibble: 2 x 3
# Groups: document [2]
  document topic gamma
     <int> <int> <dbl>
1        1     1 0.933
2        2     3 0.133

In base R, you can use aggregate:

aggregate(gamma ~ document, max, data = df)
#output
  document     gamma
1        1 0.9325817
2        2 0.1328168

If you want to keep the topic column, you can merge it back in:

merge(aggregate(gamma ~ document, max, data = df), df)
#output
  document     gamma topic
1        1 0.9325817     1
2        2 0.1328168     3
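The results above are data frames, whereas the question asks for a vector. A minimal base R sketch of that last step follows. It assumes the question's table is stored in df with integer document and topic columns; the name topic_by_document is made up for illustration.

```r
# Data from the question; the integer column types are an assumption.
df <- data.frame(
  document = c(1L, 1L, 1L, 2L, 2L, 2L),
  topic    = c(1L, 2L, 3L, 1L, 2L, 3L),
  gamma    = c(0.932581726, 0.015250915, 0.009929329,
               0.032864538, 0.012939786, 0.13281681)
)

# One row per document: the row holding that document's maximum gamma.
best <- merge(aggregate(gamma ~ document, max, data = df), df)

# Named vector: names are document ids, values are the assigned topics.
# (topic_by_document is a hypothetical name, not from the original answer.)
topic_by_document <- setNames(best$topic, best$document)
topic_by_document
```

If two topics tie for the maximum gamma within a document, merge returns both rows, so the vector would then contain more than one entry for that document.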

Answer 2

While the other solutions work well, I want to mention the top_n function from dplyr, which was built for this kind of task:

library(dplyr)

my_df %>% 
  group_by(document) %>% 
  top_n(1, gamma)

# A tibble: 2 x 3
# Groups:   document [2]
#   document topic gamma
#      <int> <int> <dbl>
# 1        1     1 0.933
# 2        2     3 0.133

Another simple base R solution is:

my_df <- my_df[order(my_df$gamma, decreasing = TRUE), ]
my_df[!duplicated(my_df$document), ]
#   document topic     gamma
# 1        1     1 0.9325817
# 6        2     3 0.1328168
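In dplyr 1.0.0 and later, top_n is superseded by slice_max. Below is a minimal sketch of the same per-document selection, assuming the question's table is stored in my_df with integer document and topic columns; the name res is made up for illustration.

```r
library(dplyr)

# Data from the question; the integer column types are an assumption.
my_df <- data.frame(
  document = c(1L, 1L, 1L, 2L, 2L, 2L),
  topic    = c(1L, 2L, 3L, 1L, 2L, 3L),
  gamma    = c(0.932581726, 0.015250915, 0.009929329,
               0.032864538, 0.012939786, 0.13281681)
)

# Keep, within each document, the row with the largest gamma.
# (res is a hypothetical name, not from the original answer.)
res <- my_df %>%
  group_by(document) %>%
  slice_max(gamma, n = 1)

res$topic
```

By default slice_max keeps tied rows (with_ties = TRUE); passing with_ties = FALSE would return exactly one row per document.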

Answer 3

If I understand what you want, you can do it with dplyr.

library(dplyr)

result <- df %>% 
    group_by(document) %>% 
    slice(which.max(gamma))

result
## A tibble: 2 x 3
## Groups:   document [2]
#  document topic gamma
#     <dbl> <dbl> <dbl>
#1       1.    1. 0.933
#2       2.    3. 0.133
