按组获取前n个值无法正常工作

时间:2018-11-22 07:51:03

标签: r dataframe dplyr

请考虑以下数据帧头(10):

Consider the following head(10) of a dataframe

它是由以下dplyr代码生成的:

Fuller_list %>% 
 as.data.frame() %>% 
 select(from_infomap, topic) %>%
 add_count(from_infomap) %>% 
 filter(from_infomap %in% coms_keep) %>% 
 group_by(from_infomap) %>%
 add_count(topic) %>%
 top_n(10, nn) %>%
 head(10)

“ from_infomap”列中有36个不同的社区,“ topic”列中有47个不同的主题。按“ from_infomap”分组,对于前5个社区,每个社区的主题数如下:

enter image description here 我想显示每个社区中出现次数最多的10个主题,按降序排列。我尝试在此处使用

 group_by(from_infomap) %>%
 add_count(topic) %>%
 top_n(10, nn) 

但是,如果我将其绘制出来,它只会返回每个社区的前1个主题:

enter image description here

我不确定自己在做什么错。根据{{​​3}}堆栈溢出查询,对count进行加权的top_n(n,wt)函数应该可以工作,它应按社区对加权的前10个主题进行加权。

如果有人可以提出替代方案或指出我要去哪里错了,将不胜感激。抱歉,屏幕截图很小,因为它很大,因此无法在此处显示整个data.frame。

谢谢!

编辑:不带group_by,add_count和top_n的dput:

n <- Fuller_list %>% 
 as.data.frame() %>% 
 select(from_infomap, topic) %>%
 add_count(from_infomap) %>% 
 filter(from_infomap %in% coms_keep) %>% 
 group_by(from_infomap)

dput(head(n,10)):

structure(list(from_infomap = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L), topic = c("KnysnaFire_thanks_wofire", "Abramjee_caperelief_operationsa", 
"Pick_n_Pay", "Plett_heavy_rain_snow", "Disasters_help_call", 
"KFM_disasters_discussion", "Pick_n_Pay", "Pick_n_Pay", "Pick_n_Pay", 
"Pick_n_Pay"), n = c(30512L, 30512L, 30512L, 6572L, 6572L, 6572L, 
5030L, 5030L, 5030L, 5030L)), row.names = c(NA, -10L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "from_infomap", drop = TRUE, indices = list(
    0:2, 3:5, 6:9), group_sizes = c(3L, 3L, 4L), biggest_group_size = 4L, labels = structure(list(
    from_infomap = c(1L, 3L, 4L)), row.names = c(NA, -3L), class = "data.frame", vars = "from_infomap", drop = TRUE))

应该通过将以下代码添加到上一个块中来重现问题:

  add_count(topic) %>%
  top_n(10,nn) %>%
  ungroup() %>% 
  ggplot(aes(x = fct_reorder(topic,nn),y = nn,fill = from_infomap))+
  geom_col(width = 1)+
  facet_wrap(~from_infomap, scales = "free")+
  coord_flip()+
  theme(plot.title = element_text("Central Players"), 
        plot.subtitle= element_text("Top 10 indegree centrality profiles of the 20 biggest communities.\n Excluding 'starburst' communities."),
        plot.caption = element_text("Source: Twitter"))+
  theme_few()

Halway解决方案:因此,使用@s_t建议的summary方法,我们有以下代码:

Fuller_list %>% 
  as.data.frame() %>% 
  add_count(from_infomap) %>%
  filter(from_infomap %in% coms_keep) %>% 
  group_by(from_infomap,topic) %>%   # group by the topic and community
  summarise(nn = n()) %>%            # count the mentioned arguments
  top_n(10, nn) %>%
  ungroup() %>%
  arrange(from_infomap, nn) %>%
  ggplot(aes(x = fct_reorder(topic,nn),y = nn,fill = from_infomap))+
  geom_col(width = 1)+
  facet_wrap(~from_infomap, scales = "free")+
  coord_flip()+
  theme(plot.title = element_text("Central Players"), 
        plot.subtitle= element_text("Top 10 indegree centrality profiles of the 20 biggest communities.\n Excluding 'starburst' communities."),
        plot.caption = element_text("Source: Twitter"))+
  theme_few()

这将产生:this

哪个是各个社区中正确的top_n(10)。出于所有实际目的,该图现在显示正确的数据。唯一剩下的问题是,安排不会按每个社区 的顺序对各个主题进行排序,而是对整体进行排序。次要问题仅在主题可以按社区排列的情况下才能改善。

1 个答案:

答案 0 :(得分:1)

如果我理解得很好,这可能会有所帮助,您希望计算每个社区中的主题,选择top(X),然后以递减的方式在各个方面进行绘制:

library(ggplot2)
library(dplyr)

data3 <-
  data2 %>%
  select(-n) %>%                     # remove useless column
  group_by(from_infomap,topic) %>%   # group by the topic and community
  summarise(nn = n()) %>%            # count the mentioned arguments
  top_n(5, nn)                       # take the top 5 in this case

现在我们按照here所述处理订单:

data4 <- data3 %>% 
         ungroup() %>%  
         arrange(from_infomap, nn) %>%  
         mutate(topic_r = row_number()) 

最后一个情节:

ggplot(data4, aes(topic_r, nn,fill = from_infomap)) + 
geom_col() +
facet_wrap(~ from_infomap, scales = "free") +
scale_x_continuous(  
                   breaks = d$topic_r,  
                   labels = d$topic
                  ) +
coord_flip()

enter image description here

我使用了一些虚假数据,例如:

data2 <- data.frame(from_infomap =floor(runif(200, 1,5)) ,
                    topic = sample(letters[1:20], 200, TRUE),
                    n = floor(runif(200, 10,50)) )

社区中的许多主题具有相同的编号,因此您不会仅看到5列。