Question

I have a data frame dat1

   Country Count
1      AUS     1
2       NZ     2
3       NZ     1
4      USA     3
5      AUS     1
6      IND     2
7      AUS     4
8      USA     2
9      JPN     5
10      CN     2

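First, I want to sum "Count" for each "Country". Then the three countries with the highest totals should be kept and combined with one additional row, "Others", which holds the sum of all countries outside the top three.

The expected result would therefore be: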

    Country Count
1     AUS     6
2     JPN     5
3     USA     5
4     Others  7

I have tried the following code, but could not figure out how to add the "Others" row.

dat1 %>%
    group_by(Country) %>%
    summarise(Count = sum(Count)) %>%
    arrange(desc(Count)) %>%
    top_n(3)

This code currently gives:

    Country Count
1     AUS     6
2     JPN     5
3     USA     5

Any help is much appreciated.

dat1 <- structure(list(Country = structure(c(1L, 5L, 5L, 6L, 1L, 3L, 
    1L, 6L, 4L, 2L), .Label = c("AUS", "CN", "IND", "JPN", "NZ", 
    "USA"), class = "factor"), Count = c(1L, 2L, 1L, 3L, 1L, 2L, 
    4L, 2L, 5L, 2L)), .Names = c("Country", "Count"), class = "data.frame",     row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Answer 1

Instead of top_n, this seems like a good case for the convenience function tally. It uses summarise, sum and arrange.

Then use factor to create an "Other" category. Use the levels argument to set "Other" as the last level, so that "Other" is placed last in the table (and in any subsequent plot of the result).

If "Country" is a factor in the original data, you can wrap Country[1:3] in as.character.
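A minimal sketch of the steps described above, assuming dplyr is loaded and the data contains at least three countries. It is one way to write the approach, not necessarily the answerer's exact code; the names totals, top3 and result are illustrative.

```r
library(dplyr)

# Data from the question, rebuilt here so the sketch is self-contained
dat1 <- data.frame(
  Country = factor(c("AUS", "NZ", "NZ", "USA", "AUS", "IND", "AUS", "USA", "JPN", "CN")),
  Count   = c(1L, 2L, 1L, 3L, 1L, 2L, 4L, 2L, 5L, 2L)
)

# Sum Count per country and sort in descending order; tally() names the total `n`
totals <- dat1 %>%
  group_by(Country) %>%
  tally(Count, sort = TRUE)

# The three largest countries (as.character because Country is a factor)
top3 <- as.character(totals$Country[1:3])

# Relabel every row after the third as "Other", with "Other" as the last level
result <- totals %>%
  group_by(Country = factor(c(top3, rep("Other", n() - 3)),
                            levels = c(top3, "Other"))) %>%
  summarise(Count = sum(n))

result
```

Because "Other" is the last factor level, it should appear as the last row of the summary. If there are exactly three countries, no row is relabelled and no "Other" row appears.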

Answer 2

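We can do it in two steps: first create a sorted data.frame (here from dat1, the data in the question), then rbind the top three rows with a summary of the remaining rows:

d <- dat1 %>% group_by(Country) %>%
  summarise(Count = sum(Count)) %>% arrange(desc(Count))

# top_n() keeps ties and may return more than 3 rows; slice(d, 1:3) would not
rbind(top_n(d, 3, Count),
      slice(d, 4:n()) %>% summarise(Country = "other", Count = sum(Count))
      )

Output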

  Country Count
   (fctr) (int)
1     AUS     6
2     JPN     5
3     USA     5
4   other     7

Answer 3

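Here is an option using data.table. We convert the 'data.frame' to a 'data.table' (setDT(dat1)); grouped by 'Country', we get the sum of 'Count', then order by 'Count', and rbind the first three observations with a list containing 'Others' and the sum of 'Count' for the rest of the observations.

library(data.table)
setDT(dat1)[, list(Count=sum(Count)), Country][order(-Count),
    rbind(.SD[1:3], list(Country='Others', Count=sum(.SD[[2]][4:.N])))]
#   Country Count
#1:     AUS     6
#2:     USA     5
#3:     JPN     5
#4:  Others     7

Or using base R

d1 <- aggregate(.~Country, dat1, FUN=sum)
i1 <- order(-d1$Count)
rbind(d1[i1,][1:3,], data.frame(Country='Others', Count=sum(d1$Count[i1][4:nrow(d1)])))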

Answer 4

You could even use xtabs() and manipulate the result. This is a base R answer.

s <- sort(xtabs(Count ~ ., dat1), decreasing = TRUE)
setNames(
    as.data.frame(as.table(c(head(s, 3), Others = sum(tail(s, -3)))), 
    names(dat1)
)
#   Country Count
# 1     AUS     6
# 2     JPN     5
# 3     USA     5
# 4  Others     7

Answer 5

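A function that some may find useful:

library(dplyr)  # for %>% and mutate()

top_cases = function(v, top, other = 'other'){
  was_factor = is.factor(v)   # also TRUE for ordered factors
  v = as.character(v)
  v[factor(v, levels = top) %>% is.na()] = other   # anything not in `top` becomes `other`
  if(was_factor) v = factor(v, levels = c(top, other))
  v
}

E.g.: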

> table(state.region)
state.region
    Northeast         South North Central          West 
            9            16            12            13 
> top_cases(state.region, c('South','West'), 'North') %>% table()
.
South  West North 
   16    13    21

iris %>% mutate(Species = top_cases(Species, c('setosa','versicolor')))

Answer 6

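For those interested in the case where categories below a certain percentage are lumped into an 'Other' category, here is some code.

Here, any value below 5% goes into the 'Other' category. The 'Other' row holds the total of those values, and its label includes the number of categories that were aggregated into it.

# sub is assumed to be a data frame with a label column followed by a numeric 'value' column of proportions
othernum <- nrow(sub[(sub$value < .05), ])   # number of categories below 5%
sub <- subset(sub, value >= .05)             # keep the rest (>= so a value of exactly 5% is not dropped)
# a list (rather than c()) keeps 'value' numeric; 1 - sum() assumes the values sum to 1
toplot <- rbind(sub, list(paste("Other (", othernum, " types)", sep=""), 1 - sum(sub$value)))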
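For a self-contained illustration of the same thresholding idea, here is a small sketch in base R. The data frame shares, its column names type and value, and the numbers in it are made up for the example; only the 5% cutoff comes from the answer.

```r
# Hypothetical proportions that sum to 1 (illustrative data, not from the answer)
shares <- data.frame(
  type  = c("a", "b", "c", "d", "e"),
  value = c(0.50, 0.30, 0.13, 0.04, 0.03),
  stringsAsFactors = FALSE
)

threshold <- 0.05
small   <- shares$value < threshold   # TRUE for categories below the cutoff
n_small <- sum(small)                 # how many categories get lumped together

# One "Other" row whose label records how many categories it absorbs
other_row <- data.frame(
  type  = paste0("Other (", n_small, " types)"),
  value = sum(shares$value[small]),
  stringsAsFactors = FALSE
)

# Keep the large categories and append the lumped row; value stays numeric
toplot <- rbind(shares[!small, ], other_row)
toplot
```

Summing the small values directly, instead of taking 1 minus the sum of the large ones, also works when the values do not add up to exactly 1.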

Answer 7

You can use fct_lump from the forcats library

dat1 %>%
  group_by(fct_lump(Country, n = 3, w = Count)) %>%
  summarize(Count = sum(Count))

That should do it. You can also use the other_level argument of fct_lump to change the "Other" label.
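As a sketch of the other_level argument mentioned above, the following relabels the lumped level as "Others" to match the question's expected output. It assumes dplyr and forcats are installed; naming the grouping column Country inside group_by is a choice made here so the result keeps the original column name.

```r
library(dplyr)
library(forcats)

# Data from the question, rebuilt here so the sketch is self-contained
dat1 <- data.frame(
  Country = factor(c("AUS", "NZ", "NZ", "USA", "AUS", "IND", "AUS", "USA", "JPN", "CN")),
  Count   = c(1L, 2L, 1L, 3L, 1L, 2L, 4L, 2L, 5L, 2L)
)

# Keep the 3 countries with the largest weighted totals (w = Count);
# everything else is lumped into a level named "Others"
result <- dat1 %>%
  group_by(Country = fct_lump(Country, n = 3, w = Count, other_level = "Others")) %>%
  summarise(Count = sum(Count))

result
```

The rows follow the factor's level order rather than descending Count, so add arrange() if a sorted table is needed. Recent forcats versions also provide fct_lump_n for the "keep the n largest" case.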

Combine the result of top_n with an "Other" category in dplyr
