Question

I'm fairly new to R, and I'm working on a census study for a university project. For illustration, this is part of my data.frame:

             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA

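My problem is that I need to sum some municipality rows whose names I know or will specify (because I don't know the order in which they will appear, or whether they will appear in all of my tables), and the result should be shown in a row of its own.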

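For example, I want to add the row "Areal" to the row "Angra dos Reis" and store the result in a newly created row (let's call the result row X), so the result should be: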

             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6          X                 11    10    10    13

I tried writing a for loop with an if statement, but I couldn't get it to work.

Answer 1

This is very similar to Jaap's comment, but spelled out in more detail and using the row names explicitly:

mat = as.matrix(dat[, 2:5])
row.names(mat) = dat$MUN
mat = rbind(mat, colSums(mat[c("Angra dos Reis (RJ)", "Areal (RJ)"), ], na.rm = TRUE))
row.names(mat)[nrow(mat)] = "X"
mat
#                         X1990 X1991 X1992 X1993
# Angra dos Reis (RJ)        11    10    10    10
# Aperibé (RJ)               NA    NA    NA    NA
# Araruama (RJ)           12040 14589 14231 14231
# Areal (RJ)                 NA    NA    NA     3
# Armação dos Búzios (RJ)    NA    NA    NA    NA
# X                          11    10    10    13

The result is a matrix; if needed, you can convert it back to a data frame:

dat_result = data.frame(MUN = row.names(mat), mat, row.names = NULL)

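I don't like this data layout as a data frame. I would either convert it to a matrix (as above) or convert it to long format, e.g. tidyr::gather(dat, key = year, value = value, -MUN), and then work with it "by group" using data.table or dplyr.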

Using this data:

dat = read.table(text = "             MUN          X1990  X1991  X1992 X1993
1     'Angra dos Reis (RJ)'    11    10    10    10
2            'Aperibé (RJ)'    NA    NA    NA    NA
3           'Araruama (RJ)'  12040 14589 14231 14231
4              'Areal (RJ)'    NA    NA    NA     3
5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header= T)
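
The long-format route mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: it uses tidyr::pivot_longer (the newer replacement for gather) rather than the gather call quoted above, the data frame is rebuilt inline so the block runs on its own, and the names long, targets and totals are made up for this example.

```r
library(dplyr)
library(tidyr)

# Same values as the sample table, rebuilt inline so the block is self-contained
dat <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA)
)

# One row per municipality/year pair
long <- pivot_longer(dat, cols = -MUN, names_to = "year", values_to = "value")

# Hypothetical selection: the municipalities to add up
targets <- c("Angra dos Reis (RJ)", "Areal (RJ)")

# Sum the selected municipalities within each year, ignoring NA
totals <- long %>%
  filter(MUN %in% targets) %>%
  group_by(year) %>%
  summarise(value = sum(value, na.rm = TRUE), .groups = "drop") %>%
  mutate(MUN = "X")

# Append the totals to the long table
long_result <- bind_rows(long, totals)
```

In long format the total is just a grouped sum, so adding more years or more municipalities does not change the code.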

Answer 2

One solution is to use the sqldf package. If the data frame is named df, you can do the following:

library(sqldf)
# UNION ALL keeps every original row (a plain UNION would also drop duplicate rows)
result <- sqldf("SELECT * FROM df UNION ALL
       SELECT 'X', SUM(X1990), SUM(X1991), SUM(X1992), SUM(X1993) FROM df
       WHERE MUN IN ('Angra dos Reis (RJ)', 'Areal (RJ)')")

Answer 3

Here is a dplyr solution:

library(dplyr)
df %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  summarize_if(is.numeric, sum, na.rm = TRUE) %>%
  as.list(.) %>%
  c(MUN = "X") %>%
  bind_rows(df, .)

Result:

                      MUN X1990 X1991 X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ) 12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6                       X    11    10    10    13

Data (from @Gregor, with stringsAsFactors = FALSE):

df = read.table(text = "             MUN          X1990  X1991  X1992 X1993
                 1     'Angra dos Reis (RJ)'    11    10    10    10
                 2            'Aperibé (RJ)'    NA    NA    NA    NA
                 3           'Araruama (RJ)'  12040 14589 14231 14231
                 4              'Areal (RJ)'    NA    NA    NA     3
                 5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header= T, stringsAsFactors = FALSE)

Answer 4

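I assume you want to sum the data of two municipalities whose names you know or specify, and then append their sum at the end of the table. I'm not sure this understanding is correct. You may need to clarify your question if the code below does not do what you need (for example, whether you need to sum several municipalities at a time, or only ever two at a time, and so on).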

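Also, if you have to call the function I propose many times, or if your table is very large, it may need to be made faster, e.g. by using the data.table package instead of base R (since you said you are a beginner, I stuck to base R).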

To meet your request to keep NA values, I used the code suggested by Joshua Ulrich in his answer to the question "rowSums but keeping NA values".

data <- data.frame(MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)", "Areal (RJ)", "Armação dos Búzios (RJ)")
               ,X1990 = c(11, NA, 12040, NA, NA)
               ,X1991 = c(10, NA, 14589, NA, NA)
               ,X1992 = c(10, NA, 14231, NA, NA)
               ,X1993 = c(10, NA, 14231, 3, NA)
)

sum_rows <- function(df, row1, row2) {

  #get the indices of the two rows to be summed
  #grep returns the position in a vector at which a certain element is stored
  #here the name of the municipality 
  index_row1 <-  grep(row1, df$MUN, fixed=T)
  index_row2 <-  grep(row2, df$MUN, fixed=T)

  #select the two rows of the data.frame that you want to sum
  #on basis of the entry in the MUN column
  #further only select the column with numbers for the sum operation
  #check if all entries in a single column are NA values
  #if yes then the output for this column is NA
  #if no calculate the column sum, if one entry is NA, ignore it
  sum <- ifelse(apply(is.na(df[c(index_row1, index_row2),2:ncol(df)]),2,all)
                      ,NA
                      ,colSums(df[c(index_row1, index_row2),2:ncol(df)],na.rm=TRUE)
               )

  #create a name entry for the new MUN column
  #paste0 is used to combine strings
  #in this case it might make sense to create a name 
  #that includes the indices of the rows that have been summed instead of only using X as name
  name <- paste0("Sum_R",index_row1,"_R" , index_row2)

  #add the row to the original data.frame
  df <-  cbind(MUN = c(as.character(df$MUN), name)
               ,rbind(df[, 2:ncol(df)], sum)
              )

  #return the data.frame from the function
  df

} 

#sum two rows and replace your data.frame by the new result
data <- sum_rows(data, "Angra dos Reis (RJ)", "Areal (RJ)")

data <- sum_rows(data, "Armação dos Búzios (RJ)", "Areal (RJ)")
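
If you need to sum more than two municipalities at once, the same idea can be generalized. The following is a minimal base R sketch, not a drop-in replacement: sum_named_rows, its arguments munis and label, and the census data frame are names invented for this example, and it assumes the name column is called MUN and every other column is numeric. It matches names exactly with %in% instead of grep.

```r
# Sample data with the same values as the table in the question
census <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum any number of named rows and append the total
sum_named_rows <- function(df, munis, label = "X") {
  value_cols <- setdiff(names(df), "MUN")

  # Exact name matching; requested names that are absent are simply not selected
  picked <- df[df$MUN %in% munis, value_cols, drop = FALSE]

  # Column sums that ignore NA ...
  totals <- colSums(picked, na.rm = TRUE)
  # ... but stay NA where every selected value in a column is missing
  totals[colSums(!is.na(picked)) == 0] <- NA

  # Build the new row and append it to the original table
  new_row <- data.frame(MUN = label, as.list(totals), stringsAsFactors = FALSE)
  rbind(df, new_row)
}

result <- sum_named_rows(census, c("Angra dos Reis (RJ)", "Areal (RJ)"))
# The appended row X holds 11, 10, 10, 13
```

Because the rows are selected by name, the order of the municipalities in the table does not matter.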

How do I sum rows by row?
