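How do I sum specific rows, selected by name?

Date: 2017-10-23 18:10:14

Tags: r dataframe sum row geography

I'm fairly new to R, and I'm working on a census study for a university project. For illustration, here is part of my data.frame: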
             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA

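My problem is that I need to sum certain municipality rows whose names I know or will specify (I don't know in which order they will appear, or whether they will appear in all of my tables), and the result should be shown as a row.

For example, I want to add the row "Areal" to the row "Angra dos Reis" and store the result in a newly created row (let's call the result row X). So the result should be: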

             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6          X                 11    10    10    13

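I tried writing a for loop with an if statement, but I couldn't get it to work.

4 answers:

Answer 0: (score: 4)

This is very similar to Jaap's comment, but more spelled out and using the row names explicitly: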

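# Keep only the year columns and use the municipality names as row names
mat = as.matrix(dat[, 2:5])
row.names(mat) = dat$MUN
# Sum the two named rows column by column, ignoring NA, and append the result
mat = rbind(mat, colSums(mat[c("Angra dos Reis (RJ)", "Areal (RJ)"), ], na.rm = TRUE))
row.names(mat)[nrow(mat)] = "X"
mat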
#                         X1990 X1991 X1992 X1993
# Angra dos Reis (RJ)        11    10    10    10
# Aperibé (RJ)               NA    NA    NA    NA
# Araruama (RJ)           12040 14589 14231 14231
# Areal (RJ)                 NA    NA    NA     3
# Armação dos Búzios (RJ)    NA    NA    NA    NA
# X                          11    10    10    13

The result is a matrix; if needed, you can convert it back to a data frame:

dat_result = data.frame(MUN = row.names(mat), mat, row.names = NULL)
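
The question also mentions that a municipality may be missing from some tables. As a rough sketch of how the same idea could cope with that, the helper below matches names exactly with %in% and simply skips any that are absent. The function name add_sum_row, its arguments and the name "Not in table (RJ)" are made up for illustration and are not part of this answer:

```r
# Hypothetical helper (an assumption, not from the answer): append one row
# holding the column sums of the named municipalities present in `dat`.
add_sum_row <- function(dat, wanted, label = "X") {
  num_cols <- vapply(dat, is.numeric, logical(1))  # the year columns
  present  <- dat$MUN %in% wanted                  # exact, order-independent match
  if (!any(present)) return(dat)                   # none of the names in this table
  total   <- colSums(dat[present, num_cols, drop = FALSE], na.rm = TRUE)
  new_row <- data.frame(MUN = label, as.list(total), stringsAsFactors = FALSE)
  rbind(dat, new_row)
}

dat <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)

# "Not in table (RJ)" is deliberately absent and is ignored
res <- add_sum_row(dat, c("Angra dos Reis (RJ)", "Areal (RJ)", "Not in table (RJ)"))
res
```

Note that with na.rm = TRUE a column in which every selected value is NA sums to 0 rather than NA, so this sketch is only appropriate if that is acceptable.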

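I don't like keeping data in this format as a data frame. I would either convert it to a matrix (as above) or convert it to long format, e.g. with tidyr::gather(dat, key = year, value = value, -MUN), and work with it "by group" using data.table or dplyr.

Using this data: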

dat = read.table(text = "             MUN          X1990  X1991  X1992 X1993
1     'Angra dos Reis (RJ)'    11    10    10    10
2            'Aperibé (RJ)'    NA    NA    NA    NA
3           'Araruama (RJ)'  12040 14589 14231 14231
4              'Areal (RJ)'    NA    NA    NA     3
5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header= T)
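
The long-format route mentioned above might look roughly like the following. This is only a sketch under the assumption that tidyr and dplyr are installed; the object names long and totals are made up for illustration, and the data frame is rebuilt here so the block runs on its own:

```r
library(tidyr)
library(dplyr)

dat <- data.frame(
  MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)",
          "Areal (RJ)", "Armação dos Búzios (RJ)"),
  X1990 = c(11, NA, 12040, NA, NA),
  X1991 = c(10, NA, 14589, NA, NA),
  X1992 = c(10, NA, 14231, NA, NA),
  X1993 = c(10, NA, 14231, 3, NA),
  stringsAsFactors = FALSE
)

# One row per municipality and year instead of one column per year
long <- gather(dat, key = year, value = value, -MUN)

# Sum the selected municipalities within each year, ignoring NA
totals <- long %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  group_by(year) %>%
  summarise(value = sum(value, na.rm = TRUE))
totals
```

In long format, the set of municipalities to sum is just a filter condition, so it does not matter in which order the rows appear or how many year columns the table has.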

Answer 1: (score: 2)

One solution is to use the sqldf package. If the data frame is named df, you can do the following:

library(sqldf)
result <- sqldf("SELECT * FROM df UNION 
       SELECT 'X', SUM(X1990), SUM(X1991), SUM(X1992), SUM(X1993) FROM df
       WHERE MUN IN ('Angra dos Reis (RJ)', 'Areal (RJ)')")

Answer 2: (score: 2)

Here is a dplyr solution:

library(dplyr)
df %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  summarize_if(is.numeric, sum, na.rm = TRUE) %>%
  as.list(.) %>%
  c(MUN = "X") %>%
  bind_rows(df, .)

Result:

                      MUN X1990 X1991 X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ) 12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6                       X    11    10    10    13

Data (from @Gregor, with stringsAsFactors = FALSE):

df = read.table(text = "             MUN          X1990  X1991  X1992 X1993
                 1     'Angra dos Reis (RJ)'    11    10    10    10
                 2            'Aperibé (RJ)'    NA    NA    NA    NA
                 3           'Araruama (RJ)'  12040 14589 14231 14231
                 4              'Areal (RJ)'    NA    NA    NA     3
                 5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header= T, stringsAsFactors = FALSE)

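Answer 3: (score: 0)

I assume you want to sum the data of two municipalities whose names you know or specify, and then append their sum to the end of the table. I'm not sure whether this understanding is correct. You may need to clarify your question if the code below doesn't do what you need (e.g. if you need to sum more than two municipalities at a time rather than always exactly two, and so on).

Also, if you have to call the function I propose many times, or if your tables are very large, you may need to improve its speed, e.g. by using the data.table package instead of base R (since you said you are a beginner, I stuck to base R).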

To meet your requirement of keeping NA values, I used the code suggested by Joshua Ulrich in his answer to the question "rowSums but keeping NA values".
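
To see in isolation why the extra NA check matters, here is a small sketch using only the 1992 and 1993 values of "Aperibé (RJ)" and "Areal (RJ)" from the question. The object names block, plain, all_na and kept are made up for illustration:

```r
# 1992 and 1993 values of "Aperibé (RJ)" and "Areal (RJ)"
block <- data.frame(
  X1992 = c(NA_real_, NA_real_),
  X1993 = c(NA_real_, 3)
)

# Plain column sums with na.rm = TRUE turn an all-NA column into 0
plain <- colSums(block, na.rm = TRUE)

# Keep NA where every entry of a column is NA, otherwise use the sum
all_na <- apply(is.na(block), 2, all)
kept   <- ifelse(all_na, NA, plain)

plain
kept
```

Here plain reports 0 for X1992 even though no value was ever recorded, while kept leaves that column as NA and still returns 3 for X1993.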

data <- data.frame(MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)", "Areal (RJ)", "Armação dos Búzios (RJ)")
               ,X1990 = c(11, NA, 12040, NA, NA)
               ,X1991 = c(10, NA, 14589, NA, NA)
               ,X1992 = c(10, NA, 14231, NA, NA)
               ,X1993 = c(10, NA, 14231, 3, NA)
)

sum_rows <- function(df, row1, row2) {

  #get the indices of the two rows to be summed
  #grep(..., fixed = TRUE) returns the positions of the elements of a vector
  #that contain the given string, here the name of the municipality
  index_row1 <-  grep(row1, df$MUN, fixed = TRUE)
  index_row2 <-  grep(row2, df$MUN, fixed = TRUE)

  #select the two rows of the data.frame that you want to sum
  #on the basis of the entry in the MUN column
  #and keep only the numeric columns for the sum operation
  rows <- df[c(index_row1, index_row2), 2:ncol(df)]

  #check if all entries in a single column are NA values
  #if yes, the output for this column is NA
  #if no, calculate the column sum, ignoring any single NA entry
  col_sums <- ifelse(apply(is.na(rows), 2, all)
                     ,NA
                     ,colSums(rows, na.rm = TRUE)
               )

  #create a name entry for the new MUN column
  #paste0 is used to combine strings
  #in this case it might make sense to create a name
  #that includes the indices of the rows that have been summed instead of only using X as name
  name <- paste0("Sum_R", index_row1, "_R", index_row2)

  #add the row to the original data.frame
  df <-  cbind(MUN = c(as.character(df$MUN), name)
               ,rbind(df[, 2:ncol(df)], col_sums)
              )

  #return the data.frame from the function
  df

} 

#sum two rows and replace your data.frame by the new result
data <- sum_rows(data, "Angra dos Reis (RJ)", "Areal (RJ)")

data <- sum_rows(data, "Armação dos Búzios (RJ)", "Areal (RJ)")