I'm fairly new to R, and I'm working on a census study for a university project. To illustrate, here is part of my data.frame:
MUN X1990 X1991 X1992 X1993
1 Angra dos Reis (RJ) 11 10 10 10
2 Aperibé (RJ) NA NA NA NA
3 Araruama (RJ) 12040 14589 14231 14231
4 Areal (RJ) NA NA NA 3
5 Armação dos Búzios (RJ) NA NA NA NA
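My problem is that I need to sum certain municipality rows whose names I know or will specify (I don't know in what order they will appear, or whether they will appear in all of my tables), and the result should be shown as a new row.
For example, I want to add the row "Areal" to the row "Angra dos Reis" and store the result in a newly created row (let's call the result row X). So the result should be: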
MUN X1990 X1991 X1992 X1993
1 Angra dos Reis (RJ) 11 10 10 10
2 Aperibé (RJ) NA NA NA NA
3 Araruama (RJ) 12040 14589 14231 14231
4 Areal (RJ) NA NA NA 3
5 Armação dos Búzios (RJ) NA NA NA NA
6 X 11 10 10 13
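I tried to build this with a for loop and an if statement, but I couldn't get it to work.
Answer 0 (score: 4)
This is very similar to Jaap's comment, but spelled out in more detail and using row names explicitly: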
mat = as.matrix(dat[, 2:5])
row.names(mat) = dat$MUN
mat = rbind(mat, colSums(mat[c("Angra dos Reis (RJ)", "Areal (RJ)"), ], na.rm = TRUE))
row.names(mat)[nrow(mat)] = "X"
mat
# X1990 X1991 X1992 X1993
# Angra dos Reis (RJ) 11 10 10 10
# Aperibé (RJ) NA NA NA NA
# Araruama (RJ) 12040 14589 14231 14231
# Areal (RJ) NA NA NA 3
# Armação dos Búzios (RJ) NA NA NA NA
# X 11 10 10 13
The result is a matrix; if needed, you can convert it back to a data frame:
dat_result = data.frame(MUN = row.names(mat), mat, row.names = NULL)
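I don't like keeping data in this layout as a data frame. I would either convert it to a matrix (as above) or convert it to long format, e.g. tidyr::gather(dat, key = year, value = value, -MUN), and then work with it "by group" using data.table or dplyr.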
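The long-format route mentioned above could look roughly like this. It is only a sketch, assuming tidyr and dplyr are installed; the small dat is a reduced copy of the example data, and the names long and totals are illustrative.

```r
library(tidyr)
library(dplyr)

# Reduced copy of the example data (three municipalities, two years)
dat <- data.frame(
  MUN   = c("Angra dos Reis (RJ)", "Araruama (RJ)", "Areal (RJ)"),
  X1990 = c(11, 12040, NA),
  X1993 = c(10, 14231, 3),
  stringsAsFactors = FALSE
)

# Long format: one row per municipality-year pair
long <- gather(dat, key = "year", value = "value", -MUN)

# Sum the selected municipalities within each year, ignoring NA
totals <- long %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  group_by(year) %>%
  summarise(value = sum(value, na.rm = TRUE))

totals
```

In this shape the years are data rather than column names, so the same code works no matter how many year columns a table has.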
Using this data:
dat = read.table(text = " MUN X1990 X1991 X1992 X1993
1 'Angra dos Reis (RJ)' 11 10 10 10
2 'Aperibé (RJ)' NA NA NA NA
3 'Araruama (RJ)' 12040 14589 14231 14231
4 'Areal (RJ)' NA NA NA 3
5 'Armação dos Búzios (RJ)' NA NA NA NA", header= T)
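The question notes that a municipality may not appear in every table. Indexing a matrix by a row name that does not exist raises a "subscript out of bounds" error, so one option is to keep only the names that are actually present. Below is a minimal base R sketch of that idea; the helper add_sum_row and its arguments are hypothetical and not part of the original answer.

```r
# Hypothetical helper: append a row that sums whichever of the requested
# municipalities are present in the table (assumes MUN is the first column).
add_sum_row <- function(dat, wanted, label = "X") {
  mat <- as.matrix(dat[, -1])
  row.names(mat) <- as.character(dat$MUN)

  # Keep only the requested names that exist in this table
  present <- intersect(wanted, row.names(mat))

  # drop = FALSE keeps a matrix even when one or zero names match
  total <- colSums(mat[present, , drop = FALSE], na.rm = TRUE)

  mat <- rbind(mat, total)
  row.names(mat)[nrow(mat)] <- label
  data.frame(MUN = row.names(mat), mat, row.names = NULL)
}

dat <- data.frame(
  MUN   = c("Angra dos Reis (RJ)", "Araruama (RJ)", "Areal (RJ)"),
  X1990 = c(11, 12040, NA),
  X1993 = c(10, 14231, 3)
)

# "Missing (RJ)" is not in the table and is silently skipped
res <- add_sum_row(dat, c("Angra dos Reis (RJ)", "Areal (RJ)", "Missing (RJ)"))
res
```

Note that with na.rm = TRUE a column where every selected value is NA sums to 0, not NA.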
Answer 1 (score: 2)
One solution is to use the sqldf package. If the data frame is named df, you can do the following:
library(sqldf)
result <- sqldf("SELECT * FROM df UNION
SELECT 'X', SUM(X1990), SUM(X1991), SUM(X1992), SUM(X1993) FROM df
WHERE MUN IN ('Angra dos Reis (RJ)', 'Areal (RJ)')")
Answer 2 (score: 2)
Here is a dplyr solution:
library(dplyr)
df %>%
filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
summarize_if(is.numeric, sum, na.rm = TRUE) %>%
as.list(.) %>%
c(MUN = "X") %>%
bind_rows(df, .)
Result:
MUN X1990 X1991 X1992 X1993
1 Angra dos Reis (RJ) 11 10 10 10
2 Aperibé (RJ) NA NA NA NA
3 Araruama (RJ) 12040 14589 14231 14231
4 Areal (RJ) NA NA NA 3
5 Armação dos Búzios (RJ) NA NA NA NA
6 X 11 10 10 13
Data (from @Gregor, with stringsAsFactors = FALSE):
df = read.table(text = " MUN X1990 X1991 X1992 X1993
1 'Angra dos Reis (RJ)' 11 10 10 10
2 'Aperibé (RJ)' NA NA NA NA
3 'Araruama (RJ)' 12040 14589 14231 14231
4 'Areal (RJ)' NA NA NA 3
5 'Armação dos Búzios (RJ)' NA NA NA NA", header= T, stringsAsFactors = FALSE)
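In dplyr 1.0.0 and later, summarize_if() is superseded in favour of across(). A possible equivalent of the pipeline above is sketched here, assuming a recent dplyr; the reduced df and the intermediate name total are illustrative.

```r
library(dplyr)

# Reduced copy of the example data
df <- data.frame(
  MUN   = c("Angra dos Reis (RJ)", "Araruama (RJ)", "Areal (RJ)"),
  X1990 = c(11, 12040, NA),
  X1993 = c(10, 14231, 3),
  stringsAsFactors = FALSE
)

# Sum every numeric column over the selected rows, then label the new row
total <- df %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE))) %>%
  mutate(MUN = "X")

# bind_rows() matches columns by name, so the column order of total is irrelevant
result <- bind_rows(df, total)
result
```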
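Answer 3 (score: 0)
I assume you want to sum the data of two municipalities whose names you know or specify, and then append their sum to the end of the table. I'm not sure this understanding is correct. You may need to restate your question if the code below does not do what you want (for example, if you need to sum more than two municipalities at a time, or only ever two at once, and so on).
Also, if you have to call the function I propose many times, or your tables are very large, it may need to be made faster, for example by using the data.table package instead of base R (since you said you are a beginner, I stuck to base R).
To meet your request to keep NA values, I used the code Joshua Ulrich suggested in his answer to the question "rowSums but keeping NA values".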
data <- data.frame(MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)", "Areal (RJ)", "Armação dos Búzios (RJ)")
,X1990 = c(11, NA, 12040, NA, NA)
,X1991 = c(10, NA, 14589, NA, NA)
,X1992 = c(10, NA, 14231, NA, NA)
,X1993 = c(10, NA, 14231, 3, NA)
)
sum_rows <- function(df, row1, row2) {
#get the indices of the two rows to be summed
#grep returns the position in a vector at which a certain element is stored
#here the name of the municipality
index_row1 <- grep(row1, df$MUN, fixed=TRUE)
index_row2 <- grep(row2, df$MUN, fixed=TRUE)
#select the two rows of the data.frame that you want to sum
#on basis of the entry in the MUN column
#further only select the column with numbers for the sum operation
#check if all entries in a single column are NA values
#if yes then the output for this column is NA
#if no calculate the column sum, if one entry is NA, ignore it
sum <- ifelse(apply(is.na(df[c(index_row1, index_row2),2:ncol(df)]),2,all)
,NA
,colSums(df[c(index_row1, index_row2),2:ncol(df)],na.rm=TRUE)
)
#create a name entry for the new MUN column
#paste0 is used to combine strings
#in this case it might make sense to create a name
#that includes the indices of the rows that have been summed instead of only using X as name
name <- paste0("Sum_R",index_row1,"_R" , index_row2)
#add the row to the original data.frame
df <- cbind(MUN = c(as.character(df$MUN), name)
,rbind(df[, 2:ncol(df)], sum)
)
#return the data.frame from the function
df
}
#sum two rows and replace your data.frame by the new result
data <- sum_rows(data, "Angra dos Reis (RJ)", "Areal (RJ)")
data <- sum_rows(data, "Armação dos Búzios (RJ)", "Areal (RJ)")
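If more than two municipalities need to be summed at once, the same "NA only when every value is NA" rule can be applied to any number of names. The following base R sketch shows one way to do that; sum_keep_na and sum_named_rows are hypothetical names, and the code assumes MUN is the first column.

```r
# NA if every value is NA, otherwise the sum of the non-NA values
sum_keep_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# Hypothetical generalisation: sum any number of named rows and append the result
sum_named_rows <- function(df, names, label = "X") {
  # Rows whose MUN is one of the requested names, numeric columns only
  rows <- df[df$MUN %in% names, -1, drop = FALSE]
  total <- lapply(rows, sum_keep_na)
  rbind(df, data.frame(MUN = label, total))
}

example <- data.frame(
  MUN   = c("Angra dos Reis (RJ)", "Araruama (RJ)", "Areal (RJ)"),
  X1990 = c(NA, 12040, NA),
  X1993 = c(10, 14231, 3)
)

res <- sum_named_rows(example, c("Angra dos Reis (RJ)", "Areal (RJ)"))
res
```

Matching with %in% compares whole names, whereas grep(..., fixed = TRUE) matches substrings, so the two approaches can differ when one municipality name is contained in another.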