I am trying to create a new variable in R (3.3.2) by checking whether the levels of a factor are the same across multiple columns.
id<-c(1:5)
X1<-c("species1", "species1", NA, "species1", "species1")
X2<-c(NA, "species2", NA, "species2", "species2")
X3<-c("species1", "species2", "species2", "species3", "species3")
The result should look like this, checking whether X1:X3 are all the same (ignoring NAs):
id X1 X2 X3 same
[1,] 1 "species1" NA "species1" TRUE
[2,] 2 "species1" "species2" "species2" FALSE
[3,] 3 NA NA "species2" TRUE
[4,] 4 "species1" "species2" "species3" FALSE
[5,] 5 "species1" "species2" "species3" FALSE
Edit: here is my actual data, along with the code I used from @Mike's answer:
s$same <- apply(s[,c(2:11)], 1, function(x) length(unique((x[!is.na(x)]))) == 1)
dput(droplevels(head(s)))
structure(list(rowid = structure(c(5L, 6L, 4L, 3L, 2L, 1L), .Label = c("-68975029755346725",
"-6985608891139937154", "-7064257681237955764", "-716653329714258929",
"-7190954401213249258", "-7190954401427629087"), class = "factor"),
species1 = structure(c(3L, NA, 3L, 1L, 2L, NA), .Label = c("Mycobacterium avium complex",
"Mycobacterium fortuitum", "Mycobacterium kansasii"), class = "factor"),
species2 = structure(c(NA, NA, 4L, 2L, 3L, 1L), .Label = c(" Mycobacterium fortuitum",
"Mycobacterium avium complex", "Mycobacterium fortuitum",
"Mycobacterium kansasii"), class = "factor"), species3 = structure(c(4L,
NA, 3L, 1L, 2L, NA), .Label = c(" Mycobacterium avium complex",
" Mycobacterium fortuitum", " Mycobacterium kansasii", "Mycobacterium kansasii"
), class = "factor"), species4 = structure(c(NA, NA, NA,
NA, NA, 1L), .Label = " Mycobacterium fortuitum", class = "factor"),
species5 = structure(c(1L, NA, NA, NA, NA, NA), .Label = "Mycobacterium kansasii", class = "factor"),
species6 = structure(c(NA, NA, NA, NA, NA, 1L), .Label = " Mycobacterium fortuitum", class = "factor"),
species7 = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
species8 = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
species9 = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
species10 = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
same = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)), .Names = c("rowid",
"species1", "species2", "species3", "species4", "species5", "species6",
"species7", "species8", "species9", "species10", "same"), row.names = c(NA,
6L), class = "data.frame")
Rows 1 and 6 are correct, but in this set every row should be TRUE.
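I have tried duplicated and unique with every combination of apply, ifelse, all, and identical that I can think of, but either I can't use na.rm with the function or I get a matrix as output instead of a new variable. There seem to be plenty of questions about doing this with numeric variables, but I am having trouble finding what I need for factor or string variables. Thanks in advance for your help!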
Answer 0 (score: 3)
How about using length and unique to check that there is only one unique value?
df <- data.frame(id = id, X1 = X1, X2 = X2, X3 = X3)
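# TRUE when a row has exactly one distinct non-NA value (after trimming
# whitespace), or when every value in the row is NA
df$same <- apply(df[,c("X1","X2","X3")], 1, function(x)
  length(unique(trimws(x[!is.na(x)]))) == 1 || length(unique(trimws(x))) == 1)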
df
# id X1 X2 X3 same
# 1 1 species1 <NA> species1 TRUE
# 2 2 species1 species2 species2 FALSE
# 3 3 <NA> <NA> species2 TRUE
# 4 4 species1 species2 species3 FALSE
# 5 5 species1 species2 species3 FALSE
I added trimws() to remove leading/trailing whitespace, plus a condition for the case where every value in a row is NA.
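The two special cases mentioned in this answer, stray whitespace and rows that are entirely NA, can be checked in isolation with a small helper. This is only a sketch: all_same and s_demo are hypothetical names, and the three-row data frame is made up for illustration (it imitates the leading-space factor levels visible in the dput output, but it is not the asker's data). It assumes that a row with no non-NA values should count as TRUE.

```r
# Hypothetical helper (not part of the original answer): TRUE when a row has
# at most one distinct value after dropping NAs and trimming whitespace.
all_same <- function(x) {
  vals <- unique(trimws(as.character(x[!is.na(x)])))
  length(vals) <= 1  # length 0 means every value in the row was NA
}

# Illustrative data only: note the leading space in some of the labels.
s_demo <- data.frame(
  species1 = c("Mycobacterium kansasii", NA, "Mycobacterium kansasii"),
  species2 = c(NA, NA, " Mycobacterium fortuitum"),
  species3 = c(" Mycobacterium kansasii", NA, "Mycobacterium kansasii"),
  stringsAsFactors = TRUE
)

# apply() converts the factor columns to a character matrix, so each row
# reaches all_same() as a character vector.
s_demo$same <- apply(s_demo[, c("species1", "species2", "species3")], 1, all_same)
s_demo$same
```

Under these assumptions, row 1 is TRUE only because of the trimming, row 2 is TRUE because it is entirely NA, and row 3 is FALSE because two different species remain after trimming. Without trimws(), row 1 would be FALSE, which matches the symptom described in the question's edit.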