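I have this problem and hope someone can help.
I have a very large data frame in R (close to 20 million observations) with about 43 columns. In four of those columns I need to find the row-wise minimum and check whether it is less than or equal to 200; if more than one of the four columns in a row holds that same minimum, the row should be flagged TRUE in a new flag column. Note that these columns contain NA values, and NA should not be treated as a value
(when any of the compared columns is NA, the result should be NA).
In other words, for each row I want to know whether the minimum of a1 to a4 is at most 200 and occurs in more than one of those columns.
For simplicity, this is what my data looks like: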
head(mydata)
t1 a1 a2 a3 a4
34 NA NA NA NA
26 10 15 250 150
34 20 20 100 30
35 5 5 10 5
25 45 100 3 45
31 400 310 500 310
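The flag should be TRUE when the row minimum across a1 to a4 is at most 200 and appears in more than one of those columns, and FALSE otherwise.
The expected result would look like this: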
head(mydata)
t1 a1 a2 a3 a4 flag
34 NA NA NA NA NA
26 10 15 250 150 FALSE
34 20 20 100 30 TRUE
35 5 5 10 5 TRUE
25 45 100 3 45 FALSE
31 400 310 500 310 FALSE
Thanks.
Answer 0: (score: 3)
Here is a base R approach:
# Get the indices of columns a1, a2, a3 and a4
inds <- match(paste0("a", 1:4), names(df))
# Get the row-wise minimum (NA if any of the four values is NA)
min_val <- do.call(pmin, df[inds])
# Check that the minimum occurs more than once
# and that it is at most 200, as the question asks
df$flag <- rowSums(df[inds] == min_val) > 1 & min_val <= 200
df
# t1 a1 a2 a3 a4 flag
#1 34 NA NA NA NA NA
#2 26 10 15 250 150 FALSE
#3 34 20 20 100 30 TRUE
#4 35 5 5 10 5 TRUE
#5 25 45 100 3 45 FALSE
#6 31 400 310 500 310 FALSE
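To see why a row containing NA ends up flagged NA, it can help to look at the intermediate values. The following is a minimal sketch that rebuilds the question's sample data (the names cols, n_at_min and flag are introduced here for illustration only) and uses the "at most 200" threshold from the question.

```r
# Sample data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  t1 = c(34, 26, 34, 35, 25, 31),
  a1 = c(NA, 10, 20, 5, 45, 400),
  a2 = c(NA, 15, 20, 5, 100, 310),
  a3 = c(NA, 250, 100, 10, 3, 500),
  a4 = c(NA, 150, 30, 5, 45, 310)
)
cols <- paste0("a", 1:4)

# pmin() without na.rm returns NA as soon as one value in the row is NA
min_val <- do.call(pmin, df[cols])
min_val

# Comparing against an NA minimum gives NA, and rowSums() keeps it as NA
n_at_min <- rowSums(df[cols] == min_val)
n_at_min

# NA > 1 is NA and NA & NA is NA, so the flag for such a row is NA
flag <- n_at_min > 1 & min_val <= 200
flag
```

Printing min_val and n_at_min next to the data makes it easy to check each row by hand before running the same steps on the full data frame.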
Answer 1: (score: 2)
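Does this help?
cols <- paste0("a", 1:4)                             # columns to compare
mydata$flag <- apply(mydata[cols], 1, function(x) {  # iterate through rows
  if (anyNA(x)) return(NA)     # NA in a compared column gives NA (optional)
  m <- min(x)                  # row-wise minimum
  sum(x == m) > 1 && m <= 200  # minimum repeated and at most 200
})
Whether NA values are included is up to you: to ignore them, replace the anyNA() line with x <- na.omit(x), in which case rows that are entirely NA need separate handling.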
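If NA values should be ignored rather than propagated, a vectorised variant might look like the sketch below. The data frame x is a small hypothetical example (not the question's data), and the choice to keep NA for rows that are entirely missing is an assumption about the desired behaviour.

```r
# Hypothetical example with partially missing rows (not the question's data)
x <- data.frame(
  a1 = c(NA, 10, NA, 5),
  a2 = c(NA, 10, 20, NA),
  a3 = c(NA, NA, 20, 7),
  a4 = c(NA, 300, 30, 9)
)

# Row-wise minimum over the non-missing values; all-NA rows stay NA
min_val <- do.call(pmin, c(x, na.rm = TRUE))

# Count the non-missing values equal to the minimum
n_at_min <- rowSums(x == min_val, na.rm = TRUE)

# Minimum must be repeated and at most 200
flag <- n_at_min > 1 & min_val <= 200

# FALSE & NA is FALSE, so restore NA for rows with no usable values
flag[is.na(min_val)] <- NA
flag
```

The last assignment matters because the count for an all-NA row is 0, which would otherwise turn the flag into FALSE instead of NA.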
Answer 2: (score: 0)
Here's a purrr solution. First, I create the data frame.
# Define data frame
df <- read.table(text = " t1 a1 a2 a3 a4
34 NA NA NA NA
26 10 15 250 150
34 20 20 100 30
35 5 5 10 5
25 45 100 3 45
31 400 310 500 310 ", header = TRUE)
Next, I load the library.
# Load library
library(purrr)
Then, I create the flag, running through each row using pmap_lgl, which returns a logical vector. This line checks whether the minimum value occurs more than once and whether that minimum is at most 200. The first column is omitted from the calculation. If there are NA values in a row, an NA will be created.
# Create flag
df$flag <- pmap_lgl(df, function(...) (sum(c(...)[-1] == min(c(...)[-1])) > 1) & min(c(...)[-1]) <= 200)
This gives the following:
# Examine result
df
#> t1 a1 a2 a3 a4 flag
#> 1 34 NA NA NA NA NA
#> 2 26 10 15 250 150 FALSE
#> 3 34 20 20 100 30 TRUE
#> 4 35 5 5 10 5 TRUE
#> 5 25 45 100 3 45 FALSE
#> 6 31 400 310 500 310 FALSE
Created on 2019-05-31 by the reprex package (v0.3.0)
Answer 3: (score: 0)
One possibility involving dplyr and purrr:
library(dplyr)
library(purrr)
df %>%
 mutate(flag = exec(pmin, !!!.[-1]),
        flag = rowSums(.[-1] == flag) > 1 & flag <= 200)
t1 a1 a2 a3 a4 flag
1 34 NA NA NA NA NA
2 26 10 15 250 150 FALSE
3 34 20 20 100 30 TRUE
4 35 5 5 10 5 TRUE
5 25 45 100 3 45 FALSE
6 31 400 310 500 310 FALSE
Here it checks whether the row-wise minimum occurs more than once and whether that minimum is at most 200.
Answer 4: (score: 0)
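If the dataset is large, the following may be fast. It uses the rowMins function from the matrixStats package. See this answer.
# Select only the four columns to compare
icol <- names(mydata) %in% paste0("a", 1:4)
# Row-wise minimum (NA when a row contains NA)
min_row <- matrixStats::rowMins(as.matrix(mydata[icol]))
# The minimum must occur more than once and be at most 200
mydata$flag <- rowSums(mydata[icol] == min_row) > 1 & min_row <= 200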
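For readers who would rather avoid an extra dependency, the same idea can be sketched in base R and wrapped in a function so that the columns and the threshold are explicit. The helper name flag_min_ties and its limit argument are hypothetical, and the sketch assumes a plain data.frame with numeric columns; it has not been timed against matrixStats.

```r
# flag_min_ties() is a hypothetical helper, not part of any package
flag_min_ties <- function(data, cols, limit = 200) {
  m <- as.matrix(data[cols])            # convert the compared columns once
  min_val <- do.call(pmin, data[cols])  # vectorised row-wise minimum, NA propagates
  # m == min_val recycles min_val down each column, i.e. row by row
  unname(rowSums(m == min_val) > 1 & min_val <= limit)
}

# Usage on the question's sample data
mydata <- data.frame(
  t1 = c(34, 26, 34, 35, 25, 31),
  a1 = c(NA, 10, 20, 5, 45, 400),
  a2 = c(NA, 15, 20, 5, 100, 310),
  a3 = c(NA, 250, 100, 10, 3, 500),
  a4 = c(NA, 150, 30, 5, 45, 310)
)
mydata$flag <- flag_min_ties(mydata, paste0("a", 1:4))
mydata
```

Passing the column names explicitly avoids picking up unrelated columns, which matters when the real data has around 43 columns.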