Question

I have the following problem and hope someone can help.

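I have a very large data frame in R (close to 20 million observations) with about 43 columns. In four of those columns, I need to find whether the minimum value, when it is less than or equal to 200, occurs more than once. If a row has the same minimum value satisfying this condition in more than one column, I need to flag that row as TRUE (in a new flag column). Note that these columns contain NA values, and NA should not be used (return NA when there is an NA in the columns being compared).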

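The goal is to look at the values in columns a1 to a4 for each row and determine whether the minimum value, if it does not exceed 200, appears in more than one column of that row.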

For simplicity, this is what my data looks like:

head(mydata)
t1  a1  a2  a3  a4 
34  NA  NA  NA  NA
26  10  15  250 150
34  20  20  100 30 
35  5   5   10  5  
25  45  100 3   45
31 400 310 500 310 

In other words, for each row the flag should be TRUE if the minimum of a1 to a4 does not exceed 200 and appears in more than one column, and FALSE otherwise.

The expected result would look like this:

head(mydata)
t1  a1  a2  a3  a4  flag
34  NA  NA  NA  NA  NA
26  10  15  250 150 FALSE
34  20  20  100 30  TRUE
35  5   5   10  5   TRUE
25  45  100 3   45  FALSE
31 400 310 500 310  FALSE

Thanks.

Answer 1

Here is a base R way:

#Get the column indices where a1, a2, a3 and a4 are there
inds <- match(paste0("a", 1:4), names(df))

#Get row-wise minimum
min_val <- do.call(pmin, df[inds])

#Check whether the minimum value occurs more than once in the row
# and whether the minimum value is less than 200.
df$flag <- rowSums(df[inds] == min_val) > 1 & min_val < 200

df
#  t1  a1  a2  a3  a4  flag
#1 34  NA  NA  NA  NA    NA
#2 26  10  15 250 150 FALSE
#3 34  20  20 100  30  TRUE
#4 35   5   5  10   5  TRUE
#5 25  45 100   3  45 FALSE
#6 31 400 310 500 310 FALSE
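
The question asks for a minimum "less than or equal to 200", whereas the code above tests `< 200`. Below is a minimal sketch of the same approach with an inclusive threshold, wrapped in a helper that selects the columns by name. The names `flag_equal_min`, `cols` and `limit` are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper (name and arguments are illustrative assumptions)
flag_equal_min <- function(data, cols, limit = 200) {
  # Row-wise minimum; pmin() returns NA for any row that contains an NA
  min_val <- do.call(pmin, unname(as.list(data[cols])))
  # TRUE when the minimum occurs in more than one column and is within the limit;
  # rows with an NA in the compared columns come out as NA
  rowSums(data[cols] == min_val) > 1 & min_val <= limit
}

df <- read.table(text = "t1 a1 a2 a3 a4
34 NA NA NA NA
26 10 15 250 150
34 20 20 100 30
35 5 5 10 5
25 45 100 3 45
31 400 310 500 310", header = TRUE)

df$flag <- flag_equal_min(df, paste0("a", 1:4))
```

On the sample data both thresholds give the same flags; they differ only for rows whose repeated minimum is exactly 200.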

Answer 2

Does this help?

mydata$flag <- apply(mydata[paste0("a", 1:4)], 1, function(x) {  # iterate through rows of a1..a4 only
    x <- na.omit(x)            # omit NAs in a row (optional)
    tab <- table(x[x < 200])   # count each row value below 200 (sorted ascending)
    # table() sorts its entries, so the first entry is the row minimum:
    # check that it exists and that it occurs more than once
    length(tab) > 0 && tab[[1]] > 1
})

You can choose whether or not to keep the NA values.
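
The question asks for NA whenever one of the compared columns is NA. Below is a minimal sketch of a row-wise variant that does this explicitly and only looks at a1 to a4. The helper name `flag_row` and its `limit` argument are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper: flag one row, given as a numeric vector of a1..a4
flag_row <- function(x, limit = 200) {
  if (anyNA(x)) return(NA)        # propagate NA instead of dropping it
  m <- min(x)                     # row minimum
  sum(x == m) > 1 && m < limit    # repeated minimum below the limit
                                  # (use <= for an inclusive bound)
}

mydata <- data.frame(
  t1 = c(34, 26, 34, 35, 25, 31),
  a1 = c(NA, 10, 20, 5, 45, 400),
  a2 = c(NA, 15, 20, 5, 100, 310),
  a3 = c(NA, 250, 100, 10, 3, 500),
  a4 = c(NA, 150, 30, 5, 45, 310)
)

cols <- paste0("a", 1:4)
mydata$flag <- apply(mydata[cols], 1, flag_row)
```

On roughly 20 million rows a row-wise `apply` is likely to be much slower than the vectorised answers, so this is mainly useful for checking results on a sample.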

Answer 3

Here's a purrr solution. I create the data frame.

# Define data frame
df <- read.table(text = " t1  a1  a2  a3  a4 
                  34  NA  NA  NA  NA
                  26  10  15  250 150
                  34  20  20  100 30 
                  35  5   5   10  5  
                  25  45  100 3   45
                  31 400 310 500 310 ", header = TRUE)

Next, I load the library.

# Load library
library(purrr)

Then, I create the flag, running through each row using pmap_lgl, which returns a logical vector. This line checks whether the minimum value occurs more than once and whether the minimum is below 200. The first column is omitted from the calculation. If a row contains NA values, the flag for that row will be NA.

# Create flag
df$flag <- pmap_lgl(df, function(...)(sum(c(...)[-1] == min(c(...)[-1])) > 1) & min(c(...)[-1]) < 200)

This gives the following:

# Examine result
df
#>   t1  a1  a2  a3  a4  flag
#> 1 34  NA  NA  NA  NA    NA
#> 2 26  10  15 250 150 FALSE
#> 3 34  20  20 100  30  TRUE
#> 4 35   5   5  10   5  TRUE
#> 5 25  45 100   3  45 FALSE
#> 6 31 400 310 500 310 FALSE

Created on 2019-05-31 by the reprex package (v0.3.0)

Answer 4

One possibility involving dplyr and purrr:

library(dplyr)
library(purrr)

df %>%
 mutate(flag = exec(pmin, !!!.[-1]),
        flag = rowSums(.[-1] == flag) > 1 & flag < 200)

  t1  a1  a2  a3  a4  flag
1 34  NA  NA  NA  NA    NA
2 26  10  15 250 150 FALSE
3 34  20  20 100  30  TRUE
4 35   5   5  10   5  TRUE
5 25  45 100   3  45 FALSE
6 31 400 310 500 310 FALSE

Here it checks whether the row-wise minimum occurs more than once and whether that minimum is below 200.

Answer 5

If the dataset is large, the following may be fast. It uses the function rowMins from the package matrixStats. See this answer.

icol <- grepl("^a", names(mydata))
min_row <- matrixStats::rowMins(as.matrix(mydata[icol]))

mydata$flag <- rowSums(mydata[icol] == min_row) > 1 & min_row < 200
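
As a sanity check, the sketch below compares the matrix-based minimum with the `pmin` approach from the first answer on made-up random data. The simulated data frame `sim` and its size are illustrative assumptions, and the matrixStats comparison is skipped if the package is not installed.

```r
set.seed(1)
n <- 1000

# Simulated stand-in for the real data (values and size are assumptions)
sim <- data.frame(
  t1 = sample(20:40, n, replace = TRUE),
  a1 = sample(c(NA, 1:500), n, replace = TRUE),
  a2 = sample(c(NA, 1:500), n, replace = TRUE),
  a3 = sample(c(NA, 1:500), n, replace = TRUE),
  a4 = sample(c(NA, 1:500), n, replace = TRUE)
)

icol <- grepl("^a", names(sim))
m <- as.matrix(sim[icol])   # convert once and reuse the matrix below

# Base R row minimum; NA for rows that contain an NA
min_base <- do.call(pmin, unname(as.list(sim[icol])))
flag_base <- rowSums(m == min_base) > 1 & min_base < 200

if (requireNamespace("matrixStats", quietly = TRUE)) {
  # rowMins() keeps NA by default (na.rm = FALSE)
  min_ms <- matrixStats::rowMins(m)
  flag_ms <- rowSums(m == min_ms) > 1 & min_ms < 200
  stopifnot(identical(unname(flag_ms), unname(flag_base)))
}
```

Converting the columns to a matrix once and reusing it for the comparison avoids repeated data frame overhead, which may matter at the scale described in the question.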

How to find whether there are two or more equal minimum values in specific columns