Question

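I have a data frame in which the replicates from an experiment are stored in separate columns. Each row of the data frame is a sample, and columns a, b and c are the replicates. I would like to: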

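Determine the variation between replicates (the difference between the highest and the lowest value in each row) and put it in a new column called "variation".
If the variation is greater than 10, omit the replicate that is furthest away.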

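How can this be done in this data frame? I would like the following new columns:
"max": the highest value of a, b, c in each row
"min": the lowest value of a, b, c in each row
"variation": max/min for each row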

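Then I would like to omit the data point in a, b or c that is furthest from the others, so that the remaining points have a variation < 10.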

df <- data.frame(a = rnorm(10, 100, 20),
                 b = rnorm(10, 2000, 500),
                 c = rnorm(10, 50, 20))
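# Compute the row statistics on the replicate columns only,
# so that the new "max" column is not fed back into the "min" calculation
replicates <- c("a", "b", "c")
df$max <- apply(df[, replicates], 1, max, na.rm = TRUE)
df$min <- apply(df[, replicates], 1, min, na.rm = TRUE)
df$variation <- df$max / df$min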

(Also, how can the maximum and minimum be computed with dplyr and the %>% notation?)

Answer 1

An example using dplyr pipes, mutate and group_by. I used tidyr gather to reshape the data into long format, and spread to reshape it back into wide format at the end.

library(dplyr)
library(tidyr)
set.seed(100)
dtf_wide <- data.frame(a = rnorm(10, 100, 20),
                 b = rnorm(10, 2000, 500),
                 c = rnorm(10, 50, 20))

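Reshape the data into long format. Group by id (the row number in the wide format), then compute the variation and the distance from the median.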

dtf <- dtf_wide %>% 
    # Explicitly add an identification column (for the grouping)
    mutate(id = row_number()) %>% 
    # put data in tidy format, one observation per row
    gather(key, value, a:c) %>% 
    arrange(id) %>% 
    group_by(id) %>%
    mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE),
           median = median(value, na.rm = TRUE),
           distancefrommedian = abs(value-median),
           maxdistancefrommedian = max(distancefrommedian, na.rm = TRUE)) 
head(dtf)

# # A tibble: 6 x 7
# # Groups:   id [2]
#      id   key      value variation    median distancefrommedian maxdistancefrommedian
#   <int> <chr>      <dbl>     <dbl>     <dbl>              <dbl>                 <dbl>
# 1     1     a   89.95615  49.58856  89.95615            0.00000              1954.987
# 2     1     b 2044.94307  49.58856  89.95615         1954.98692              1954.987
# 3     1     c   41.23820  49.58856  89.95615           48.71795              1954.987
# 4     2     a  102.63062  31.37407 102.63062            0.00000              1945.507
# 5     2     b 2048.13723  31.37407 102.63062         1945.50661              1945.507
# 6     2     c   65.28121  31.37407 102.63062           37.34941              1945.507
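
In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The following is a minimal sketch of the same reshaping step with the newer functions, assuming such a tidyr version is installed. The names `dtf_long` and `dtf_back` are made up for this illustration and are not part of the answer.

```r
library(dplyr)
library(tidyr)

set.seed(100)
dtf_wide <- data.frame(a = rnorm(10, 100, 20),
                       b = rnorm(10, 2000, 500),
                       c = rnorm(10, 50, 20))

# Long format: one row per (id, key) pair, same layout as gather(key, value, a:c)
dtf_long <- dtf_wide %>%
    mutate(id = row_number()) %>%
    pivot_longer(cols = a:c, names_to = "key", values_to = "value")

# Back to wide format: one column per key, same layout as spread(key, value)
dtf_back <- dtf_long %>%
    pivot_wider(names_from = key, values_from = value)

head(dtf_long)
head(dtf_back)
```

pivot_longer returns the rows already ordered by id, so the arrange(id) step is not needed in this variant.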

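If the variation is greater than 10, remove the row whose value is furthest from the median (you can change this rule here if you need to remove more rows).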

dtf <- dtf %>% 
    # For each id, 
    # keep all rows where variation is at most 10
    filter(variation <= 10 |
               # If variation is greater than 10,
               # filter out the row whose value is furthest from the median 
               (variation > 10 & distancefrommedian < maxdistancefrommedian)) %>% 
    # Keep only interesting variables
    select(id, key, value) %>% 
    # Compute the variations again (just to check)
    mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE)) 
head(dtf)
#      id   key     value variation
#   <int> <chr>     <dbl>     <dbl>
# 1     1     a  89.95615  2.181379
# 2     1     c  41.23820  2.181379
# 3     2     a 102.63062  1.572131
# 4     2     c  65.28121  1.572131
# 5     3     a  98.42166  1.781735
# 6     3     c  55.23923  1.781735
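
The removal rule can also be checked on one sample at a time. Below is a minimal base R sketch that applies the same rule to a single vector of replicates. The helper `drop_farthest` and its `threshold` argument are hypothetical names introduced for this illustration only.

```r
# Hypothetical helper: applies the rule from the filter() call to one sample.
drop_farthest <- function(x, threshold = 10) {
    variation <- max(x, na.rm = TRUE) / min(x, na.rm = TRUE)
    # Variation within the threshold: keep every replicate
    if (variation <= threshold) {
        return(x)
    }
    distance <- abs(x - median(x, na.rm = TRUE))
    # Keep only the values strictly closer to the median than the farthest one
    x[distance < max(distance, na.rm = TRUE)]
}

drop_farthest(c(a = 90, b = 2045, c = 41))   # b is dropped, a and c remain
drop_farthest(c(a = 100, b = 105, c = 98))   # variation below 10, nothing is dropped
drop_farthest(c(1, 50, 99))                  # tie: both extremes are dropped
```

The last call shows an edge case of the rule: when two replicates are equally far from the median, the strict comparison removes both of them and only the median value is left.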

Reshape the data to get a wide-format table similar to the original data frame.

dtf_wide2 <- dtf %>% 
    spread(key, value) 
head(dtf_wide2)
#      id variation         a        c
#   <int>     <dbl>     <dbl>    <dbl>
# 1     1  4.385692  89.95615 41.23820
# 2     2  4.385692 102.63062 65.28121
# 3     3  4.385692  98.42166 55.23923
# 4     4  4.385692 117.73570 65.46809
# 5     5  4.385692 102.33943 33.71242
# 6     6  4.385692 106.37260 41.23099
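
For the side question about computing the maximum and minimum with dplyr and %>%, the data can also stay in wide format. A minimal sketch, assuming the three replicate columns are named a, b and c as in the question, and using the base R functions pmax and pmin, which work element by element across vectors. The name `df_stats` is made up for this illustration.

```r
library(dplyr)

set.seed(100)
df <- data.frame(a = rnorm(10, 100, 20),
                 b = rnorm(10, 2000, 500),
                 c = rnorm(10, 50, 20))

df_stats <- df %>%
    # pmax()/pmin() return the row-wise maximum/minimum of the three columns
    mutate(max = pmax(a, b, c, na.rm = TRUE),
           min = pmin(a, b, c, na.rm = TRUE),
           # max and min here refer to the columns created just above
           variation = max / min)
head(df_stats)
```

This variant needs no reshaping, but every replicate column has to be listed by name, so the long-format approach may scale better when there are many replicates.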
