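I have a data frame that holds replicates from an experiment in separate columns. Each row is a sample, and columns a, b, and c are the replicates. How can I do the following in this data frame?
I would like new columns:
"max" - the highest value of a, b, c for each row
"min" - the lowest value of a, b, c for each row
"variation" - max/min for each row
Then I would like to omit whichever data point in a, b, or c is farthest from the other points, so that the remaining points have a variation < 10.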
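df <- data.frame(a = rnorm(10, 100, 20),
                 b = rnorm(10, 2000, 500),
                 c = rnorm(10, 50, 20))
# Restrict apply() to the replicate columns so that newly added
# columns (such as max) are not fed back into later calculations
reps <- c("a", "b", "c")
df$max <- apply(df[, reps], 1, max, na.rm = TRUE)
df$min <- apply(df[, reps], 1, min, na.rm = TRUE)
df$variation <- df$max / df$min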
(Also, how could the max and min be calculated using dplyr and the %>% notation?)
Answer 0 (score: 0)
Here is an example using dplyr pipes, mutate, and group_by. I reshaped the data into long format with tidyr's gather, and reshaped it back to wide format at the end with spread.
library(dplyr)
library(tidyr)
set.seed(100)
dtf_wide <- data.frame(a = rnorm(10, 100, 20),
                       b = rnorm(10, 2000, 500),
                       c = rnorm(10, 50, 20))
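As an aside, if only the per-row max, min, and variation are needed, they can probably be computed directly in wide format without reshaping. The following is a minimal sketch, assuming the replicate columns are named a, b, and c; the small data frame `wide` and the name `wide_summary` are hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical small data frame, used only for illustration
wide <- data.frame(a = c(90, 100),
                   b = c(2000, 2100),
                   c = c(40, 65))

# pmax() and pmin() are vectorised across their arguments,
# so they return one maximum / minimum per row
wide_summary <- wide %>%
  mutate(max = pmax(a, b, c, na.rm = TRUE),
         min = pmin(a, b, c, na.rm = TRUE),
         variation = max / min)

print(wide_summary)
```

This covers the new columns, but not the step of dropping the farthest replicate, which is easier to express in long format as done in the rest of this answer.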
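Reshape the data into long format. Group by id (the row number in the wide format), then compute the variation and the distance from the median.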
dtf <- dtf_wide %>%
    # Explicitly add an identification column (for the grouping)
    mutate(id = row_number()) %>%
    # Put data in tidy format, one observation per row
    gather(key, value, a:c) %>%
    arrange(id) %>%
    group_by(id) %>%
    mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE),
           median = median(value),
           distancefrommedian = abs(value - median),
           maxdistancefrommedian = max(distancefrommedian))
head(dtf)
# # A tibble: 6 x 7
# # Groups: id [2]
# id key value variation median distancefrommedian maxdistancefrommedian
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 a 89.95615 49.58856 89.95615 0.00000 1954.987
# 2 1 b 2044.94307 49.58856 89.95615 1954.98692 1954.987
# 3 1 c 41.23820 49.58856 89.95615 48.71795 1954.987
# 4 2 a 102.63062 31.37407 102.63062 0.00000 1945.507
# 5 2 b 2048.13723 31.37407 102.63062 1945.50661 1945.507
# 6 2 c 65.28121 31.37407 102.63062 37.34941 1945.507
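If the variation is greater than 10, remove the row whose value is farthest from the median (you can change this rule here to remove more rows if needed).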
dtf <- dtf %>%
    # For each id,
    # keep all lines where the variation is at most 10
    filter(variation <= 10 |
               # If the variation is greater than 10,
               # filter out the line whose value is farthest from the median
               (variation > 10 & distancefrommedian < maxdistancefrommedian)) %>%
    # Keep only the interesting variables
    select(id, key, value) %>%
    # Compute the variation again (just to check)
    mutate(variation = max(value, na.rm = TRUE) / min(value, na.rm = TRUE))
head(dtf)
# id key value variation
# <int> <chr> <dbl> <dbl>
# 1 1 a 89.95615 2.181379
# 2 1 c 41.23820 2.181379
# 3 2 a 102.63062 1.572131
# 4 2 c 65.28121 1.572131
# 5 3 a 98.42166 1.781735
# 6 3 c 55.23923 1.781735
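To make the rule concrete, the calculation for a single sample can be replayed by hand. This is a small sketch that uses the three values printed for id 1 in the first output of this answer; the variable names `value`, `dist`, and `kept` are only illustrative.

```r
# Values for id 1, copied from the printed output of head(dtf)
value <- c(a = 89.95615, b = 2044.94307, c = 41.23820)

# Variation before filtering: about 49.6, so above the threshold of 10
variation <- max(value) / min(value)

# Distance of each replicate from the median of the three
dist <- abs(value - median(value))

# Drop the replicate that lies farthest from the median (here b)
kept <- value[dist < max(dist)]

# Variation of the remaining replicates: about 2.18, now below 10
print(max(kept) / min(kept))
```

Note that with three replicates this rule removes exactly one value per sample, so it does not guarantee a variation below 10 if the two remaining values are still far apart.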
Reshape the data to get a wide-format table similar to the original data frame.
dtf_wide2 <- dtf %>%
    spread(key, value)
head(dtf_wide2)
# id variation a c
# <int> <dbl> <dbl> <dbl>
# 1 1 4.385692 89.95615 41.23820
# 2 2 4.385692 102.63062 65.28121
# 3 3 4.385692 98.42166 55.23923
# 4 4 4.385692 117.73570 65.46809
# 5 5 4.385692 102.33943 33.71242
# 6 6 4.385692 106.37260 41.23099
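In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The same round trip could be sketched as follows, assuming a recent tidyr; the small data frame `wide` and the names `long` and `wide_again` are hypothetical and only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical small data frame, used only for illustration
wide <- data.frame(a = c(90, 100),
                   b = c(2000, 2100),
                   c = c(40, 65))

# Wide to long: one row per (id, replicate) pair
long <- wide %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = a:c, names_to = "key", values_to = "value")

# Long back to wide: one column per replicate again
wide_again <- long %>%
  pivot_wider(names_from = key, values_from = value)

print(wide_again)
```

The grouped mutate and filter steps from this answer should slot in between the two reshaping calls unchanged, since they only rely on the id, key, and value columns.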