Question

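I am trying to average sediment grain-size measurements in R. I usually take 8 measurements per sample, but sometimes the device malfunctions or a measurement is unreliable (for example, because of a mechanical problem after 3 measurements). Such a measurement looks completely different from the rest of the same sample, so I thought an outlier test could remove it automatically.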

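I tried to follow this dplyr-based code:

https://www.r-bloggers.com/combined-outlier-detection-with-dplyr-and-ruler/

but it seems to be a problem that not all groups have the same length. I also found this method for vectors:

https://stackoverflow.com/a/4788102/8321705

but I don't know how to group the data first and then apply it to every column within each group.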

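Here is the head of my data; some samples have only 3 replicate measurements. The first 3 numeric columns describe the particle fractions in one way, the last 6 columns in another: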

#my data with unequal group length
test <- structure(list(Sample_name = c("Sediment1140", "Sediment1140", 
"Sediment1140", "Sediment1140", "Sediment1140", "Sediment1140", 
"Sediment1140", "Sediment1140", "Sediment1141", "Sediment1141", 
"Sediment1141", "Sediment1141", "Sediment1141", "Sediment1141", 
"Sediment1141", "Sediment1141", "Sediment1142", "Sediment1142", 
"Sediment1142", "Sediment1142"), Dx_10_percent = c(228.3413627, 
232.9637155, 236.4058197, 235.4124387, 238.2260309, 238.983854, 
237.0509773, 234.22402, 245.5622443, 247.1072046, 248.7302949, 
247.7311716, 253.6328878, 249.7883614, 245.8217667, 247.047291, 
183.5981354, 186.4353531, 184.4024079, 183.9496282), Dx_50_percent = c(464.4559559, 
470.4392019, 479.0066087, 474.75933, 478.1515348, 481.8823096, 
480.3117339, 476.2827332, 442.6699831, 443.3890093, 442.4344575, 
435.0531805, 443.4543899, 447.494161, 434.7639443, 433.3472111, 
336.6085081, 340.9353695, 336.0106474, 340.0936298), Dx_90_percent = c(854.6392436, 
856.8504381, 879.5524457, 880.468129, 858.3297603, 905.0097741, 
879.5146896, 873.8584305, 819.4818726, 816.5296812, 778.9013718, 
766.7617116, 770.5702479, 829.0866972, 766.083991, 751.9915196, 
656.4105245, 664.7034131, 698.6157344, 718.225128), Microm_001 = c(0.797059348, 
0.801571015, 0.734207569, 0.841152063, 0.834553976, 0.75429789, 
0.831636299, 1.000633239, 0.713401217, 0.74354612, 0.741372753, 
0.841747801, 0.727424775, 0.755804532, 1.163420288, 1.081749441, 
0.27579194, 0.483475909, 0.555629788, 0.697398689), Microm_63 = c(0.90472056, 
0.944959738, 0.94659555, 0.903114644, 0.96501726, 0.91079578, 
0.954569594, 0.987258593, 0.000822487, 0.000571334, 0.000442701, 
0.000297749, 0, 0.000259136, 0.000891928, 0.000769421, 0.923900573, 
0.793744127, 0.809342888, 0.839719189), Microm_125 = c(11.30751247, 
10.58007103, 10.18149105, 10.27507954, 9.833963719, 9.901098909, 
9.983293752, 10.11735892, 10.0938523, 9.776308321, 9.483238809, 
9.57495155, 8.647488515, 9.280930949, 9.601526818, 9.458248991, 
26.14106339, 25.04179051, 25.88123946, 25.37747955), Microm_250 = c(42.63011079, 
42.40400703, 41.42517791, 41.94288617, 41.87864039, 41.23829484, 
41.31751454, 41.63495869, 49.13692679, 49.3879579, 50.23445206, 
51.60666791, 51.1727376, 49.11877239, 51.26674279, 52.00409162, 
50.48687475, 50.90777981, 50.23482884, 49.24764135), Microm_500 = c(39.88402392, 
40.77838926, 41.50718101, 40.65764848, 42.15199068, 40.83119895, 
41.73250567, 41.18885416, 35.87934642, 36.00915655, 37.3180857, 
35.62608032, 37.41951083, 36.26828346, 35.6623117, 35.71836133, 
20.18297131, 20.1836753, 17.33407897, 19.10041196), Microm_1000 = c(4.476572917, 
4.49100193, 5.205346918, 5.380119103, 4.335833973, 6.364313631, 
5.180480148, 5.070936398, 4.175650784, 4.082459775, 2.222407984, 
2.350254673, 2.032838283, 4.575949528, 2.305106476, 1.736779196, 
1.98939803, 2.589534343, 5.184880055, 4.737349268)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 19L, 20L, 21L, 22L), class = "data.frame")

My pseudocode looks like this:

#define function from SO answer
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# using dplyr
out_test <- test %>%
  group_by(Sample_name) %>%
  apply(2, remove_outliers)

# using base R by
out_test2 <- by(test, test$Sample_name, remove_outliers)

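How can I flag/detect, or directly remove, rows that differ markedly from the replicate rows of the same sample?

Oh, and a bonus question: from a statistical point of view, are 8 samples enough to identify an outlier? In my case these are extreme cases caused by failed measurements, but I would not want anyone else to be misled.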

Answer 1

Use mutate_at after excluding the grouping variable.
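A minimal sketch of how a grouped mutate_at call could look, assuming the remove_outliers function from the question. The toy data frame, its single value column, and the names toy, out_toy and means_toy are hypothetical and only keep the example self-contained; the same pipeline should apply to test with its nine numeric columns. Groups of different lengths are not a problem here, because the function is applied to each column within each group separately.

```r
library(dplyr)

# Function from the question: values beyond 1.5 * IQR from the quartiles become NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical toy data: group A has 8 replicates (one obviously bad), group B only 3
toy <- data.frame(
  Sample_name = c(rep("A", 8), rep("B", 3)),
  value = c(10, 11, 10, 12, 11, 10, 11, 50, 20, 21, 22)
)

# Apply the function per group to every non-grouping column
out_toy <- toy %>%
  group_by(Sample_name) %>%
  mutate_at(vars(-group_cols()), remove_outliers) %>%
  ungroup()

# Average the remaining replicates per sample, skipping the flagged NAs
means_toy <- out_toy %>%
  group_by(Sample_name) %>%
  summarise_all(mean, na.rm = TRUE)
```

Flagged values are set to NA rather than dropped, so the rows keep their positions and a failed measurement can still be inspected. To drop whole rows instead, one option is to filter out_toy with complete.cases(out_toy).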
