Question

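I'm using R. I am scaling my raw data, removing all outliers with a Z-score of 3 or higher, and then filtering the unscaled data so that it contains only the non-outliers. After removing the outliers, I want to be left with a data frame containing the unscaled numbers. These are my steps: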

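Steps
1. Create two data frames of the same data (x, y).
2. Scale x and leave y unscaled.
3. Filter out all rows in x with a Z-score above 3.
4. At this point, for example, x might have 95,000 rows while y still has 100,000.
5. Truncate y based on a unique column called Row ID, which I made sure was not scaled in x. This unique column lets me match the remaining rows in x to the rows in y.
6. y should now have the same number of rows as x, but with unscaled data. x contains the scaled data.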

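At the moment I cannot get back to the unscaled data. I tried using unscale methods or data frame comparison tools, but R complains that I cannot operate on two data frames of different sizes. Is there a workaround?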

Attempt
I have tried dataFrame <- dataFrame[dataFrame$Row %in% remainingRows], but that left nothing in my data frame.

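I would provide data as well, but it contains sensitive information, so any data frame will do, as long as it has a unique row ID that does not change during scaling.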

Answer 1

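If I understand what you are trying to do, I would suggest a different approach. You can work with two data.frames, but if you use the dplyr package you can do everything in a single line of code... and it will probably be faster.

First, I generate a data.frame with 100k rows, with an ID column (just the sequence 1:100000) and a value column (random numbers).

Here is the code: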

library(dplyr)

#generate data
x <- data.frame(ID=1:100000,value=runif(100000,max=100)*runif(10000,max=100))

#take a look

> head(x)
  ID      value
1  1  853.67941
2  2  632.17472
3  3 3089.60716
4  4 8448.89408
5  5 5307.75684
6  6   19.07485

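To filter out the outliers I use a dplyr pipeline, chaining several operations with the pipe operator (%>%). First the Z-score is computed, then filter drops the observations with a Z-score of 3 or more, and finally the zscore column is removed again to restore the original format (it could of course be kept):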

xclean <- x %>% mutate(zscore=(value-mean(value)) / sd(value)) %>%
 filter(zscore < 3) %>% select(-matches('zscore'))
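
Note that the condition zscore < 3 only drops the upper tail, because large negative Z-scores still pass it. If outliers on both sides should be removed (an assumption about the intent, since the question only says "a Z-score of 3 or higher"), a minimal sketch of a two-sided variant compares the absolute value instead. The toy data and the names toy and toy_clean are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data: 1000 standard-normal values plus two planted outliers
set.seed(1)
toy <- data.frame(ID = 1:1002, value = c(rnorm(1000), 50, -50))

# Keep rows whose absolute Z-score is below 3, then drop the helper column
toy_clean <- toy %>%
  mutate(zscore = (value - mean(value)) / sd(value)) %>%
  filter(abs(zscore) < 3) %>%
  select(-zscore)

# IDs that were filtered out; the remaining values are still unscaled
setdiff(toy$ID, toy_clean$ID)
```

With zscore < 3 in place of abs(zscore) < 3, the planted value of -50 would survive the filter.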

If you look at the row counts, you can see that the filtering worked:

> cat('Rows of X:',nrow(x),'- Rows of xclean:',nrow(xclean))
Rows of X: 100000 - Rows of xclean: 99575

while the data still looks like the original data.frame:

> head(xclean)
  ID      value
1  1  853.67941
2  2  632.17472
3  3 3089.60716
4  4 8448.89408
5  5 5307.75684
6  6   19.07485

Finally, you can see that observations have been filtered out by comparing the IDs of the two data.frames:

> head(x$ID[!is.element(x$ID,xclean$ID)],50)
 [1]    68    90   327   467   750   957  1090  1584  1978  2106  2306  3415  3511  3801  3855  4051
[17]  4148  4244  4266  4511  4875  5262  5633  5944  5975  6116  6263  6631  6734  6773  7320  7577
[33]  7619  7731  7735  7889  8073  8141  8207  8966  9200  9369  9994 10123 10538 11046 11090 11183
[49] 11348 11371

EDIT:

Of course, the two data frame version is also possible:

y <- x

# calculate zscore
x$value <- (x$value - mean(x$value))/sd(x$value)

#subset y
y <- y[x$value<3,]

# initially 100k rows
> nrow(y)
[1] 99623

EDIT2:

For multiple value columns:

#generate data
set.seed(21)
x <- data.frame(ID=1:100000,value1=runif(100000,max=100)*runif(10000,max=100),
                value2=runif(100000,max=100)*runif(10000,max=100),
                value3=runif(100000,max=100)*runif(10000,max=100))

> head(x)
  ID    value1     value2      value3
1  1 2103.9228 5861.33650  713.885222
2  2  341.8342 3940.68674  578.072141
3  3 5346.2175  458.07089    1.577347
4  4  400.1950 5881.05129 3090.618355
5  5 7346.3321 4890.56501 8989.248186
6  6 5305.5105   38.93093  517.509465

Solution with dplyr:

# make sure you got a recent version of dplyr
> packageVersion('dplyr')
[1] ‘0.7.2’

# define zscore function:
zscore <- function(x){(x-mean(x))/sd(x)}

# select variables (could also be manually with c())
vars_to_process <- grep('value',colnames(x),value=T)

# calculate zscores and filter
xclean <- x %>% mutate_at(.vars=vars_to_process, .funs=funs(ZS = zscore(.))) %>%
  filter_at(vars(matches('ZS')),all_vars(.<3)) %>%
  select(-matches('ZS'))

> nrow(xclean)
[1] 98832

Now a solution without dplyr (instead of using 2 data frames, I will generate a boolean index):
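
The code for this last variant is not shown here. A minimal base-R sketch of the idea, assuming an outlier is any row with an absolute Z-score of 3 or more in at least one value column, could look like the following. The toy data frame df and the names value_cols, keep, df_clean, x_scaled and y_clean are illustrative and not taken from the original answer. The second half ties back to the attempt in the question: matching by ID with %in% works once the row index is followed by a comma.

```r
# Hypothetical toy data: an ID column and two value columns, with two planted outliers
set.seed(42)
df <- data.frame(ID = 1:1000, value1 = rnorm(1000), value2 = rnorm(1000))
df$value1[10] <- 100
df$value2[20] <- -100

zscore <- function(v) (v - mean(v)) / sd(v)
value_cols <- grep("value", colnames(df), value = TRUE)

# Z-scores for every value column, kept separate from the unscaled data
z <- as.data.frame(lapply(df[value_cols], zscore))

# Boolean index: TRUE where no value column has |Z| >= 3
keep <- rowSums(abs(z) >= 3) == 0

# Subset the unscaled data directly; no second data frame is needed
df_clean <- df[keep, ]

# Two-data-frame variant: filter the scaled copy, then match the surviving
# IDs back to the unscaled data. The trailing comma matters:
# df[rows, ] selects rows, whereas df[rows] selects columns.
x_scaled <- df
x_scaled[value_cols] <- z
x_scaled <- x_scaled[keep, ]
y_clean <- df[df$ID %in% x_scaled$ID, ]

# Both routes keep the same unscaled rows
identical(df_clean, y_clean)
```

The boolean index avoids the size mismatch entirely, because the scaled values are only used to decide which rows to keep and are never compared against the unscaled frame.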

How to remove data from an unscaled data frame against a scaled data frame
