Suppose I have two data frames. The first is the well-known iris dataset:
> data(iris)
> print(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
.... (150 observations)
The second data frame contains the mean of each attribute, per species:
library(dplyr)
iris.means <- iris %>%
group_by(Species) %>%
summarize(Sepal.Length = mean(Sepal.Length),
Sepal.Width = mean(Sepal.Width),
Petal.Length = mean(Petal.Length),
Petal.Width = mean(Petal.Width))
> print(iris.means)
# A tibble: 3 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246
2 versicolor 5.94 2.77 4.26 1.33
3 virginica 6.59 2.97 5.55 2.03
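Is there an elegant way to (left outer) join the iris data frame to the means, so that each record is matched to the species whose mean attributes are nearest?
The obvious answer is a cross join, like so: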
iris <- iris %>% mutate(id = row_number())
iris.means <- iris %>%
group_by(Species) %>%
summarize(Sepal.Length.Mean = mean(Sepal.Length),
Sepal.Width.Mean = mean(Sepal.Width),
Petal.Length.Mean = mean(Petal.Length),
Petal.Width.Mean = mean(Petal.Width))
names(iris.means)[names(iris.means) == "Species"] <- "Species.Mean"
iris.crossjoin <- merge(iris, iris.means, all=TRUE)
The cross-joined dataset contains three records for each original iris record. We started with 150 records; we now have 450:
> arrange(iris.crossjoin, id)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species id Species.Mean Sepal.Length.Mean Sepal.Width.Mean Petal.Length.Mean Petal.Width.Mean
1 5.1 3.5 1.4 0.2 setosa 1 setosa 5.006 3.428 1.462 0.246
2 5.1 3.5 1.4 0.2 setosa 1 versicolor 5.936 2.770 4.260 1.326
3 5.1 3.5 1.4 0.2 setosa 1 virginica 6.588 2.974 5.552 2.026
4 4.9 3.0 1.4 0.2 setosa 2 setosa 5.006 3.428 1.462 0.246
5 4.9 3.0 1.4 0.2 setosa 2 versicolor 5.936 2.770 4.260 1.326
6 4.9 3.0 1.4 0.2 setosa 2 virginica 6.588 2.974 5.552 2.026
.... (450 observations)
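As a quick sanity check on the row counts, here is a minimal base R sketch of the same idea. It is not the code above: it builds the means with aggregate() instead of dplyr and passes by = NULL to merge() explicitly, and the names means and crossjoin are only illustrative.

```r
# A merge with no shared column names is a Cartesian product.
data(iris)

# Per-species means of the four numeric columns (base R, no dplyr needed).
means <- aggregate(. ~ Species, data = iris, FUN = mean)

# Suffix every column so nothing overlaps with the columns of iris.
names(means) <- paste0(names(means), ".Mean")

# by = NULL asks merge() for the Cartesian product explicitly.
crossjoin <- merge(iris, means, by = NULL)

nrow(iris)       # 150
nrow(means)      # 3
nrow(crossjoin)  # 450
```

The row count grows as nrow(iris) * nrow(means), which is why the approach becomes a concern once either side is large.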
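We compute the total distance between each record in the iris dataset and the means of each species. As @AEF pointed out in the comments below, this is the Manhattan distance: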
iris.crossjoin <- iris.crossjoin %>%
mutate(Sepal.Length.Delta = abs(Sepal.Length.Mean - Sepal.Length),
Sepal.Width.Delta = abs(Sepal.Width.Mean - Sepal.Width),
Petal.Length.Delta = abs(Petal.Length.Mean - Petal.Length),
Petal.Width.Delta = abs(Petal.Width.Mean - Petal.Width),
Delta.Sum = Sepal.Length.Delta + Sepal.Width.Delta + Petal.Length.Delta + Petal.Width.Delta)
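To make the Delta.Sum arithmetic concrete, here is a small worked example for a single pair: the first iris record against the setosa means. It is only a sketch in base R; the variable names x, m and d are illustrative and do not appear in the pipeline above.

```r
data(iris)

# First record: Sepal.Length 5.1, Sepal.Width 3.5, Petal.Length 1.4, Petal.Width 0.2
x <- unlist(iris[1, 1:4])

# Setosa means: 5.006, 3.428, 1.462, 0.246
m <- colMeans(iris[iris$Species == "setosa", 1:4])

# Manhattan distance: sum of the absolute per-attribute differences,
# 0.094 + 0.072 + 0.062 + 0.046
d <- sum(abs(m - x))
d  # approximately 0.274
```

This is the value Delta.Sum should hold for the row with id 1 and Species.Mean setosa.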
Then, for each id, we can keep only the cross-joined record whose species means are nearest:
iris.crossjoin <- iris.crossjoin %>% arrange(id, Delta.Sum) %>%
group_by(id) %>%
mutate(rank = rank(Delta.Sum, ties.method = "first"),
correct = Species == Species.Mean) %>%
filter(rank == 1)
The output looks like this:
> iris.crossjoin[c('id', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species', 'Species.Mean', 'correct')]
id Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species.Mean correct
1 1 5.1 3.5 1.4 0.2 setosa setosa TRUE
2 2 4.9 3.0 1.4 0.2 setosa setosa TRUE
...
52 52 6.4 3.2 4.5 1.5 versicolor versicolor TRUE
53 53 6.9 3.1 4.9 1.5 versicolor virginica FALSE
54 54 5.5 2.3 4.0 1.3 versicolor versicolor TRUE
...
The Species column is the actual value, and Species.Mean is the predicted value, based on the Manhattan distance to the nearest species means.
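One way to cross-check the actual versus predicted comparison is to tabulate the two against each other. The sketch below is an assumption-laden stand-in, not the pipeline above: to stay self-contained it recomputes the nearest species in base R from a 150 x 3 distance matrix, and the names X, centers, dists and predicted are illustrative. Ties, if any, are broken by which.min(), which may differ from the rank-based filter.

```r
data(iris)
X <- as.matrix(iris[, 1:4])

# One row of means per species (3 x 4 matrix); row names are the species.
centers <- apply(X, 2, function(col) tapply(col, iris$Species, mean))

# Manhattan distance from every record to every species mean (150 x 3).
dists <- sapply(rownames(centers), function(s) {
  rowSums(abs(sweep(X, 2, centers[s, ])))
})

# Nearest species per record; which.min() returns the first minimum.
predicted <- factor(colnames(dists)[apply(dists, 1, which.min)],
                    levels = levels(iris$Species))

# Rows are the actual species, columns are the predicted species.
table(actual = iris$Species, predicted = predicted)
```

Off-diagonal counts in the table correspond to rows where correct is FALSE, such as record 53 in the output above.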
Is there a better way? A cross join is fine for small datasets, but it looks like an anti-pattern at scale.