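I want to spread a dataset using a few selected columns, where there is no unique identifier for the rows. To illustrate, I am using the publicly available iris dataset.
I first tried dropping the columns I don't need and keeping only the unique rows, and then applied spread to the result.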
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
spread(Species, Sepal.Length)
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
spread(key=Species, value=Sepal.Length)
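But it gave the following duplicate identifiers error:
Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36), (37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)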
Using row_number(), I created a unique identifier to use when spreading the data, so as to avoid the duplicate identifiers error.
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length)
This gave the following output:
# row setosa versicolor virginica
# 1 1 5.1 NA NA
# 2 2 4.9 NA NA
# 3 3 4.7 NA NA
# ...
# 16 16 NA 7.0 NA
# 17 17 NA 6.4 NA
# 18 18 NA 6.9 NA
# ...
# 37 37 NA NA 6.3
# 38 38 NA NA 5.8
# 39 39 NA NA 7.1
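However, because of the row number there are many unwanted NAs. I tried to exclude the row column so that the values come out as expected, but that did not work.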
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length, -row)
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length, -one_of(row))
Expected output:
tmp <- iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
mutate(row = row_number()) %>% spread(Species, Sepal.Length)
cbind(setosa=unique(tmp$setosa), versicolor=unique(tmp$versicolor), virginica=unique(tmp$virginica))
# setosa versicolor virginica
# [1,] 5.1 7.0 6.3
# [2,] 4.9 6.4 5.8
# [3,] 4.7 6.9 7.1
# [4,] 4.6 5.5 6.5
# [5,] 5.0 6.5 7.6
# [6,] 5.4 5.7 4.9
# [7,] 4.4 6.3 7.3
# [8,] 4.8 4.9 6.7
# [9,] 4.3 6.6 7.2
# [10,] 5.8 5.2 6.4
# [11,] 5.7 5.0 6.8
# [12,] 5.2 5.9 5.7
# [13,] 5.5 6.0 7.7
# [14,] 4.5 6.1 6.0
# [15,] 5.3 5.6 6.9
# [16,] 5.1 6.7 5.6
# [17,] 4.9 5.8 6.2
# [18,] 4.7 6.2 6.1
# [19,] 4.6 6.8 7.4
# [20,] 5.0 5.4 7.9
# [21,] 5.4 5.1 5.9
Answer 0 (score: 1)
library(dplyr)
library(tidyr)
tbl_df(iris) %>%
select(Species, Sepal.Length) %>% # select columns of interest
group_by(Species) %>% # for each value
mutate(id = row_number()) %>% # create a row identifier
spread(Species, Sepal.Length) # reshape dataset
# # A tibble: 50 x 4
# id setosa versicolor virginica
# * <int> <dbl> <dbl> <dbl>
# 1 1 5.1 7.0 6.3
# 2 2 4.9 6.4 5.8
# 3 3 4.7 6.9 7.1
# 4 4 4.6 5.5 6.3
# 5 5 5.0 6.5 6.5
# 6 6 5.4 5.7 7.6
# 7 7 4.6 6.3 4.9
# 8 8 5.0 4.9 7.3
# 9 9 4.4 6.6 6.7
# 10 10 4.9 5.2 7.2
# # ... with 40 more rows
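If the helper id column is not wanted in the final result, it can be dropped once the reshape is done. The following is a minimal sketch along the same lines as the code above; the name wide is only illustrative, and the explicit ungroup() is a precaution added here rather than something the code above requires.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  select(Species, Sepal.Length) %>%   # select columns of interest
  group_by(Species) %>%               # for each species
  mutate(id = row_number()) %>%       # row identifier within the species
  ungroup() %>%                       # drop grouping before reshaping
  spread(Species, Sepal.Length) %>%   # reshape dataset
  select(-id)                         # remove the helper identifier

head(wide)
```

Since each species has 50 rows in iris, the result has 50 rows and no NA padding.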
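Pay special attention to how you create and use the row identifier. The code above relies only on the current order of the dataset. If you reorder the data in some way, you will get different combinations of values in each row. Check the following code: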
tbl_df(iris) %>%
arrange(desc(Sepal.Length)) %>% # order your values descending
select(Species, Sepal.Length) %>% # select columns of interest
group_by(Species) %>% # for each value
mutate(id = row_number()) %>% # create a row identifier
spread(Species, Sepal.Length) # reshape dataset
# # A tibble: 50 x 4
# id setosa versicolor virginica
# * <int> <dbl> <dbl> <dbl>
# 1 1 5.8 7.0 7.9
# 2 2 5.7 6.9 7.7
# 3 3 5.7 6.8 7.7
# 4 4 5.5 6.7 7.7
# 5 5 5.5 6.7 7.7
# 6 6 5.4 6.7 7.6
# 7 7 5.4 6.6 7.4
# 8 8 5.4 6.6 7.3
# 9 9 5.4 6.5 7.2
# 10 10 5.4 6.4 7.2
# # ... with 40 more rows
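The arrange(desc(Sepal.Length)) step, which is the only difference from the previous code, makes sure that the higher values end up in the top rows (descending order).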
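On newer versions of tidyr (1.0.0 and later), spread() is superseded by pivot_wider(), and the same row-identifier idea carries over. The sketch below also keeps only the unique values per species with distinct(), as the question does with unique(); that this is the desired result is an assumption, and wide_unique is just an illustrative name.

```r
library(dplyr)
library(tidyr)

wide_unique <- iris %>%
  distinct(Species, Sepal.Length) %>%   # unique values per species
  group_by(Species) %>%                 # for each species
  mutate(id = row_number()) %>%         # row identifier within the species
  ungroup() %>%
  pivot_wider(names_from = Species,     # one column per species
              values_from = Sepal.Length) %>%
  select(-id)                           # remove the helper identifier

head(wide_unique)
```

Because the species do not have the same number of unique values, the shorter columns are padded with NA at the bottom instead of being recycled.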