Spread a dataset on selected columns without any identifiers

Time: 2017-11-17 12:48:52

Tags: r dplyr tidyr

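I want to spread a dataset using a few selected columns, where there is no unique identifier to identify the rows. To illustrate, I am using the publicly available iris dataset.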

I first tried removing the unwanted columns and keeping only the unique rows, and then applied spread to the result.

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  spread(Species, Sepal.Length)
iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  spread(key=Species, value=Sepal.Length)

But it gives the following duplicate identifiers error:

  

Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36), (37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)

So I created a unique identifier with row_number() to use when spreading the data, which avoids the duplicate identifiers error.

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length)

This gives the following output:

#    row setosa versicolor virginica
# 1    1    5.1         NA        NA
# 2    2    4.9         NA        NA
# 3    3    4.7         NA        NA
# ...
# 16  16     NA        7.0        NA
# 17  17     NA        6.4        NA
# 18  18     NA        6.9        NA
# ...
# 37  37     NA         NA       6.3
# 38  38     NA         NA       5.8
# 39  39     NA         NA       7.1

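However, because of the row numbers, there are many unwanted NAs. I tried dropping the row column so that the values come out as expected, but that did not work.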

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  mutate(row = row_number()) %>%  spread(Species, Sepal.Length, -row)

iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>% 
  mutate(row = row_number()) %>%  spread(Species, Sepal.Length, -one_of(row))

Expected output:

tmp <- iris %>% select(-c(Sepal.Width, Petal.Length, Petal.Width)) %>% unique() %>%
  mutate(row = row_number()) %>% spread(Species, Sepal.Length)

cbind(setosa=unique(tmp$setosa), versicolor=unique(tmp$versicolor), virginica=unique(tmp$virginica))
#       setosa versicolor virginica
#  [1,]    5.1        7.0       6.3
#  [2,]    4.9        6.4       5.8
#  [3,]    4.7        6.9       7.1
#  [4,]    4.6        5.5       6.5
#  [5,]    5.0        6.5       7.6
#  [6,]    5.4        5.7       4.9
#  [7,]    4.4        6.3       7.3
#  [8,]    4.8        4.9       6.7
#  [9,]    4.3        6.6       7.2
# [10,]    5.8        5.2       6.4
# [11,]    5.7        5.0       6.8
# [12,]    5.2        5.9       5.7
# [13,]    5.5        6.0       7.7
# [14,]    4.5        6.1       6.0
# [15,]    5.3        5.6       6.9
# [16,]    5.1        6.7       5.6
# [17,]    4.9        5.8       6.2
# [18,]    4.7        6.2       6.1
# [19,]    4.6        6.8       7.4
# [20,]    5.0        5.4       7.9
# [21,]    5.4        5.1       5.9

1 answer:

Answer 0 (score: 1)

library(dplyr)
library(tidyr)

tbl_df(iris) %>%
  select(Species, Sepal.Length) %>%       # select columns of interest
  group_by(Species) %>%                   # for each value
  mutate(id = row_number()) %>%           # create a row identifier
  spread(Species, Sepal.Length)           # reshape dataset

# # A tibble: 50 x 4
#       id setosa versicolor virginica
#  * <int>  <dbl>      <dbl>     <dbl>
# 1     1    5.1        7.0       6.3
# 2     2    4.9        6.4       5.8
# 3     3    4.7        6.9       7.1
# 4     4    4.6        5.5       6.3
# 5     5    5.0        6.5       6.5
# 6     6    5.4        5.7       7.6
# 7     7    4.6        6.3       4.9
# 8     8    5.0        4.9       7.3
# 9     9    4.4        6.6       6.7
# 10    10   4.9        5.2       7.2
# # ... with 40 more rows
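
To see why the grouping step matters, here is a minimal sketch on a hypothetical toy data frame (the names toy, grp and val are made up for illustration and are not part of the question). It compares an ungrouped row_number(), as tried in the question, with a grouped one, as used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two groups with two values each
toy <- data.frame(grp = c("a", "a", "b", "b"), val = c(1, 2, 3, 4))

# Ungrouped: ids run 1..4, so every value lands on its own row
# and the other group's column is filled with NA
global_id <- toy %>%
  mutate(id = row_number()) %>%
  spread(grp, val)

# Grouped: ids restart at 1 within each group, so the n-th value of
# each group shares an id and the values line up side by side
group_id <- toy %>%
  group_by(grp) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  spread(grp, val)
```

Printing global_id should show four rows padded with NAs, while group_id should have only two rows with no NAs.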

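Pay special attention to how you create and use the row identifier. The code above relies only on the order of the dataset. If you reorder it in some way, you will get different row combinations. Check the following code: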

tbl_df(iris) %>%
  arrange(desc(Sepal.Length)) %>%         # order your values descending
  select(Species, Sepal.Length) %>%       # select columns of interest
  group_by(Species) %>%                   # for each value
  mutate(id = row_number()) %>%           # create a row identifier
  spread(Species, Sepal.Length)           # reshape dataset

# # A tibble: 50 x 4
#      id setosa versicolor virginica
# * <int>  <dbl>      <dbl>     <dbl>
# 1     1    5.8        7.0       7.9
# 2     2    5.7        6.9       7.7
# 3     3    5.7        6.8       7.7
# 4     4    5.5        6.7       7.7
# 5     5    5.5        6.7       7.7
# 6     6    5.4        6.7       7.6
# 7     7    5.4        6.6       7.4
# 8     8    5.4        6.6       7.3
# 9     9    5.4        6.5       7.2
# 10    10   5.4        6.4       7.2
# # ... with 40 more rows

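The arrange(desc(Sepal.Length)) step, which is the only difference from the previous code, makes sure that the higher values end up in the top rows (descending order).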
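
As a side note, and assuming tidyr 1.0.0 or later is installed (this is not part of the original answer), spread() has been superseded by pivot_wider(). The same grouped identifier approach could be sketched like this; the name wide is only illustrative.

```r
library(dplyr)
library(tidyr)

wide <- iris %>%
  select(Species, Sepal.Length) %>%       # select columns of interest
  group_by(Species) %>%                   # for each species
  mutate(id = row_number()) %>%           # create a per-species row identifier
  ungroup() %>%                           # drop grouping before reshaping
  pivot_wider(names_from = Species,       # one column per species
              values_from = Sepal.Length) # filled with the sepal lengths
```

This should give the same 50-row layout as the first spread() call above, with one id column and one column per species.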