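Converting a dataset

Time: 2017-12-08 16:55:49

Tags: r

I want to convert an aggregated dataset into a new derived dataset that contains the individual instances corresponding to the original aggregation. I loaded the Titanic dataset in R and looked at its data frame. I can see that each tuple is stored with an aggregated frequency (for example, 20 female adult crew members survived the disaster). I want to rebuild the dataset by replacing each tuple with the corresponding non-aggregated tuples (for example, 20 copies of the tuple "Crew, Female, Adult, Yes"). I know how to aggregate a dataset, but I can't work out how to convert one that is already aggregated. Any hints would be much appreciated.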

1 answer:

Answer 0 (score: 0)

library(dplyr)
library(purrr)
library(tidyr)

# keep only the rows with frequency > 0
# (note: the name T masks R's built-in shorthand for TRUE)
T = data.frame(Titanic, stringsAsFactors = FALSE) %>% filter(Freq > 0)

as_tibble(T) %>%                        # as_tibble() only used to produce a more readable output (i.e. print only a few rows)
  mutate(id = map(Freq, ~ 1:.)) %>%     # create a vector of ids from 1 to Freq for each row
  unnest(id)                            # expand the vector

# # A tibble: 2,201 x 6
#    Class    Sex    Age Survived  Freq    id
#   <fctr> <fctr> <fctr>   <fctr> <dbl> <int>
# 1    3rd   Male  Child       No    35     1
# 2    3rd   Male  Child       No    35     2
# 3    3rd   Male  Child       No    35     3
# 4    3rd   Male  Child       No    35     4
# 5    3rd   Male  Child       No    35     5
# 6    3rd   Male  Child       No    35     6
# 7    3rd   Male  Child       No    35     7
# 8    3rd   Male  Child       No    35     8
# 9    3rd   Male  Child       No    35     9
# 10   3rd   Male  Child       No    35    10
# # ... with 2,191 more rows

If you want, you can drop the id column. I left it in only to make it easier to see how the process works.
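
If the id column is not needed at all, the expansion can probably be done in one step. The following is a minimal sketch, not part of the original answer, and it assumes a tidyr version that provides uncount(); the names agg and expanded are made up for illustration.

```r
library(tidyr)

# Titanic is a 4-dimensional table in base R's 'datasets' package;
# as.data.frame() gives one row per combination plus a Freq column.
agg <- as.data.frame(Titanic)

# uncount() repeats each row Freq times and, by default, removes the Freq column.
# Rows with Freq == 0 are simply dropped, so no filter step is needed here.
expanded <- uncount(agg, Freq)

head(expanded)
```

With the pipeline from the answer, a plain select(-id) from dplyr at the end would drop the helper column in a similar way.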

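You can also check that the new dataset has 2,201 rows, which equals sum(T$Freq). So, as expected, the sum of the frequencies in the original dataset is the number of rows in the new one.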
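
The row-count check can be tried without any extra packages. This is a hedged base R sketch of the same idea (repeat each row index Freq times), using the illustrative names agg, idx and expanded, which do not appear in the answer above.

```r
# Aggregated form: Class, Sex, Age, Survived, Freq
agg <- as.data.frame(Titanic)
agg <- agg[agg$Freq > 0, ]

# Repeat each row index Freq times, then use the result to index the data frame.
idx <- rep(seq_len(nrow(agg)), times = agg$Freq)
expanded <- agg[idx, c("Class", "Sex", "Age", "Survived")]
rownames(expanded) <- NULL

# The number of expanded rows should match the total of the frequencies.
nrow(expanded) == sum(agg$Freq)

# Count the expanded rows for the example tuple from the question.
sum(expanded$Class == "Crew" & expanded$Sex == "Female" &
    expanded$Age == "Adult" & expanded$Survived == "Yes")
```

The last expression counts the rows for the "Crew, Female, Adult, Yes" tuple mentioned in the question, which gives a quick spot check against the aggregated Freq value.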