Question

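I have a long-format dataset that I can't seem to get into the right shape for analysis. Perhaps the shape is fine as it is; my experience is almost entirely with wide-format data, so this data file doesn't make sense to me. (A reproducible data file is at the end of the post.)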

> head(df,10)
    ID attributes values
1   1         AU    AAA
2   1         AU    BBB
3   1         YR   2014
4   2         AU    CCC
5   2         AU    DDD
6   2         AU    EEE
7   2         AU    FFF
8   2         AU    GGG
9   2         YR   2013
10  3         AU    HHH

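The attributes column contains the variables I'm interested in, and I'd like to run a series of aggregate functions on them. For example, I'd like to:

1. Get the number of authors (AU) per ID. For example: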

   ID       N.AU
    1           2
    2           5
    3           1
    4           2
    5           5
    6           1

2. Compute the median number of authors (AU) by year (YR):

YR           Median.N.AU   
2013          5.0
2014          1.5

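For both of these examples, I've tried dplyr with group_by and summarise, but haven't cracked the code. I've also tried dcast. I'm hoping for a solution that I can easily generalize to a larger data frame with more attributes, each of which may take a single value or multiple values. Any help or pointers to similar solutions would be much appreciated.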

attributes = c("AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR",
   "AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR")
ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6)
values = c("AAA", "BBB", "2014", "CCC", "DDD", "EEE", "FFF", "GGG", "2013", "HHH", "2014",
   "III", "JJJ", "2014", "KKK", "LLL", "MMM", "NNN", "OOO", "2013", "PPP", "2014")
df <- data.frame(ID, attributes, values)

Answer 1

I think you're confused because you actually have two tables of data linked by a common ID:

library(dplyr)
df <- as_tibble(df)  # tbl_df() is deprecated in current dplyr; printed output below is from an older version

years <- df %>% 
  filter(attributes == "YR") %>% 
  select(id = ID, year = values)
years
#> Source: local data frame [6 x 2]
#> 
#>    id year
#> 1   1 2014
#> 2   2 2013
#> 3   3 2014
#> 4   4 2014
#> 5   5 2013
#> .. ..  ...

authors <- df %>% 
  filter(attributes == "AU") %>% 
  select(id = ID, author = values)
authors
#> Source: local data frame [16 x 2]
#> 
#>    id author
#> 1   1    AAA
#> 2   1    BBB
#> 3   2    CCC
#> 4   2    DDD
#> 5   2    EEE
#> .. ..    ...

Once you have the data in this form, it's easy to answer the questions you're interested in:

Authors per paper:

n_authors <- authors %>% 
  group_by(id) %>% 
  summarise(n = n())

or

n_authors <- authors %>% count(id)

Median authors per year:

n_authors %>%
  left_join(years) %>%
  group_by(year) %>%
  summarise(median(n))
#> Joining by: "id"
#> Source: local data frame [2 x 2]
#> 
#>   year median(n)
#> 1 2013       5.0
#> 2 2014       1.5
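
To generalize this to more attributes, the split step could be wrapped in a small function. The following is only a sketch under a few assumptions: `attr_table` is a hypothetical helper name (not part of dplyr), the long table keeps the `ID`, `attributes` and `values` columns from the question, and the join key is spelled out with `by = "id"` instead of relying on the join message.

```r
library(dplyr)

# Rebuild the question's long-format data so this block runs on its own.
attributes <- c("AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR",
                "AU", "AU", "YR", "AU", "AU", "AU", "AU", "AU", "YR", "AU", "YR")
ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6)
values <- c("AAA", "BBB", "2014", "CCC", "DDD", "EEE", "FFF", "GGG", "2013", "HHH", "2014",
            "III", "JJJ", "2014", "KKK", "LLL", "MMM", "NNN", "OOO", "2013", "PPP", "2014")
df <- data.frame(ID, attributes, values, stringsAsFactors = FALSE)

# Hypothetical helper: extract one attribute into its own two-column table.
attr_table <- function(data, attr, name) {
  out <- data[data$attributes == attr, c("ID", "values")]
  names(out) <- c("id", name)
  out
}

years   <- attr_table(df, "YR", "year")
authors <- attr_table(df, "AU", "author")

# Count authors per paper, attach the year, then take the median per year.
median_by_year <- authors %>%
  count(id) %>%
  left_join(years, by = "id") %>%
  group_by(year) %>%
  summarise(median_n = median(n))
median_by_year
```

Any further attribute that takes a single value per ID could be pulled out with another `attr_table()` call and joined on in the same way.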

Answer 2

Here is a possible data.table solution.

I would also suggest creating an aggregated dataset with separate columns. For example:

library(data.table)
(subdf <- as.data.table(df)[, .(N.AU = sum(attributes == "AU"),
                                Year = values[attributes == "YR"]) , ID])
#    ID N.AU Year
# 1:  1    2 2014
# 2:  2    5 2013
# 3:  3    1 2014
# 4:  4    2 2014
# 5:  5    5 2013
# 6:  6    1 2014

Calculating the median per year:

subdf[, .(Median.N.AU = median(N.AU)), keyby = Year]
#    Year Median.N.AU
# 1: 2013         5.0
# 2: 2014         1.5

Answer 3

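I initially misunderstood the structure of the dataset. Thanks to the comments below, I realized that your data needs to be restructured.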

library(plyr)  # provides count(df, vars = ...)
library(doBy)  # provides summaryBy()

# split the data out
df1 <- df[df$attributes == "AU",]
df2 <- df[df$attributes == "YR",]

# just keeping the columns with data as opposed to the label
df3 <- merge(df1, df2, by="ID")[,c(1,3,5)]
# set column names for clarification
colnames(df3) <- c("ID","author","year")

# get author counts (plyr::count, which takes a vars argument; dplyr::count does not)
num.authors <- plyr::count(df3, vars=c("ID","year"))
#   ID year freq
# 1  1 2014    2
# 2  2 2013    5
# 3  3 2014    1
# 4  4 2014    2
# 5  5 2013    5
# 6  6 2014    1

summaryBy(freq ~ year, data = num.authors, FUN = list(median))
#   year freq.median
# 1 2013         5.0
# 2 2014         1.5

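A nice thing about summaryBy is that you can add any already-defined function to the list, and you'll get another column with that additional measure (e.g., mean, sd, etc.).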
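
For instance, passing several functions at once might look like the sketch below. It assumes the doBy package is installed, and it rebuilds `num.authors` by hand from the counts shown above so that the block runs on its own.

```r
library(doBy)

# Author counts per paper, copied from the num.authors output above.
num.authors <- data.frame(
  ID   = 1:6,
  year = c("2014", "2013", "2014", "2014", "2013", "2014"),
  freq = c(2, 5, 1, 2, 5, 1)
)

# One output column per function: median, mean and standard deviation of freq.
stats <- summaryBy(freq ~ year, data = num.authors, FUN = c(median, mean, sd))
stats
```

Each function in `FUN` contributes one summary column per year, so other measures can be added by extending that vector.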

Reshape long data in R, or aggregate?
