Question

I have a very large dataset of historical football results. Here is part of it:

  Season              home          visitor  FT
    1954       Aston Villa              SHW 0-0
    1956       Aston Villa              SHW 5-0
    1957       Aston Villa              SHW 2-0
    1960       Aston Villa              SHW 4-1
    1987       Aston Villa              HUL 5-0
    1987       Aston Villa              HUD 1-1
    1987       Aston Villa              BLB 1-1
    1933 Preston North End              NOT 4-0
    1958 Preston North End              NOT 3-5
    1960 Preston North End              NOT 0-1
    1962 Preston North End              SWA 6-3
    1976           Walsall              SHW 5-1
    1977           Walsall              SHW 1-1
    2002           Walsall Sheffield United 0-1
    2002           Walsall       Gillingham 1-0

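For each home team (a factor), I would like to return the unique levels of another factor (Season) that occur together with it. For the example above, this would return: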

Aston Villa - 1954, 1956, 1957, 1960, 1987
Preston North End - 1933, 1958, 1960, 1962
Walsall - 1976, 1977, 2002

I wanted to do this in dplyr as an exercise, but I have not managed it.

I tried:

library(dplyr)
demodf%>%
group_by(home)%>%
summarize(levels(Season))
#Error: expecting a single value

Out of interest, I ran the following to see whether I could at least get the first year returned for each home team:

demodf%>%
group_by(home)%>%
summarize(levels(Season)[1])

This gave me:

#               home levels(Season)[1]
#1       Aston Villa              1933
#2 Preston North End              1933
#3           Walsall              1933

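This isn't right: it has just returned the first level of the Season factor across the whole data frame (1933), rather than the first year/level of the Season factor for each team, which I thought group_by would have taken care of.

I would appreciate any help.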

The following lets you reproduce the table above:

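A minimal sketch of such a data frame, rebuilt from the excerpt shown above rather than taken from the original post. It assumes Season and home are factors (consistent with the str() output in Answer 1); storing visitor and FT as plain character columns is a guess.

```r
# Reconstruction of the excerpt in the question; not the author's original code.
# Assumption: Season and home are factors, visitor and FT are character.
demodf <- data.frame(
  Season = factor(c(1954, 1956, 1957, 1960, 1987, 1987, 1987,
                    1933, 1958, 1960, 1962,
                    1976, 1977, 2002, 2002)),
  home = factor(c(rep("Aston Villa", 7),
                  rep("Preston North End", 4),
                  rep("Walsall", 4))),
  visitor = c("SHW", "SHW", "SHW", "SHW", "HUL", "HUD", "BLB",
              "NOT", "NOT", "NOT", "SWA",
              "SHW", "SHW", "Sheffield United", "Gillingham"),
  FT = c("0-0", "5-0", "2-0", "4-1", "5-0", "1-1", "1-1",
         "4-0", "3-5", "0-1", "6-3",
         "5-1", "1-1", "0-1", "1-0"),
  stringsAsFactors = FALSE
)
```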

Answer 1

In this case, you can use by():

with(demodf, by(Season, home, unique))
# home: Aston Villa
# [1] 1954 1956 1957 1960 1987
# Levels: 1933 1954 1956 1957 1958 1960 1962 1976 1977 1987 2002
# ------------------------------------------------------------ 
# home: Preston North End
# [1] 1933 1958 1960 1962
# Levels: 1933 1954 1956 1957 1958 1960 1962 1976 1977 1987 2002
# ------------------------------------------------------------ 
# home: Walsall
# [1] 1976 1977 2002
# Levels: 1933 1954 1956 1957 1958 1960 1962 1976 1977 1987 2002

The "data.table" package can also hold a list as a column of a data.table, like this:

library(data.table)
DT <- as.data.table(demodf)
DT[, list(Season = list(unique(Season))), by = home]
#                 home                   Season
# 1:       Aston Villa 1954,1956,1957,1960,1987
# 2: Preston North End      1933,1958,1960,1962
# 3:           Walsall           1976,1977,2002

Note the structure of the result:

str(.Last.value)
# Classes ‘data.table’ and 'data.frame':  3 obs. of  2 variables:
#  $ home  : Factor w/ 3 levels "Aston Villa",..: 1 2 3
#  $ Season:List of 3
#   ..$ : Factor w/ 11 levels "1933","1954",..: 2 3 4 6 10
#   ..$ : Factor w/ 11 levels "1933","1954",..: 1 5 6 7
#   ..$ : Factor w/ 11 levels "1933","1954",..: 8 9 11
#  - attr(*, ".internal.selfref")=<externalptr>

Answer 2

Having Season as a factor complicates things slightly, but

demodf %>% group_by(home) %>% do(data.frame(Seasons = unique(.$Season)))

will work.

Note that using unique() rather than levels() is the simpler approach.
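To illustrate why unique() is the better fit here, the following is a small base R sketch on a made-up three-value factor (not the data from the question). Subsetting a factor keeps every level, so levels() still reports the full set, which would also explain why 1933 came back for every team in the question; unique() reports only the values actually present.

```r
# Hypothetical toy factor, used only to show the difference
season <- factor(c(1954, 1933, 1976))

# Take a subset containing a single value, much as a grouped operation would
first_only <- season[1]

# levels() is an attribute of the factor and survives subsetting unchanged
all_levels <- levels(first_only)

# unique() looks at the values that are actually present in the subset
present <- as.character(unique(first_only))

# droplevels() discards unused levels if levels() is really what is wanted
dropped <- levels(droplevels(first_only))

print(all_levels)
print(present)
print(dropped)
```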

Answer 3

I used paste() to mimic the output you wanted:

demodf%>%
  group_by(home)%>%
  summarise( summary =  paste(unique(Season),collapse=","))

which gives

               home                  summary
1       Aston Villa 1954,1956,1957,1960,1987
2 Preston North End      1933,1958,1960,1962
3           Walsall           1976,1977,2002
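If the seasons are needed as separate values rather than one pasted string, a list-column is one possible variant, similar in spirit to the data.table result in Answer 1. This is only a sketch: toy is a hypothetical stand-in for demodf, and it assumes a dplyr version in which summarise() accepts list-columns.

```r
library(dplyr)

# Hypothetical toy data standing in for demodf
toy <- data.frame(
  home   = c("Aston Villa", "Aston Villa", "Aston Villa", "Walsall", "Walsall"),
  Season = factor(c(1954, 1987, 1987, 2002, 2002))
)

# One row per home team; Seasons holds a character vector of that team's seasons
seasons_by_team <- toy %>%
  group_by(home) %>%
  summarise(Seasons = list(as.character(unique(Season))))

# Individual seasons for the first team can then be pulled out directly
first_team_seasons <- seasons_by_team$Seasons[[1]]
```

Wrapping unique() in as.character() avoids carrying the full set of factor levels into every list element.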

For each level of a factor, return all levels of another factor from the same data frame - using dplyr? [R

3 answers: