Question

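I am trying to take a data.frame and aggregate the values of one column, grouped by the values in other columns and by whether the values in a final column fall within certain ranges. In SQL I would do a simple GROUP BY and write a loop, but I am just starting out with R and am having a hard time figuring out the syntax. Basically, I have a dataset that looks like this: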

Type    Type2   Bucket  Value
   A    1       1       1
   A    2       1       2
   A    3       1       1
   A    4       1       3
   A    5       1       1
   A    1       2       1
   A    2       2       2
   A    3       2       1
   A    4       2       3

I would like the output to look like this:

Type    Type2   Bucket  Value
A       <4      1       4
A       >=4     1       4
A       <4      2       4
A       >=4     2       3

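In my head this is easy, but I come from a SQL background and am trying to do it in R. I have played around with functions such as split and ddply, but have had little success putting it all together. Thanks.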

Answer 1

You can do this with dplyr. Assuming you have more than one Type:

library(dplyr)

df %>%
  group_by(Type, Bucket, Type2 = ifelse(Type2 < 4, "<4", ">=4")) %>%
  summarize(Value = sum(Value)) %>%
  select(Type, Type2, Bucket, Value)

Result:

# A tibble: 4 x 4
# Groups:   Type, Bucket [2]
    Type Type2 Bucket Value
  <fctr> <chr>  <int> <int>
1      A    <4      1     4
2      A   >=4      1     4
3      A    <4      2     4
4      A   >=4      2     3
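
If you would rather not load a package, the same grouping can be sketched in base R. This is only an illustration, not part of the original answer: it rebuilds the sample data inline so it runs on its own, relabels Type2 with ifelse as in the dplyr pipeline above, and sums with aggregate. The names binned and res are made up for the example.

```r
# Sample data, identical to the "Data" section, rebuilt so this runs standalone.
df <- data.frame(
  Type   = factor(rep("A", 9)),
  Type2  = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L),
  Bucket = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Value  = c(1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L)
)

# Replace Type2 with its range label ("binned" is an illustrative name).
binned <- transform(df, Type2 = ifelse(Type2 < 4, "<4", ">=4"))

# Sum Value within every Type / Type2 / Bucket combination.
res <- aggregate(Value ~ Type + Type2 + Bucket, data = binned, FUN = sum)
res
```

The formula interface puts the grouping columns first and the summed column last, so no reordering step is needed afterwards.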

Since you mentioned that you have a SQL background, here is a sqldf solution:

library(sqldf)

sqldf("select Type, 
              case when Type2 < 4 then '<4' else '>=4' end as Type_2,
              Bucket, 
              sum(Value) as Value
          from df
          group by Type, Bucket, Type_2")

Result:

  Type Type_2 Bucket Value
1    A     <4      1     4
2    A    >=4      1     4
3    A     <4      2     4
4    A    >=4      2     3
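
The CASE WHEN expression above has a rough base R counterpart in cut(), which may be handier than ifelse once there are more than two ranges. The following is a sketch under that assumption rather than part of the original answer. The column name Range and the result name res are invented for illustration, and the sample data is rebuilt inline so the snippet runs on its own.

```r
# Sample data, identical to the "Data" section, rebuilt so this runs standalone.
df <- data.frame(
  Type   = factor(rep("A", 9)),
  Type2  = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L),
  Bucket = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Value  = c(1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L)
)

# cut() maps each Type2 to an interval. right = FALSE makes the intervals
# closed on the left, i.e. [-Inf, 4) and [4, Inf), so 4 itself falls in ">=4".
# "Range" is a hypothetical column name used only in this sketch.
df$Range <- cut(df$Type2,
                breaks = c(-Inf, 4, Inf),
                labels = c("<4", ">=4"),
                right  = FALSE)

# Sum Value per Type / Range / Bucket, as the GROUP BY does.
res <- aggregate(Value ~ Type + Range + Bucket, data = df, FUN = sum)
res
```

Adding more break points (with one label per interval) extends this to any number of ranges without nesting ifelse calls.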

Data:

df = structure(list(Type = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), Type2 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L), Bucket = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Value = c(1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L)), .Names = c("Type", "Type2", "Bucket", "Value"), class = "data.frame", row.names = c(NA, -9L))
