Question

I want to subset a data.frame in R based on a condition. I have the following data.frame:

df

id     |    message      |     cluster
-------+-----------------+----------------
1      | Test A          | 1
2      | Test B          | 1
3      | Test C          | 3
4      | Test D          | 1
5      | Test E          | 2 
6      | Test F          | 2
7      | Test G          | 3
8      | Test H          | 3
9      | Test I          | 1 
10     | Test K          | 2
11     | Test L          | 4
12     | Test M          | 4

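I want to construct a new data.frame with 4 rows (the number of distinct clusters), taking the first message in each cluster as that cluster's representative. So I want to get the following data.frame: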

df2

id     |    message      |     cluster
-------+-----------------+----------------
1      | Test A          | 1
3      | Test C          | 3
5      | Test E          | 2 
11     | Test L          | 4

Answer 1

As an alternative, the dplyr package works well for this kind of thing.

text <- "id     |    message      |     cluster
1      | Test A          | 1
2      | Test B          | 1
3      | Test C          | 3
4      | Test D          | 1
5      | Test E          | 2
6      | Test F          | 2
7      | Test G          | 3
8      | Test H          | 3
9      | Test I          | 1
10     | Test K          | 2
11     | Test L          | 4
12     | Test M          | 4"

library(readr)
df <- read_delim(text, delim = "|", trim_ws=TRUE) 

library(dplyr)
df2 <-
    df %>% 
    group_by(cluster) %>%
    summarize(message=first(message))

The result is:

> df2
# A tibble: 4 x 2
  cluster message
    <int>   <chr>
1       1  Test A
2       2  Test E
3       3  Test C
4       4  Test L

(It may be useful to arrange the data first, so that "first" is predictable.)
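Note that summarize() keeps only the grouping column and the summarized column, so id is dropped from the result. A minimal sketch of one way to sort explicitly and keep the whole first row, assuming "first" means the lowest id within each cluster (that ordering is an assumption, and the data frame is rebuilt here with data.frame() only to keep the example self-contained):

library(dplyr)

# Same data as in the question, built directly instead of parsed from text
df <- data.frame(
  id = 1:12,
  message = paste("Test", c("A", "B", "C", "D", "E", "F",
                            "G", "H", "I", "K", "L", "M")),
  cluster = c(1, 1, 3, 1, 2, 2, 3, 3, 1, 2, 4, 4)
)

df2 <- df %>%
  arrange(id) %>%        # assumption: "first" means lowest id
  group_by(cluster) %>%
  slice(1) %>%           # keep the entire first row of each group, including id
  ungroup()

This should return one row per cluster with all three columns. If the original row order matters more than grouping, distinct(df, cluster, .keep_all = TRUE) is another option worth trying.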

Answer 2

Get the indices of the rows to keep:

indices <- !duplicated(df$cluster)

Use it to subset the data frame:

df2 <- df[indices, ]
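For reference, here is the same approach as a self-contained sketch in base R. The data frame is rebuilt with data.frame() purely for illustration. Note that indices is a logical vector (TRUE at the first occurrence of each cluster value) rather than a vector of row positions, and that the original row order is preserved:

# Same data as in the question
df <- data.frame(
  id = 1:12,
  message = paste("Test", c("A", "B", "C", "D", "E", "F",
                            "G", "H", "I", "K", "L", "M")),
  cluster = c(1, 1, 3, 1, 2, 2, 3, 3, 1, 2, 4, 4)
)

# duplicated() flags every repeat of a value; negating it marks first occurrences
indices <- !duplicated(df$cluster)

# Logical row subsetting keeps all columns
df2 <- df[indices, ]

df2$id
# [1]  1  3  5 11

Because no sorting is involved, the rows come out in the order the clusters first appear (1, 3, 2, 4), which matches the df2 shown in the question.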
