Question

I'm working with a large dataset, but let's use a toy example to demonstrate what I'm trying to achieve. I'm using R and dplyr. I have a table:

id  attribute correct
1   a         a
1   b         a
1   c         a
2   d         e
2   e         e
3   d         f

From the above I want to create two columns, attribute_set and label. To clarify, I want:

id  attribute_set   correct   label
1   a, b, c         a         1
2   d, e            e         1
3   d               f         0

attribute_set should be a set (any data structure) containing all the attributes of an id. label should be 1 if the correct value is in attribute_set, and 0 otherwise.

Currently, I create attribute_set like this:

design_mat1 <- design_mat %>%
  group_by(id) %>%
  mutate(attribute_set = paste(unique(attribute), collapse = "|")) %>%
  select(-attribute)

And I generate label like this:

design_mat2b <- design_mat2 %>%
  group_by(id) %>%
  mutate(label = ifelse(correct %in% attribute_set, 1, 0))

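However, my label only works when attribute_set contains a single element. I think I have to strsplit on | or make attribute_set some other data structure. I haven't been able to figure out which alternative data structure to use, nor get the strsplit on | approach to work. Any hints/solutions are appreciated.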

Answer 1

After grouping by 'id', we can use summarise to paste the unique 'attribute' elements together (here with toString), take the first 'correct' value, and set 'label' to 1 if any 'correct' element is found in 'attribute':

library(dplyr)

design_mat %>%
  group_by(id) %>%
  summarise(attribute_set = toString(unique(attribute)),  # e.g. "a, b, c"
            correct = first(correct),
            label = +(any(correct %in% attribute)))       # unary + turns TRUE/FALSE into 1/0
# A tibble: 3 x 4
#      id attribute_set correct label
#   <int> <chr>         <chr>   <int>
# 1     1 a, b, c       a           1
# 2     2 d, e          e           1
# 3     3 d             f           0

Alternatively, include 'correct' in the group_by and then summarise 'attribute_set' and 'label'.
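The code for the second variant did not survive on this page, so the following is only a sketch of what grouping by both 'id' and 'correct' could look like, not the answerer's original code. The data frame is rebuilt from the question's toy table, the name result is hypothetical, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data rebuilt from the question so the sketch is self-contained
design_mat <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L, 3L),
  attribute = c("a", "b", "c", "d", "e", "d"),
  correct   = c("a", "a", "a", "e", "e", "f"),
  stringsAsFactors = FALSE
)

# 'correct' is constant within each id, so grouping by both columns
# keeps it in the output without needing first(correct)
result <- design_mat %>%
  group_by(id, correct) %>%
  summarise(
    attribute_set = toString(unique(attribute)),
    # test membership against the raw per-group vector, not the pasted string
    label = as.integer(any(correct %in% attribute)),
    .groups = "drop"   # assumption: dplyr >= 1.0.0
  ) %>%
  select(id, attribute_set, correct, label)

print(result)
```

In both variants the membership test runs on the original 'attribute' vector of each group, which is why no strsplit is needed. If an actual set is wanted rather than a display string, replacing toString(unique(attribute)) with list(unique(attribute)) should give a list-column instead.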

R, dplyr: collect the unique values of a column, mutate a label based on set intersection
