Question

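I have a large dataset of animal IDs and dates. There are two groups in this dataset, but there is no grouping variable, so I have to infer who belongs to which group from the dates they seem to share.

Dummy data: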

mydf <- data.frame(
  Date = sort(rep(seq(as.Date("2012/1/1"), as.Date("2012/1/4"), length.out = 4), 5)),
  ID = c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

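An additional problem is that, from time to time, an ID belonging to group 1 shows up on a date associated with group 2, and this has confounded every attempt at grouping I have made so far.

What I need is an output with each ID and a new Group ID.

IDs 1:5 all appear together on the first and the third, so they are most likely one group. IDs 6:10 appear on the second and the fourth, and are most likely the second group.

ID 5 belongs to group 1: although it was observed once on the second alongside IDs 6:9, it was observed twice (on the first and the third) alongside IDs 1:4, so it most likely belongs to group 1.

All of my attempts have failed. Can anyone offer a solution?

Thanks in advance.

Edit:

I thought we had pinned down a solution using Jon's kmeans approach (in the comments below):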

mydf_wide <- mydf %>% 
  select(ID, Date) %>%
  distinct(ID, Date) %>%   # keep one row per ID/date pair
  mutate(x = 1) %>%
  spread(Date, x, fill = 0)

mydf_wide$clusters <- mydf_wide %>% 
  kmeans(centers = 2) %>%
  pluck("cluster")

However, I have since found that the kmeans method does not get it right every time. See below:

The groups where certain tags (ID) appear on the same day as each other are fairly easy to spot by eye. There are two groups, one is in the center, and the other group appears on either side. The clustering should be vertical by common dates as in Jon's answer below, but it is clustering across the entire date range. (Apologies for the messy axis labels)

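The k-means approach also works on other groups, but it does not consistently group by common dates. I think a clustering approach is sensible, but I wonder whether another clustering method might work better than kmeans?

Alternatively, could a filtering approach help cut down the background noise and make the kmeans method more reliable?

Thanks again for any advice.

Cheers.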

Answer 1

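My idea here is that you simply assign each date to a group and then take the mean of the groups for each ID. From there you can round to the nearest integer. In this case, the mean group for ID == 5 would be 1.33.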

library(dplyr)
mydf %>% 
  mutate(group = case_when(
    Date %in% as.Date(c("2012-01-01", "2012-01-03")) ~ 1,
    Date %in% as.Date(c("2012-01-02", "2012-01-04")) ~ 2,
    TRUE                                    ~ NA_real_
  )) %>% 
  group_by(ID) %>% 
  summarise(likely_group = mean(group) %>% round)

Which gives you the following:

# A tibble: 10 x 2
      ID likely_group
   <dbl>        <dbl>
 1     1            1
 2     2            1
 3     3            1
 4     4            1
 5     5            1
 6     6            2
 7     7            2
 8     8            2
 9     9            2
10    10            2

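This works as long as no single ID is split evenly between the groups. With the information provided, though, there is currently no way to resolve that case.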
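To make the tie case visible rather than letting it pass silently, the evenly split IDs can be flagged before rounding. Note that R's round() rounds halves to the even digit, so a mean of exactly 1.5 would quietly become group 2. The following is a minimal base R sketch on the dummy data; the names group1_dates, group2_dates, summary_df and tied are illustrative assumptions, not part of the original answer.

```r
# Dummy data from the question
mydf <- data.frame(
  Date = sort(rep(seq(as.Date("2012/1/1"), as.Date("2012/1/4"), length.out = 4), 5)),
  ID = c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Assumed date-to-group mapping (hypothetical names), as in the answer above
group1_dates <- as.Date(c("2012-01-01", "2012-01-03"))
group2_dates <- as.Date(c("2012-01-02", "2012-01-04"))
mydf$group <- ifelse(mydf$Date %in% group1_dates, 1,
                     ifelse(mydf$Date %in% group2_dates, 2, NA))

# Mean group per ID; a mean of exactly 1.5 means an even split
mean_group <- tapply(mydf$group, mydf$ID, mean)

summary_df <- data.frame(
  ID         = as.numeric(names(mean_group)),
  mean_group = as.vector(mean_group),
  tied       = as.vector(mean_group) == 1.5
)

# Leave tied IDs unassigned (NA) instead of letting round() pick group 2
summary_df$likely_group <- ifelse(summary_df$tied, NA, round(summary_df$mean_group))

print(summary_df)
```

Any row with tied == TRUE would then need a manual decision or extra information, such as more observation dates.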

Answer 2

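As a general solution, you might consider k-means as an automated way of splitting the data into groups based on similarity with other IDs.

First, I convert the data to wide format so that each ID gets one row. That is then fed into the base kmeans function, which returns the clustering output as a list, and purrr::pluck extracts just the assignment part of that list.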

library(tidyverse)
mydf_wide <- mydf %>% 
  mutate(x = 1) %>%
  spread(Date, x, fill = 0)

mydf_wide
 #   ID 2012-01-01 2012-01-02 2012-01-03 2012-01-04
 #1   1          1          0          1          0
 #2   2          1          0          1          0
 #3   3          1          0          1          0
 #4   4          1          0          1          0
 #5   5          1          1          1          0
 #6   6          0          1          0          1
 #7   7          0          1          0          1
 #8   8          0          1          0          1
 #9   9          0          1          0          1
 #10 10          0          0          0          1

clusters <- mydf_wide %>% 
  kmeans(centers = 2) %>%
  pluck("cluster")

clusters
 # [1] 2 2 2 2 2 1 1 1 1 1
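
One thing to watch in the call above: mydf_wide still contains the ID column, so kmeans() treats the ID number itself as a feature alongside the date columns. With the dummy data that happens to point the same way as the dates, but with arbitrary tag numbers it may pull the clusters apart by ID rather than by shared dates. Below is a minimal base R sketch that clusters on the date columns only; the presence matrix, the seed and nstart = 25 are illustrative assumptions rather than part of the code above.

```r
# Dummy data from the question
mydf <- data.frame(
  Date = sort(rep(seq(as.Date("2012/1/1"), as.Date("2012/1/4"), length.out = 4), 5)),
  ID = c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

ids   <- sort(unique(mydf$ID))
dates <- sort(unique(mydf$Date))

# ID-by-date presence matrix (1 = seen on that date). IDs are kept as
# row names only, so they are not used as a clustering feature.
presence <- matrix(0, nrow = length(ids), ncol = length(dates),
                   dimnames = list(as.character(ids), as.character(dates)))
presence[cbind(match(mydf$ID, ids), match(mydf$Date, dates))] <- 1

# k-means starts from random centers: fix the seed for repeatable output
# and try several starts so a poor initial draw is less likely to win.
set.seed(42)
fit <- kmeans(presence, centers = 2, nstart = 25)

clusters <- data.frame(ID = ids, cluster = unname(fit$cluster))
print(clusters)
```

The cluster labels (1 or 2) are arbitrary and can swap between seeds; only the partition of IDs is meaningful.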

If you add these back to the original data and plot, this is what it looks like.

mydf_wide %>%
  mutate(cluster = clusters) %>%

  # ggplot works better with long (tidy) data...
  gather(date, val, -ID, -cluster) %>%
  filter(val != 0) %>%
  arrange(cluster) %>%

  ggplot(aes(date, ID, color = as.factor(cluster))) + 
  geom_point(size = 5) +
  scale_y_continuous(breaks = 1:10, minor_breaks = NULL) +
  scale_color_discrete(name = "cluster")
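
If k-means keeps splitting on something other than shared dates, one alternative that may be worth trying is hierarchical clustering on a binary (Jaccard-style) distance, which compares two IDs only on the dates where at least one of them was seen. The sketch below uses base R's dist(method = "binary"), hclust() and cutree() on the dummy data. The average linkage and k = 2 are assumptions, and it has not been checked against the real dataset.

```r
# Dummy data from the question
mydf <- data.frame(
  Date = sort(rep(seq(as.Date("2012/1/1"), as.Date("2012/1/4"), length.out = 4), 5)),
  ID = c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

ids   <- sort(unique(mydf$ID))
dates <- sort(unique(mydf$Date))

# ID-by-date presence matrix (1 = seen on that date), IDs as row names
presence <- matrix(0, nrow = length(ids), ncol = length(dates),
                   dimnames = list(as.character(ids), as.character(dates)))
presence[cbind(match(mydf$ID, ids), match(mydf$Date, dates))] <- 1

# Binary distance: share of dates on which exactly one of the two IDs
# was seen, among the dates on which at least one of them was seen
d <- dist(presence, method = "binary")

# Average-linkage tree, cut into two groups (both choices are assumptions)
hc <- hclust(d, method = "average")
groups <- data.frame(ID = ids, group = unname(cutree(hc, k = 2)))
print(groups)
```

Because the distance ignores dates where neither ID was present, long stretches of shared absence do not make two IDs look similar, which may help when the real data spans many dates.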

Create a group variable based on common dates
