Select unique values without repeating columns

Date: 2018-10-31 15:41:23

Tags: r dataframe

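This is a very specific problem: I have a series of observations from a number of subjects over several years (each subject is observed at most once per year). I want to select only one observation per individual (I don't care which year it comes from), in such a way that I end up with a similar number of observations per year and the selection is as random as possible.

So, starting from df, where a 1 in a year column means the individual was observed that year and a 0 means there was no observation that year: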

df <- data.frame(Ind   = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
             Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), 
             Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0), 
             Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1), 
             Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

I would like to end up with a data frame of the same shape in which each individual keeps exactly one of its observations.
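To make the goal concrete, the sketch below counts how many observations are available per year and what a perfectly balanced selection would give each year. The helper `is_valid_selection()` is hypothetical (it is not part of the question) and only checks the "one observed year per individual" requirement, not the balance.

```r
# Data from the question: 1 = observed in that year, 0 = not observed.
df <- data.frame(Ind   = letters[1:11],
                 Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
                 Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
                 Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
                 Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

avail  <- colSums(df[-1])             # observations available per year: 7 5 7 7
target <- nrow(df) / (ncol(df) - 1)   # picks per year if perfectly balanced: 2.75

# Hypothetical helper: TRUE if `sel` (same layout as df, cells 0/1/NA) selects
# exactly one year per individual and only years in which it was observed.
is_valid_selection <- function(df, sel) {
  s      <- as.matrix(sel[-1])
  picked <- !is.na(s) & s == 1
  all(rowSums(picked) == 1) && all(as.matrix(df[-1])[picked] == 1)
}
```

Since 11 individuals cannot be split evenly over 4 years, "similar" here means about 2 or 3 picks per year.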

Edit: trying to apply the suggested solutions (unsuccessfully)

(1) earch's answer:

library(dplyr)
library(tidyr)

df <- as_tibble(df)

# number of observations available in each year
year.weights <- df %>% 
  gather(Year, Obs, -Ind) %>% 
  group_by(Year) %>% 
  summarize(wt = sum(Obs)) %>% 
  ungroup

# one observed year per individual, rarer years weighted more heavily
df %>% 
  gather(Year, Obs, -Ind) %>%
  filter(Obs == 1) %>% 
  left_join(year.weights, by = "Year") %>% 
  group_by(Ind) %>% 
  sample_n(1, weight = 1 / wt) %>% 
  select(-wt) %>% 
  spread(Year, Obs) %>% 
  ungroup

This fails at the left_join step with the error: Error: 'by' can't contain join column 'Year' which is missing from RHS. I tried to fix this by giving the only variable in the RHS the name "Year":

names(year.weights) <- "Year"

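But this now gives a new error: Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) : Can't join on 'Year' x 'Year' because of incompatible types (numeric / character). That actually makes sense, because the Year column on the LHS contains Year1, Year2, Year3, etc., while the Year column on the RHS contains the number 27.

As far as I can tell, this happens because I don't see what earch is trying to accomplish. I do believe a workable solution is possible through sample_n and its weight argument, but I haven't figured it out yet.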
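One possible explanation, which the question does not confirm, is that summarize() was being supplied by another attached package (plyr is a common culprit when it is loaded after dplyr). In that case the grouping is ignored and year.weights collapses to a single row containing only wt, which would match both error messages. A minimal sketch, assuming that is the cause, is to call the dplyr and tidyr functions with explicit namespaces:

```r
library(dplyr)
library(tidyr)

df <- data.frame(Ind   = letters[1:11],
                 Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
                 Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
                 Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
                 Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

# Explicit namespaces rule out masking by another package's summarize().
year.weights <- df %>%
  tidyr::gather(Year, Obs, -Ind) %>%
  dplyr::group_by(Year) %>%
  dplyr::summarize(wt = sum(Obs))

names(year.weights)  # "Year" "wt": one row per year, so the join key exists
```

If year.weights still has a single column after this change, the cause lies elsewhere.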

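(2) Mikey's answer:

This works fine (I don't get the errors I had before), but it doesn't guarantee that every "Year" column ends up with an equal (or similar) number of 1s.

So if I run the code several times to test it, I get: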

# first time
      [,1] [,2] [,3] [,4]
 [1,]    0    0    0    1
 [2,]    1    0    0    0
 [3,]    0    0    1    0
 [4,]    0    1    0    0
 [5,]    1    0    0    0
 [6,]    0    0    1    0
 [7,]    0    0    0    1
 [8,]    0    1    0    0
 [9,]    0    0    0    1
[10,]    0    0    0    1
[11,]    0    0    0    1

# second time
      [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    1    0    0    0
 [3,]    0    0    1    0
 [4,]    0    1    0    0
 [5,]    0    0    0    1
 [6,]    1    0    0    0
 [7,]    1    0    0    0
 [8,]    0    0    0    1
 [9,]    0    0    0    1
[10,]    0    0    0    1
[11,]    0    0    1    0
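The imbalance is easier to see as column totals. The sketch below re-enters the chosen column for each row of the two matrices above (the vectors `first` and `second` are transcribed by hand from that output) and tallies them:

```r
# Column chosen in each of the 11 rows, read off the printed matrices.
first  <- c(4, 1, 3, 2, 1, 3, 4, 2, 4, 4, 4)
second <- c(1, 1, 3, 2, 4, 1, 1, 4, 4, 4, 3)

first_totals  <- tabulate(first,  nbins = 4)  # 2 2 2 5
second_totals <- tabulate(second, nbins = 4)  # 4 1 2 4
```

A balanced result for 11 rows and 4 columns would have totals of 2 or 3 in every column.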

(3) Andre Elrico's answer:

It has the same problem as answer (2): it does not guarantee an equal number of 1s per year. See two random outputs:

# first try
   Ind Year1 Year2 Year3 Year4
1    a    NA    NA    NA     1
2    b    NA    NA     1    NA
3    c    NA    NA     1    NA
4    d    NA     1    NA    NA
5    e     1    NA    NA    NA
6    f    NA    NA     1    NA
7    g     1    NA    NA    NA
8    h    NA    NA    NA     1
9    i    NA    NA    NA     1
10   j    NA    NA    NA     1
11   k    NA    NA     1    NA

# second try
   Ind Year1 Year2 Year3 Year4
1    a     1    NA    NA    NA
2    b     1    NA    NA    NA
3    c    NA    NA     1    NA
4    d    NA    NA     1    NA
5    e    NA     1    NA    NA
6    f    NA    NA    NA     1
7    g    NA    NA    NA     1
8    h    NA    NA    NA     1
9    i    NA    NA    NA     1
10   j    NA     1    NA    NA
11   k    NA    NA     1    NA

(4) paoloeusebi's answer has the same problem as the previous ones. It does not guarantee an equal number of selected 1s per year:

# first try
   Ind Year1 Year2 Year3 Year4
1    a     1    NA    NA    NA
2    b    NA    NA    NA     0
3    c    NA    NA     1    NA
4    d    NA    NA    NA     0
5    e    NA    NA     1    NA
6    f    NA    NA    NA     1
7    g     1    NA    NA    NA
8    h    NA    NA     0    NA
9    i    NA    NA    NA     1
10   j    NA    NA    NA     1
11   k    NA    NA     1    NA

# second try
   Ind Year1 Year2 Year3 Year4
1    a    NA    NA    NA     1
2    b    NA     0    NA    NA
3    c    NA     1    NA    NA
4    d    NA    NA    NA     0
5    e    NA    NA    NA     1
6    f    NA     0    NA    NA
7    g    NA     0    NA    NA
8    h    NA    NA     0    NA
9    i    NA    NA     0    NA
10   j    NA    NA     0    NA
11   k    NA     0    NA    NA

4 Answers:

Answer 0 (score: 3)

If you want a random year with a 1 for each individual, here is a dplyr / tidyr approach:

> df <- data.frame(Ind   = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
+                  Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), 
+                  Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0), 
+                  Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1), 
+                  Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))
> 
> year.weights <- df %>% 
+   gather(Year, Obs, -Ind) %>% 
+   group_by(Year) %>% 
+   summarize(wt = sum(Obs)) %>% 
+   ungroup
> 
> year.weights
# A tibble: 4 x 2
  Year     wt
  <chr> <dbl>
1 Year1     7
2 Year2     5
3 Year3     7
4 Year4     7
> 
> 
> df %>% 
+   gather(Year, Obs, -Ind) %>%
+   filter(Obs == 1) %>% 
+   left_join(year.weights, by = "Year") %>% 
+   group_by(Ind) %>% 
+   sample_n(1, weight = 1 / wt) %>% 
+   select(-wt) %>% 
+   spread(Year, Obs) %>% 
+   ungroup
# A tibble: 11 x 5
   Ind   Year1 Year2 Year3 Year4
   <fct> <dbl> <dbl> <dbl> <dbl>
 1 a         1    NA    NA    NA
 2 b        NA    NA     1    NA
 3 c        NA     1    NA    NA
 4 d         1    NA    NA    NA
 5 e        NA    NA     1    NA
 6 f         1    NA    NA    NA
 7 g        NA    NA    NA     1
 8 h        NA    NA    NA     1
 9 i        NA    NA    NA     1
10 j        NA     1    NA    NA
11 k        NA    NA    NA     1
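For readers without dplyr, the same idea can be sketched in base R: each individual draws one of its observed years, with weights inversely proportional to how many observations that year has. This is an approximation of the approach above, not the answerer's code, and the weights only nudge the draw toward rarer years; they do not guarantee equal totals.

```r
df <- data.frame(Ind   = letters[1:11],
                 Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
                 Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
                 Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
                 Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

m  <- as.matrix(df[-1])
wt <- colSums(m)                 # observations per year

set.seed(1)                      # only to make the sketch reproducible
pick <- apply(m, 1, function(row) {
  ind <- which(row == 1)         # years in which this individual was observed
  # index by position so a single candidate year is not expanded to 1:ind
  ind[sample.int(length(ind), size = 1, prob = 1 / wt[ind])]
})

# Picks per year; assumes every individual has at least one observation.
table(factor(colnames(m)[pick], levels = colnames(m)))
```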

Answer 1 (score: 1)

Here is some code. It may not be very elegant, but it's a start:

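new_mat = function(df, max_iter = 100){
    ind_names <- df[,1]
    df <- df[,-1]                  # keep only the year columns
    n = NROW(df)
    k = NCOL(df)
    max_col = ceiling(n / k)       # cap on how often any one column may be chosen
    # sample from x by position, so a length-one x is not treated as 1:x
    resample = function(x, ...) x[sample.int(length(x), ...)]
    # vector of n zeros with a 1 at position i
    one_hot = function(i, n){
        x = double(n)
        x[i] = 1
        return (x)
        }
    counter = 0
    flag = TRUE                    # TRUE until a complete assignment is found
    while (flag && counter <= max_iter){
        counter = counter + 1
        out = matrix(0, n, k)
        weights = rep(max_col, k)  # remaining capacity of each column
        index = sample(1:n)        # visit the rows in random order
        c2 = 0                     # rows assigned in this attempt
        for (i in index){
            ind = which(df[i,] == 1)
            probs = weights[ind]
            # every observed column of this row is full: abandon the attempt
            if (max(probs) == 0)
                break
            out[i,] = one_hot(resample(ind, size = 1, prob = probs), k)
            weights = weights - out[i,]
            c2 = c2 + 1
            }
        if (c2 == length(index))
            flag = FALSE
        }
    if (flag)
        stop('No matrix found. Try again.')
    final <- cbind(ind_names, as.data.frame(out))
    names(final) <- c("ind", names(df))
    return (final)
    }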

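Requiring each column to have (approximately) the same number of occurrences is quite a problem if you also want the selected column to be chosen at random. It is especially problematic when not every row has an observation in every column. It creates an implicit dependency between rows, which may be undesirable.

Essentially, this sets a column's selection weight to zero once that column has been chosen max_col times, where max_col is the maximum number of occurrences any column may have if the column counts are to stay roughly equal. (I borrowed the idea of weighting the columns from earch.)

If something goes wrong (for example, no column with weight > 0 can be chosen for the next row), the procedure is rerun, up to max_iter times, with the rows processed in a different order.

The main drawback of this approach is that it has to iterate over all rows, possibly repeatedly. Given your constraints, I'm not sure this can be avoided. So if your data frame is very large, the computation may take a while. On the dataset you provided, however, the function usually returns a matrix after a single pass, and after a few passes at most.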
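The dependency between rows can be seen directly in the example data. The sketch below (variable names are illustrative) computes the per-column cap and lists the individuals that have only one observed year, whose choice is therefore forced:

```r
df <- data.frame(Ind   = letters[1:11],
                 Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
                 Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
                 Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
                 Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

max_col <- ceiling(nrow(df) / (ncol(df) - 1))  # per-column cap: 3
n_obs   <- rowSums(df[-1])                     # observed years per individual
forced  <- as.character(df$Ind[n_obs == 1])    # "i": observed only in Year4
year4_candidates <- sum(df$Year4)              # 7 individuals compete for 3 slots
```

If three other individuals take Year4 before row i is visited, that attempt cannot be completed, which is the situation the retry loop handles.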

Answer 2 (score: 0)

Here is a solution that, for each subject, randomly replaces 3 of the 4 years with NA:

for (i in 1:dim(df)[1]){
    df[i,c(sample(2:5,3))]<-NA
    }
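Because the loop above keeps a random year whether or not it holds a 1, it can retain a 0. A minimal variant, offered as a sketch and not as the answerer's code, restricts the draw to the years in which the individual was actually observed. Like the original, it makes no attempt to balance the yearly totals.

```r
df <- data.frame(Ind   = letters[1:11],
                 Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
                 Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
                 Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
                 Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

set.seed(42)                               # reproducibility only
sel <- df
for (i in seq_len(nrow(df))) {
  observed <- which(df[i, -1] == 1) + 1    # column positions, offset past Ind
  keep <- observed[sample.int(length(observed), 1)]
  sel[i, -1]   <- NA                       # blank every year for this row
  sel[i, keep] <- 1                        # then restore the one kept year
}
```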

Answer 3 (score: 0)

m   <- df[-1]
IND <- rowSums(m) > 0
m[] <- NA
m[cbind(which(IND),max.col(df[-1])[IND])] <- 1
cbind(df[1],m)

Result:

#   Ind Year1 Year2 Year3 Year4
#1    a     1    NA    NA    NA
#2    b    NA    NA     1    NA
#3    c    NA    NA     1    NA
#4    d    NA    NA     1    NA
#5    e    NA    NA     1    NA
#6    f     1    NA    NA    NA
#7    g    NA    NA    NA     1
#8    h    NA    NA    NA     1
#9    i    NA    NA    NA     1
#10   j    NA     1    NA    NA
#11   k    NA    NA     1    NA

If you would rather not clutter the global environment with variables:

(function(df){
    m   <- df[-1]
    IND <- rowSums(m) > 0
    m[] <- NA
    m[cbind(which(IND),max.col(df[-1])[IND])] <- 1
    cbind(df[1],m)
})(df)   # run this n-times
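The "run this n-times" comment could be read as repeating the random draw and keeping the most balanced result. That reading is an assumption, not something the answer states. Under it, a sketch might look like this, using the fact that max.col() breaks ties at random by default:

```r
df <- data.frame(Ind   = letters[1:11],
                 Year1 = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
                 Year2 = c(0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0),
                 Year3 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1),
                 Year4 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1))

m <- as.matrix(df[-1])

set.seed(123)                                  # reproducibility only
# Each column of `draws` is one run: a random observed year for every row.
draws <- replicate(200, max.col(m, ties.method = "random"))

# Imbalance of a run = largest yearly total minus smallest yearly total.
spread <- apply(draws, 2, function(p) diff(range(tabulate(p, nbins = ncol(m)))))
best   <- draws[, which.min(spread)]           # most balanced run found

best_totals <- tabulate(best, nbins = ncol(m)) # picks per year in that run
```

Keeping the best of many runs favours balance but is not a uniform sample over all balanced selections, and 200 is an arbitrary number of repetitions.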