How to add new rows with a different value in one column in R

Asked: 2018-07-24 15:37:23

Tags: r dplyr

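Basically, I have a vector names containing all names, and a data frame df with a BIN (0/1) field and a NAME field. For each row with BIN == 0, I want to create a duplicate row, but with BIN replaced by 1 and with a different name, and append it to the bottom of df. Given the current name, this is how I would pick the new one: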

sample(names[names!=name], 1)

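But I'm not sure how to vectorize this and append the result to df while carrying over the rest of the data from the BIN == 0 rows.

Edit: sample data: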

df = data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"))
names = c("alice","bob","cate","dan")

I'm getting close with something like this:

rbind(df, df %>% filter(BIN == 0) %>%
    mutate(NAME = sample(names[names!=NAME],1)))

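But I get an error: In binattr(e1, e2): length(e1) is not a multiple of length(e2).

3 Answers:

Answer 0: (score: 1)

Here is a simple approach. I think it is fairly straightforward; let me know if you have any questions: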

rename = subset(df, BIN == 0)
rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
while(any(rename$NAME == rename$NEW_NAME)) {
  matches = rename$NAME == rename$NEW_NAME
  rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN = 1
rename$NAME = rename$NEW_NAME
rename$NEW_NAME = NULL

result = rbind(df, rename)
result
#    BIN  NAME
# 1    1 alice
# 2    0   bob
# 3    1  cate
# 21   1 alice

Here is another approach that is less clear but more efficient. It is the "right" way to do it, but it takes a bit more thought and explanation.

df$NAME = factor(df$NAME, levels = names)
rename = subset(df, BIN == 0)
n = length(names)
# we will increment each level number with a random integer from
# 1 to n - 1 (with a mod to make it cyclical)
offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
adjusted = (as.integer(rename$NAME) + offset) %% n
# reconcile 1-indexed factor levels with 0-indexed mod operator
adjusted[adjusted == 0] = n
rename$NAME = names[adjusted]
rename$BIN = 1
result = rbind(df, rename)

(or rewritten for dplyr)

df = mutate(df, NAME = factor(NAME, levels = names))
n = length(names)
df %>% filter(BIN == 0) %>%
  mutate(
    offset = sample(1:(n - 1), size = n(), replace = TRUE),
    adjusted = (as.integer(NAME) + offset) %% n,
    adjusted = if_else(adjusted == 0, n, adjusted),
    NAME = names[adjusted],
    BIN = 1
  ) %>%
  select(-offset, -adjusted) %>%
  rbind(df, .)
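
The offset trick in the two versions above may be easier to trust after an exhaustive check. The following is a minimal sketch, not part of the original answer, that enumerates every combination of level index and offset for the four sample names. The names grid, never_same and tab are introduced here purely for illustration.

```r
# Exhaustive check of the cyclic-offset trick for a small name vector.
names <- c("alice", "bob", "cate", "dan")
n <- length(names)

# every (current index, offset) pair: index in 1..n, offset in 1..(n - 1)
grid <- expand.grid(index = seq_len(n), offset = seq_len(n - 1))

# shift by the offset, wrap with mod, and map a result of 0 back to n
adjusted <- (grid$index + grid$offset) %% n
adjusted[adjusted == 0] <- n

# TRUE when no shifted index ever lands back on its starting index
never_same <- all(adjusted != grid$index)

# rows: starting index, columns: index after the shift
tab <- table(grid$index, adjusted)
```

Because the offset is between 1 and n - 1, it is never a multiple of n, so the shifted index cannot equal the starting one. The table also shows that each name reaches every other name through exactly one offset, so drawing the offset uniformly gives a uniform choice among the other names.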

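Since your question was about the vectorization part, I would recommend testing answers on a sample case with more than one BIN == 0 row, so I used this: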

df = data.frame(BIN=c(1,0,1,0,0,0,0,0,0), NAME=rep(c("alice","bob","cate"), 3))
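
To make that kind of test repeatable, one option is a small checker that inspects any candidate result. This is only a sketch: check_new_rows is a hypothetical helper, not something from the answers, and it assumes the new rows are appended after the original rows in the same order as the BIN == 0 rows they were copied from.

```r
# Hypothetical helper: checks a candidate result against the original df.
check_new_rows <- function(df, result, names) {
  zeros <- df[df$BIN == 0, ]
  added <- result[seq_len(nrow(result)) > nrow(df), ]
  same_count <- nrow(added) == nrow(zeros)
  c(
    one_row_per_zero = same_count,
    bin_set_to_one   = all(added$BIN == 1),
    # only compare names when the row counts line up
    name_changed     = same_count &&
      all(as.character(added$NAME) != as.character(zeros$NAME)),
    name_is_known    = all(as.character(added$NAME) %in% names)
  )
}

set.seed(42)  # arbitrary seed, only so the run is repeatable
names <- c("alice", "bob", "cate", "dan")
df <- data.frame(BIN = c(1, 0, 1, 0, 0, 0, 0, 0, 0),
                 NAME = rep(c("alice", "bob", "cate"), 3),
                 stringsAsFactors = FALSE)

# candidate under test: the while-loop approach from this answer
rename <- subset(df, BIN == 0)
new_name <- sample(names, size = nrow(rename), replace = TRUE)
while (any(new_name == rename$NAME)) {
  matches <- new_name == rename$NAME
  new_name[matches] <- sample(names, size = sum(matches), replace = TRUE)
}
rename$BIN <- 1
rename$NAME <- new_name

checks <- check_new_rows(df, rbind(df, rename), names)
```

All four entries of checks should be TRUE for a correct solution. A solution that samples from every name without excluding the current one can fail name_changed on some runs.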

And, because I was curious, here is a benchmark on 10,000 rows with 26 names. Results first, code below:

# Unit: milliseconds
#             expr        min         lq      mean     median         uq        max neval
#       while_loop  34.070438  34.327020  37.53357  35.548047  39.922918  46.206454    10
#        increment   1.397617   1.458592   1.88796   1.526512   2.123894   3.196104    10
#  increment_dplyr  24.002169  24.681960  25.50568  25.374429  25.750548  28.054954    10
#         map_char 346.531498 347.732905 361.82468 359.736403 374.648635 383.575265    10

The "clever" increment approach is by far the fastest. My guess is that the dplyr version is slower because we cannot simply replace the relevant elements of adjusted directly and instead have to pay the overhead of if_else. On top of that, we are actually adding the adjusted and offset columns to the data frame rather than working with plain vectors. That is enough to make it almost as slow as the while-loop approach, which in turn is still roughly 10 times faster than map_chr (a median of about 36 ms versus 360 ms), since map_chr has to go one row at a time.

library(dplyr)
library(purrr)

nn = 10000
names = letters  # assumed: the 26 names, matching the factor levels below
df = data.frame(
  BIN = sample(0:1, size = nn, replace = TRUE, prob = c(0.7, 0.3)),
  NAME = factor(sample(letters, size = nn, replace = TRUE), levels = letters)
)
get.new.name <- function(c){
  return(sample(names[names!=c],1))
}

microbenchmark::microbenchmark(
  while_loop = {
    rename = subset(df, BIN == 0)
    rename$NEW_NAME = sample(names, size = nrow(rename), replace = TRUE)
    while (any(rename$NAME == rename$NEW_NAME)) {
      matches = rename$NAME == rename$NEW_NAME
      rename$NEW_NAME[matches] = sample(names, size = sum(matches), replace = TRUE)
    }
    rename$BIN = 1
    rename$NAME = rename$NEW_NAME
    rename$NEW_NAME = NULL
    result = rbind(df, rename)
  },
  increment = {
    rename = subset(df, BIN == 0)
    n = length(names)
    # we will increment each level number with a random integer from
    # 1 to n - 1 (with a mod to make it cyclical)
    offset = sample(1:(n - 1), size = nrow(rename), replace = TRUE)
    adjusted = (as.integer(rename$NAME) + offset) %% n
    # reconcile 1-indexed factor levels with 0-indexed mod operator
    adjusted[adjusted == 0] = n
    rename$NAME = names[adjusted]
    rename$BIN = 1
  },
  increment_dplyr = {
    n = length(names)
    df %>% filter(BIN == 0) %>%
      mutate(
        offset = sample(1:(n - 1), size = n(), replace = TRUE),
        adjusted = (as.integer(NAME) + offset) %% n,
        adjusted = if_else(adjusted == 0, n, adjusted),
        NAME = names[adjusted],
        BIN = 1
      ) %>%
      select(-offset, -adjusted)
  },
  map_char = {
    new.df <- df %>% filter(BIN == 0) %>%
      mutate(NAME = map_chr(NAME, get.new.name)) %>%
      mutate(BIN = 1)
  },
  times = 10
)

Answer 1: (score: 0)

It is a bit odd, but I think this should be what you want:

library(tidyverse)

df <- data.frame(BIN=c(1,0,1), NAME=c("alice","bob","cate"), stringsAsFactors = FALSE)
names <- c("alice","bob","cate","dan")

df %>% 
  # note: this draws from all of names, so the new name can equal the current one
  mutate(NAME_new = ifelse(BIN == 0, sample(names, n(), replace = TRUE), NA)) %>% 
  gather(name_type, NAME, NAME:NAME_new, na.rm = TRUE) %>% 
  mutate(BIN = ifelse(name_type == "NAME_new", 1, BIN)) %>% 
  select(-name_type)

Output:

  BIN  NAME
1   1 alice
2   0   bob
3   1  cate
4   1 alice

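Answer 2: (score: 0)

Well, I don't love answering my own question, but I did find a simpler solution. I think it is better than using rowwise(), though I don't know whether it is necessarily the most efficient approach.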

library(tidyverse)

get.new.name <- function(c){
    return(sample(names[names!=c],1))
}

new.df <- rbind(df, df %>% filter(BIN == 0) %>%
    mutate(NAME = map_chr(NAME, get.new.name)) %>%
    mutate(BIN = 1))

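Using map_chr rather than plain map turned out to be important, because the latter returns a list, which ends up as an awkward list column instead of a character vector.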
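
The difference between the two return types can be seen without purrr. The following is a rough base R sketch of the same idea, assuming lapply and vapply as stand-ins for map and map_chr. It is not the code from this answer, and the names zeros, as_list and as_chr are introduced only for illustration.

```r
set.seed(123)  # arbitrary seed, only so the run is repeatable
names <- c("alice", "bob", "cate", "dan")
df <- data.frame(BIN = c(1, 0, 1), NAME = c("alice", "bob", "cate"),
                 stringsAsFactors = FALSE)

get.new.name <- function(current) {
  # pick one candidate by position among the names that differ from current
  candidates <- names[names != current]
  candidates[sample.int(length(candidates), 1)]
}

zeros <- df[df$BIN == 0, ]

# lapply returns a list with one element per row, like purrr::map
as_list <- lapply(zeros$NAME, get.new.name)

# vapply with character(1) returns a plain character vector, like map_chr
as_chr <- vapply(zeros$NAME, get.new.name, character(1), USE.NAMES = FALSE)

zeros$NAME <- as_chr
zeros$BIN <- 1
new.df <- rbind(df, zeros)
```

Assigning as_list to the NAME column would create a list column, which is why the character-returning variant is the one that binds cleanly back onto df.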