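Suppose I have a data frame that contains some categorical variables and some columns of string values. I want to create a new column that, for each row, pastes together the string values from other rows whenever certain values in the categorical columns match (or do not match). Here is a toy example.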
toy <- data.frame("id" = c(1,2,3,2), "year" = c(2000,2000,2004,2004), "words" = c("a b", "c d", "e b", "c d"))
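I want to create a variable word_pool that pastes together the words column from every other row that meets two conditions: that row's id differs from the current row's id, and that row's year is less than the current row's year.
The result should be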
id  year  words  word_pool
1   2000  a b
2   2000  c d
3   2004  e b    a b c d
2   2004  c d    a b
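Because no year in the toy example is earlier than 2000, the new column is blank for the first two rows. Because id 2 is repeated, the last row's value in the new column is just "a b".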
I have tried various approaches with apply and group_by, but none of them does quite what I need. Any help would be appreciated!
Answer 0 (score: 1)
I implemented a solution with the sqldf and plyr packages. I would not call it elegant, but it works. I hope to see more efficient solutions from others.
library(sqldf)
toy <- data.frame("id" = c(1,2,3,2),
"year" = c(2000,2000,2004,2004),
"words" = c("a b", "c d", "e b", "c d"))
toy
# id year words
#1 1 2000 a b
#2 2 2000 c d
#3 3 2004 e b
#4 2 2004 c d
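# Self-join: pair each row with every row from an earlier year and a different id
# (the question's condition is on id; comparing words only works here by coincidence)
df <- sqldf('SELECT t1.*, t2.words AS word_pool FROM toy t1 LEFT JOIN toy t2
             ON t1.year > t2.year AND
                t1.id <> t2.id')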
df
# id year words word_pool
#1 1 2000 a b <NA>
#2 2 2000 c d <NA>
#3 3 2004 e b a b
#4 3 2004 e b c d
#5 2 2004 c d a b
result <- plyr::ddply(df,c("id","year","words"),
function(dfx)paste(dfx$word_pool,
collapse = " "))
result
# id year words V1
#1 1 2000 a b NA
#2 2 2000 c d NA
#3 2 2004 c d a b
#4 3 2004 e b a b c d
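For comparison, the same pairing of rows can be sketched in base R without SQL. This is only one possible approach, not part of the answer above: it builds a logical matrix with outer() whose entry [i, j] says whether row j qualifies for row i, then pastes the qualifying words per row. The name qualifies is made up for this illustration, and stringsAsFactors = FALSE is set explicitly as an assumption so the sketch behaves the same on R versions before 4.0.

```r
toy <- data.frame(id = c(1, 2, 3, 2),
                  year = c(2000, 2000, 2004, 2004),
                  words = c("a b", "c d", "e b", "c d"),
                  stringsAsFactors = FALSE)

# qualifies[i, j] is TRUE when row j has an earlier year than row i
# and a different id from row i
qualifies <- outer(toy$year, toy$year, ">") & outer(toy$id, toy$id, "!=")

# For each row i, paste the words of all qualifying rows j;
# pasting an empty selection with collapse yields ""
toy$word_pool <- apply(qualifies, 1, function(keep) {
  paste(toy$words[keep], collapse = " ")
})

toy
```

Unlike the LEFT JOIN version, rows with no match get an empty string rather than NA, and the original row order is kept. The matrix has nrow(toy)^2 entries, so this suits small to medium data frames.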
Answer 1 (score: 0)
This can be done with a for loop and which(), written much like an apply call, without using any external library.
## Create data
toy <-
data.frame(
"id" = c(1, 2, 3, 2),
"year" = c(2000, 2000, 2004, 2004),
"words" = c("a b", "c d", "e b", "c d")
)
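# Start with an empty character column so every assignment keeps the same type
toy$word_pool <- ""
# Iterate over rows; length(toy) would count columns, not rows
for (i in seq_len(nrow(toy))) {
  # Indices of rows from an earlier year with a different id
  condition_index <- which(toy$year[i] > toy$year
                           & toy$id[i] != toy$id)
  # assign
  if (length(condition_index) == 0) { # no matching rows
    toy$word_pool[i] <- ""
  } else { # join the matching words with single spaces
    toy$word_pool[i] <- paste(toy$words[condition_index],
                              collapse = " ")
  }
}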
toy
# id year words word_pool
# 1 2000 a b
# 2 2000 c d
# 3 2004 e b a b c d
# 2 2004 c d a b
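The same row-by-row logic can also be written in the apply style the question asked about. The following is a minimal sketch using vapply from base R; the helper name pool_for_row is hypothetical and introduced only for this illustration, and stringsAsFactors = FALSE is an assumption so that words stays a character column on older R versions.

```r
toy <- data.frame(id = c(1, 2, 3, 2),
                  year = c(2000, 2000, 2004, 2004),
                  words = c("a b", "c d", "e b", "c d"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: collect the words of all rows that have an
# earlier year and a different id than row i
pool_for_row <- function(i, df) {
  keep <- df$year < df$year[i] & df$id != df$id[i]
  # paste() on an empty selection with collapse returns "",
  # so no special case is needed for rows without matches
  paste(df$words[keep], collapse = " ")
}

# vapply checks that every result is a single character string
toy$word_pool <- vapply(seq_len(nrow(toy)), pool_for_row,
                        character(1), df = toy)

toy
```

Using vapply with character(1) rather than sapply makes the return type explicit, so a data frame with zero rows still yields a character column.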