I have two simple data frames with the columns "word" and "n", where n is how often the word occurs. Here is an example:
df1 <- data.frame(word=c("beautiful","nice","like","good"),n=c(400,378,29,10))
df2 <- data.frame(word=c("beautiful","nice","like","good","wonderful","awesome","sad","happy"),n=c(6000,20,5,150,300,26,17,195))
df2 contains all the words in df1 plus additional ones, so df1 is only a small subset of df2.
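I have found the words that appear in both df1 and df2. Now, if a particular word is contained in df1, I want to subtract its df1 count from its df2 count, which means I want to perform the following:
df2$n - df1$n wherever df1$word is in df2$word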
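I hope my question is clear.
I have already worked out how to find all the words in df1 that are also contained in df2: df1 %>% filter(df1$word %in% df2$word)
However, I am struggling with the conditional subtraction: the words from df1 must also be in df2, and only then should df2$n - df1$n be computed.
Thanks for your help!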
Answer 0 (score: 3)
Use merge:
> df.tmp <- merge(df1, df2, by="word", all=TRUE)
> df.tmp$result <- df.tmp$n.y - df.tmp$n.x
> df.tmp
word n.x n.y result
1 beautiful 400 6000 5600
2 good 10 150 140
3 like 29 5 -24
4 nice 378 20 -358
5 awesome NA 26 NA
6 happy NA 195 NA
7 sad NA 17 NA
8 wonderful NA 300 NA
If you only want the matching words:
> df.tmp <- merge(df1, df2, by="word")
> df.tmp$result <- df.tmp$n.y - df.tmp$n.x
> df.tmp
word n.x n.y result
1 beautiful 400 6000 5600
2 good 10 150 140
3 like 29 5 -24
4 nice 378 20 -358
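If every row of df2 should be kept and unmatched words left unchanged, the merge approach above can be adapted. The following is a minimal sketch, assuming that a word missing from df1 should count as zero there; the suffixes and the name merged are illustrative choices and not part of the original answer.

```r
df1 <- data.frame(word = c("beautiful", "nice", "like", "good"),
                  n = c(400, 378, 29, 10), stringsAsFactors = FALSE)
df2 <- data.frame(word = c("beautiful", "nice", "like", "good",
                           "wonderful", "awesome", "sad", "happy"),
                  n = c(6000, 20, 5, 150, 300, 26, 17, 195),
                  stringsAsFactors = FALSE)

# Left join on df2: all.x = TRUE keeps words that are absent from df1
merged <- merge(df2, df1, by = "word", all.x = TRUE,
                suffixes = c(".df2", ".df1"))

# Assumption: a word missing from df1 contributes a count of zero
merged$n.df1[is.na(merged$n.df1)] <- 0
merged$result <- merged$n.df2 - merged$n.df1
merged
```

Note that merge sorts the result by the key column by default, so the rows come back in alphabetical order of word rather than in df2's original order.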
Answer 1 (score: 2)
require(dplyr)
df1 %>%
inner_join(df2, by = 'word') %>%
mutate(diff = n.y - n.x) %>%
select(word, diff)
which gives
word diff
1 beautiful 5600
2 nice -358
3 like -24
4 good 140
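The inner join above drops the words that exist only in df2. If those rows should be kept, a left join starting from df2 is one option. This is a sketch under the assumption that an unmatched word should keep its df2 count unchanged; the suffixes and the name result are illustrative.

```r
library(dplyr)

df1 <- data.frame(word = c("beautiful", "nice", "like", "good"),
                  n = c(400, 378, 29, 10), stringsAsFactors = FALSE)
df2 <- data.frame(word = c("beautiful", "nice", "like", "good",
                           "wonderful", "awesome", "sad", "happy"),
                  n = c(6000, 20, 5, 150, 300, 26, 17, 195),
                  stringsAsFactors = FALSE)

result <- df2 %>%
  # keep every row of df2; n.df1 is NA where the word is not in df1
  left_join(df1, by = "word", suffix = c(".df2", ".df1")) %>%
  # assumption: treat a missing df1 count as zero
  mutate(diff = n.df2 - coalesce(n.df1, 0)) %>%
  select(word, diff)
result
```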
Answer 2 (score: 2)
Here is a quick solution using a for loop and the %in% operator.
df2$diff <- NA
for (i in seq_len(nrow(df2))) {
  # look the word up anywhere in df1, not only in row i
  if (df2$word[i] %in% df1$word) {
    df2$diff[i] <- df2$n[i] - df1$n[match(df2$word[i], df1$word)]
  }
}
df2
Output:
> df2
word n diff
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 NA
6 awesome 26 NA
7 sad 17 NA
8 happy 195 NA
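The per-word lookup that the loop performs can also be written without a loop by indexing a named vector. This is a small sketch rather than part of the original answer; the name counts is made up for illustration, and as.character() is used so that the lookup goes by name even if word is stored as a factor.

```r
df1 <- data.frame(word = c("beautiful", "nice", "like", "good"),
                  n = c(400, 378, 29, 10), stringsAsFactors = FALSE)
df2 <- data.frame(word = c("beautiful", "nice", "like", "good",
                           "wonderful", "awesome", "sad", "happy"),
                  n = c(6000, 20, 5, 150, 300, 26, 17, 195),
                  stringsAsFactors = FALSE)

# Named vector of df1 counts, e.g. counts[["nice"]] is 378
counts <- setNames(df1$n, df1$word)

# Indexing by name yields NA for words that are not in df1,
# so unmatched rows get NA, as in the loop's output above
df2$diff <- df2$n - unname(counts[as.character(df2$word)])
df2
```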
Answer 3 (score: 2)
Here is a vectorized base R solution, in which Boolean multiplication replaces the if-then construct used in the for loop of @Rob's answer:
df2$n.adjusted <- df2$n - (df2$word %in% df1$word)* # zero if no match
df1$n[ match(df1$word, df2$word) ] # gets order correct
> df2
word n n.adjusted
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 300
6 awesome 26 26
7 sad 17 17
8 happy 195 195
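Here is the example I used to test the case where the words in df1 are in a different order than in df2 and the length of df2 is not a multiple of the length of df1: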
> df1 <-data.frame(word=c("nice","beautiful","like","good"),n=c(378,400,29,10))
> df2 <- data.frame(word=c("beautiful","nice","like","good","wonderful","awesome","sad"),n=c(6000,20,5,150,300,26,17))
>
> df1
word n
1 nice 378
2 beautiful 400
3 like 29
4 good 10
> df2
word n
1 beautiful 6000
2 nice 20
3 like 5
4 good 150
5 wonderful 300
6 awesome 26
7 sad 17
> df2$n.adjusted <- df2$n - (df2$word %in% df1$word)*df1$n[match(df1$word, df2$word)]
Warning message:
In (df2$word %in% df1$word) * df1$n[match(df1$word, df2$word)] :
longer object length is not a multiple of shorter object length
> df2
word n n.adjusted
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 300
6 awesome 26 26
7 sad 17 17
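The warning in the transcript arises because the four-element vector df1$n[...] is recycled against the seven rows of df2. A hedged alternative sketch: looking up each df2 word in df1 with match(df2$word, df1$word) produces a vector that is already as long as df2, so nothing is recycled and the matching words do not have to be the first rows of df2. The data below are made up for illustration, and treating unmatched words as a zero adjustment is an assumption taken from the output shown above.

```r
# Hypothetical data: df1 is in a different order than df2,
# and the first row of df2 has no match in df1
df1 <- data.frame(word = c("like", "beautiful", "nice"),
                  n = c(29, 400, 378), stringsAsFactors = FALSE)
df2 <- data.frame(word = c("wonderful", "beautiful", "nice", "like", "good"),
                  n = c(300, 6000, 20, 5, 150), stringsAsFactors = FALSE)

# For each row of df2: the row of that word in df1, or NA if absent
idx <- match(df2$word, df1$word)

# df1 count aligned to df2's rows; unmatched words subtract zero
to_subtract <- replace(df1$n[idx], is.na(idx), 0)

df2$n.adjusted <- df2$n - to_subtract
df2
```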