I have two data frames. DF1:
df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas")
df1 <- data.frame(df1)
DF2:
Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large")
Rank <- c(20,18,22,16,15,17,6,12)
df2 <- data.frame(Word,Rank)
DF1:
ID Sentence
1 A large bunch of purple grapes
2 large green potato sack
3 small red tomatoes
4 yellow and black bananas
DF2:
ID Word Rank
1 green 20
2 purple 18
3 grapes 22
4 small 16
5 Sack 15
6 yellow 17
7 bananas 6
8 large 12
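What I want to do is match the words in df2 against the words contained in the "Sentence" column, and insert a new column into df1 that holds the highest-ranked matching word from df2. So something like this: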
DF1:
ID Sentence Word
1 A large bunch of purple grapes grapes
2 large green potato sack green
3 small red tomatoes small
4 yellow and black bananas yellow
I originally used the following code to match the words, but of course this creates a column containing all of the matching words:
x <- sapply(df2$Word, function(x) grepl(tolower(x), tolower(df1$Sentence)))
df1$top_match <- apply(x, 1, function(i) paste0(names(i)[i], collapse = " "))
Answer 0 (score: 0)
I wrote a small snippet (but with different variable names):
> inp1
ID Word new_word
1 1 large green potato sack green
2 2 A large bunch of purple grapes grapes
3 3 yellow and black bananas yellow
>
> inp2
ID Word Rank
1 1 green 20
2 2 purple 18
3 3 grapes 22
4 4 small 16
5 5 Sack 15
6 6 yellow 17
7 7 bananas 6
8 8 large 12
>
> library(stringr)  # provides str_match()
> inp1$new_word <- lapply(inp1$Word, function(text){ inp2$Word[inp2$Rank == max(inp2$Rank[inp2$Word %in% unique(as.vector(str_match(text,inp2$Word)))])]})
>
> inp1
ID Word new_word
1 1 large green potato sack green
2 2 A large bunch of purple grapes grapes
3 3 yellow and black bananas yellow
>
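The one-liner above is case-sensitive (so "Sack" in inp2 never matches "sack") and yields a list column. A minimal base-R sketch of the same idea follows, assuming case-insensitive substring matching is what is wanted; `top_ranked_word` is a hypothetical helper name, not something from the snippet above.

```r
# Same example data as in the console session above
inp1 <- data.frame(
  ID = 1:3,
  Word = c("large green potato sack",
           "A large bunch of purple grapes",
           "yellow and black bananas"),
  stringsAsFactors = FALSE
)
inp2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "Sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the highest-ranked dictionary word that occurs
# in `text` (case-insensitive substring match), or NA if no word occurs.
top_ranked_word <- function(text, words, ranks) {
  hit <- vapply(tolower(words), grepl, logical(1),
                x = tolower(text), fixed = TRUE)
  if (!any(hit)) return(NA_character_)
  # Among the matching words, keep the one with the largest rank
  words[hit][which.max(ranks[hit])]
}

# vapply() returns a plain character vector rather than a list column
inp1$new_word <- vapply(inp1$Word, top_ranked_word, character(1),
                        words = inp2$Word, ranks = inp2$Rank,
                        USE.NAMES = FALSE)
```

Sentences with no matching word get NA here instead of an empty element, which may or may not be the behaviour you want.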
Answer 1 (score: 0)
Here is a tidyverse + stringr solution:
library(tidyverse)
library(stringr)
df1$Sentence %>%
  str_split_fixed(" ", Inf) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  cbind(ID = rownames(df1), .) %>%
  gather(word_count, Word, -ID) %>%
  inner_join(df2, by = "Word") %>%
  group_by(ID) %>%
  filter(Rank == max(Rank)) %>%
  select(ID, Word) %>%
  right_join(rownames_to_column(df1, "ID"), by = "ID") %>%
  select(ID, Sentence, Word)
Result:
# A tibble: 4 x 3
# Groups: ID [4]
ID Sentence Word
<chr> <chr> <chr>
1 1 A large bunch of purple grapes grapes
2 2 large green potato sack green
3 3 small red tomatoes small
4 4 yellow and black bananas yellow
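Note:
You can ignore the warning about ID being coerced from factor to character. I also modified your dataset to give df1 a proper column name and to suppress the automatic coercion of characters to factors.
Data: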
df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas")
df1 <- data.frame(Sentence = df1, stringsAsFactors = FALSE)
Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large")
Rank <- c(20,18,22,16,15,17,6,12)
df2 <- data.frame(Word,Rank, stringsAsFactors = FALSE)
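For comparison, the same split-then-look-up idea can be sketched in base R without any packages. This is only a sketch under a couple of assumptions: words are separated by whitespace, and matching should be on whole words and case-insensitive. The data is repeated so the block runs on its own.

```r
df1 <- data.frame(
  Sentence = c("A large bunch of purple grapes", "large green potato sack",
               "small red tomatoes", "yellow and black bananas"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Split each sentence into lower-case tokens on runs of whitespace
tokens <- strsplit(tolower(df1$Sentence), "\\s+")

# For each sentence, look up the rank of every token (NA when the token is
# not in df2) and keep the token with the highest rank; NA if nothing matches
df1$Word <- vapply(tokens, function(tok) {
  rank <- df2$Rank[match(tok, tolower(df2$Word))]
  if (all(is.na(rank))) NA_character_ else tok[which.max(rank)]
}, character(1))
```

Because this matches whole tokens rather than substrings, "large" would not match inside a word such as "enlarged", which differs from the grepl-based code in the question. Note also that the returned word is the lower-cased token from the sentence, not the spelling stored in df2.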