Question

I have two data frames. DF1:

df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas")
df1 <- data.frame(df1)

DF2:

Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large")

Rank <- c(20,18,22,16,15,17,6,12)

df2 <- data.frame(Word,Rank)

DF1:

ID      Sentence  
 1      A large bunch of purple grapes  
 2      large green potato sack 
 3      small red tomatoes  
 4      yellow and black bananas

DF2:

ID      Word      Rank
 1      green      20
 2      purple     18
 3      grapes     22
 4      small      16
 5      Sack       15
 6      yellow     17
 7      bananas    6
 8      large      12

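What I want to do is match the words in df2 against the words contained in the "Sentence" column, and insert a new column into df1 containing the highest-ranked matching word from df2. So something like this:

DF1: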

ID     Sentence                         Word
 1     A large bunch of purple grapes   grapes
 2     large green potato sack          green
 3     small red tomatoes               small
 4     yellow and black bananas         yellow

I initially used the following code to match the words, but of course this creates a column containing all of the matching words:

x <- sapply(df2$Word, function(x) grepl(tolower(x), tolower(df1$Sentence)))

df1$top_match <- apply(x, 1, function(i) paste0(names(i)[i], collapse = " "))

Answer 1

I wrote a small snippet (but with different variable names):

> inp1 
  ID                           Word new_word
1  1        large green potato sack    green
2  2 A large bunch of purple grapes   grapes
3  3       yellow and black bananas   yellow
> 
> inp2
  ID    Word Rank
1  1   green   20
2  2  purple   18
3  3  grapes   22
4  4   small   16
5  5    Sack   15
6  6  yellow   17
7  7 bananas    6
8  8   large   12
> 
> library(stringr)  # provides str_match()
> inp1$new_word <- lapply(inp1$Word, function(text){ inp2$Word[inp2$Rank == max(inp2$Rank[inp2$Word %in% unique(as.vector(str_match(text,inp2$Word)))])]})
> 
> inp1
  ID                           Word new_word
1  1        large green potato sack    green
2  2 A large bunch of purple grapes   grapes
3  3       yellow and black bananas   yellow
>
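
The one-liner above depends on stringr, matches case-sensitively (so "Sack" in inp2 never matches "sack" in a sentence), and returns a list column. As a rough base R sketch of the same idea, the version below compares lowercased whole words and returns NA when a sentence contains no ranked word. The helper name top_ranked_word is hypothetical, and the sentence column is assumed to be called Sentence here rather than Word:

```r
# Sample data as shown in the question (note the capitalised "Sack").
inp1 <- data.frame(
  ID = 1:4,
  Sentence = c("A large bunch of purple grapes", "large green potato sack",
               "small red tomatoes", "yellow and black bananas"),
  stringsAsFactors = FALSE
)
inp2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "Sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from any package): returns the highest-ranked
# word that occurs in the sentence as a whole word, ignoring case.
top_ranked_word <- function(sentence, words, ranks) {
  # Split the lowercased sentence on anything that is not a letter.
  tokens <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  hit <- tolower(words) %in% tokens
  if (!any(hit)) return(NA_character_)   # no ranked word in this sentence
  words[hit][which.max(ranks[hit])]
}

# vapply returns a plain character vector rather than a list column.
inp1$new_word <- vapply(inp1$Sentence, top_ranked_word, character(1),
                        words = inp2$Word, ranks = inp2$Rank,
                        USE.NAMES = FALSE)
```

Because which.max returns the first maximum, a tie in Rank resolves to whichever tied word comes first in inp2.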

Answer 2

Here is a tidyverse + stringr solution:

library(tidyverse)
library(stringr)

df1$Sentence %>%
  str_split_fixed(" ", Inf) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  cbind(ID = rownames(df1), .) %>%
  gather(word_count, Word, -ID) %>%
  inner_join(df2, by = "Word") %>%
  group_by(ID) %>%
  filter(Rank == max(Rank)) %>%
  select(ID, Word) %>%
  right_join(rownames_to_column(df1, "ID"), by = "ID") %>%
  select(ID, Sentence, Word)

Result:

# A tibble: 4 x 3
# Groups:   ID [4]
     ID                       Sentence   Word
  <chr>                          <chr>  <chr>
1     1 A large bunch of purple grapes grapes
2     2        large green potato sack  green
3     3             small red tomatoes  small
4     4       yellow and black bananas yellow

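Note:

You can ignore the warning about coercing ID from factor to character. I also modified your dataset to give df1 a proper column name and to suppress the automatic coercion of characters to factors.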

Data:

df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas")
df1 <- data.frame(Sentence = df1, stringsAsFactors = FALSE)

Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large")
Rank <- c(20,18,22,16,15,17,6,12)
df2 <- data.frame(Word,Rank, stringsAsFactors = FALSE)
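
If you would rather not load the tidyverse, the same steps (split into long format, join on Word, keep the top Rank per sentence, join back) can be sketched in base R. This is only an illustration of the idea, assuming the data above; the intermediate names tokens, long, matched, best and result are made up for the example:

```r
df1 <- data.frame(
  Sentence = c("A large bunch of purple grapes", "large green potato sack",
               "small red tomatoes", "yellow and black bananas"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Long format: one row per (sentence ID, word).
tokens <- strsplit(df1$Sentence, " ", fixed = TRUE)
long <- data.frame(
  ID = rep(seq_along(tokens), lengths(tokens)),
  Word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Inner join: keep only words that appear in df2, with their Rank.
matched <- merge(long, df2, by = "Word")

# Sort so the highest Rank comes first within each ID, then keep that row.
matched <- matched[order(matched$ID, -matched$Rank), ]
best <- matched[!duplicated(matched$ID), c("ID", "Word")]

# Left join back onto df1; sentences with no match would get NA.
df1$ID <- seq_len(nrow(df1))
result <- merge(df1, best, by = "ID", all.x = TRUE)
result <- result[order(result$ID), c("ID", "Sentence", "Word")]
```

One difference from the pipeline above: filter(Rank == max(Rank)) keeps every tied word, whereas this sketch keeps exactly one row per sentence.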

Match the highest-ranked word to text in a data frame column in R

2 answers: