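I'm looking for some help extracting keywords from text. I have two data frames. The first has a description column; the second has a single column containing keywords.
I want to search the description field for the keywords in dataframe2 and create a new column in dataframe1 holding the matched keywords. If more than one keyword matches, the new column should list all of them separated by commas, as shown below.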
Dataframe2
Keywords
New
FUND
EVENT
Author
book
Dataframe1
ID  NAME  Month  DESCRIPTION         Keywords
12  x1    Jan    funding recived     fund
23  x2    Feb    author of the book  author, book
14  x3    Mar    new year event      new, event
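Also, I need the keyword even when it appears only as part of a longer word in the description, i.e. for "funding" I should get the keyword "fund" in the new column.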
Answer 0 (score: 4)
We can use regex_left_join from fuzzyjoin, then group_by the original columns and concatenate the matched keywords (toString, a wrapper around paste).
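library(fuzzyjoin)
library(dplyr)

df1 %>%
  # one output row per (df1 row, keyword) pair where the keyword regex matches DESCRIPTION
  regex_left_join(df2, by = c('DESCRIPTION' = 'Keywords'),
                  ignore_case = TRUE) %>%
  group_by(ID, NAME, Month, DESCRIPTION) %>%
  # collapse the matched keywords into one comma-separated string per original row
  summarise(Keywords = toString(unique(tolower(Keywords))))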
# A tibble: 3 x 5
# Groups: ID, NAME, Month [?]
# ID NAME Month DESCRIPTION Keywords
# <int> <chr> <chr> <chr> <chr>
#1 12 x1 Jan funding recived fund
#2 14 x3 Mar new year event new, event
#3 23 x2 Feb author of the book author, book
Data:
df1 <- structure(list(ID = c(12L, 23L, 14L), NAME = c("x1", "x2", "x3"
), Month = c("Jan", "Feb", "Mar"), DESCRIPTION = c("funding recived",
"author of the book", "new year event")), .Names = c("ID", "NAME",
"Month", "DESCRIPTION"), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(Keywords = c("New", "FUND", "EVENT", "Author",
"book")), .Names = "Keywords", class = "data.frame", row.names = c(NA,
-5L))
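For intuition about the regex matching that the join relies on, here is a minimal base R sketch that does not use fuzzyjoin. It is an illustration, not what regex_left_join does internally: it combines the keywords into a single alternation pattern, so it finds non-overlapping matches only, and it assumes the keywords contain no regex metacharacters. The variable names are placeholders.

```r
# Sketch only: extract keyword hits with one combined regex (base R).
descriptions <- c("funding recived", "author of the book", "new year event")
keywords <- c("New", "FUND", "EVENT", "Author", "book")

# Builds "New|FUND|EVENT|Author|book"; assumes no regex metacharacters in keywords
pattern <- paste(keywords, collapse = "|")

# gregexpr() locates every non-overlapping match; regmatches() pulls the text out
hits <- regmatches(descriptions,
                   gregexpr(pattern, descriptions, ignore.case = TRUE))

# One comma-separated string per description, lowercased, duplicates dropped
found <- vapply(hits, function(h) toString(unique(tolower(h))), character(1))
found
```

A description with no match yields an empty string here, which may or may not be what you want in the final column.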
Answer 1 (score: 1)
One solution is to use stringr::str_detect to check whether each entry of Keywords occurs in DESCRIPTION.
library(stringr)
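# For each description, keep the keywords found in it (both sides lowercased
# so the match ignores case), then join them with commas.
df1$Keywords <- mapply(function(x) paste(df2$Keywords[str_detect(tolower(x), tolower(df2$Keywords))],
                                         collapse = ","), df1$DESCRIPTION)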
df1
# ID NAME Month DESCRIPTION Keywords
# 1 12 x1 Jan funding recived FUND
# 2 23 x2 Feb author of the book Author,book
# 3 14 x3 Mar new year event New,EVENT
Data:
df1 <- read.table(text =
"ID NAME Month DESCRIPTION
12 x1 Jan 'funding recived'
23 x2 Feb 'author of the book'
14 x3 Mar 'new year event'",
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text =
"Keywords
New
FUND
EVENT
Author
book",
header = TRUE, stringsAsFactors = FALSE)
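The same per-description lookup can also be sketched without stringr. This is a minimal base R sketch, not the answer's code: the helper name match_keywords is hypothetical, keywords are matched as literal substrings (fixed = TRUE) rather than as regexes, and both sides are lowercased so the comparison ignores case.

```r
# Hypothetical helper (not from the answer above): for each description,
# collect every keyword that occurs in it as a substring, ignoring case.
match_keywords <- function(descriptions, keywords) {
  kw <- unique(tolower(keywords))
  vapply(tolower(descriptions), function(d) {
    # fixed = TRUE treats each keyword as literal text, not a regex
    is_hit <- vapply(kw, function(k) grepl(k, d, fixed = TRUE), logical(1))
    toString(kw[is_hit])
  }, character(1), USE.NAMES = FALSE)
}

df1 <- data.frame(
  ID = c(12L, 23L, 14L),
  DESCRIPTION = c("funding recived", "author of the book", "new year event"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Keywords = c("New", "FUND", "EVENT", "Author", "book"),
  stringsAsFactors = FALSE
)

df1$Keywords <- match_keywords(df1$DESCRIPTION, df2$Keywords)
df1
```

Because matching is by substring, "funding" yields "fund", which is the behaviour the question asks for; switch to word-boundary regexes if whole-word matches are wanted instead.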