Matching text across columns in a data frame

Time: 2018-05-30 20:15:18

Tags: r dplyr stringr

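I want to find occurrences of keywords in a search string (in this case, a research question). I think I'm close, but I'm not sure what is going wrong. My data frame looks like this: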

Q1                                                     keywords
How do you assess strategic deterrence messaging?      Deterrence messaging effects perception assessment
An energy transition for green growth                  Energy transition sustainable
Some other research question here                      research keywords topics etc

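Here Q1 holds the question and keywords is a list of words (in this case, a boolean search cleaned to remove AND, NOT and OR). I want to determine whether any of the words in the keywords string appear in Q1, find the matches, and count how often that happens (so I can say that keywords appear in column1 n% of the time, in column2 n% of the time, and so on).

I've made a start using the tidyverse: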

df_final <- df %>% 
  mutate(matches = str_extract_all(
    Q1,
    str_c(df$keywords, collapse = "|") %>% regex(ignore_case = T)),
    match = map_chr(matches, str_c, collapse = ", "),
    count = map_int(matches, length)
  )

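But I'm not getting any matches. I assume it has something to do with my keywords column? Does it need to be converted to a vector or a comma-separated list to work properly? Thanks in advance for any suggestions!

Edit: sample output from dput():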

structure(list(Q1 = c("Assessing the effects of strategic deterrence messaging in the cognitive dimension", 
"How do you assess effects of strategic deterrence messaging?", 
"Determine Strategic Implications of Climate Change to USG/DoD"
), keywords = c("Deterrence messaging effects perception assessment", 
"political philosophy sociology social sciences history marketing power structure government governing class bourgeoisie social class military class ruling class governing class", 
"Climate Change Strategic Global Warming Strategic Climate Change Policy Global Warming Policy"
)), .Names = c("Q1", "keywords"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

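2 answers:

Answer 0 (score: 1)

The following code returns your data.frame together with, for each question, the number of its keywords that occur in the question. For your sample data the counts are 3, 0 and 6. A keyword that is listed twice is counted twice, which is why the third row gives 6 rather than 3. All functions come from tidyverse packages.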

library(stringr)
library(dplyr)
library(purrr)

df %>%
  mutate(count = map2_int(Q1, keywords, function(x, y) {
    # split the keywords on spaces, then count how many are found in the question
    words <- str_to_lower(flatten_chr(str_split(y, " ")))
    sum(str_detect(str_to_lower(x), words))
  }))

# A tibble: 3 x 3
  Q1                                                                                 keywords                                        count
  <chr>                                                                              <chr>                                           <int>
1 Assessing the effects of strategic deterrence messaging in the cognitive dimension Deterrence messaging effects perception assess~     3
2 How do you assess effects of strategic deterrence messaging?                       political philosophy sociology social sciences~     0
3 Determine Strategic Implications of Climate Change to USG/DoD                      Climate Change Strategic Global Warming Strate~     6

Data:

df <- structure(list(Q1 = c("Assessing the effects of strategic deterrence messaging in the cognitive dimension", 
"How do you assess effects of strategic deterrence messaging?", 
"Determine Strategic Implications of Climate Change to USG/DoD"
), keywords = c("Deterrence messaging effects perception assessment", 
"political philosophy sociology social sciences history marketing power structure government governing class bourgeoisie social class military class ruling class governing class", 
"Climate Change Strategic Global Warming Strategic Climate Change Policy Global Warming Policy"
)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"
))
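
The question also asked for the matched words themselves, not only a count. The original attempt found nothing because `str_c(df$keywords, collapse = "|")` turns each whole keywords string into one alternative, and a full multi-word string never appears verbatim in a question. The following is a minimal sketch of one way to get the matches per row, not part of the answer above. The helper `match_keywords` is a hypothetical name, and the sketch assumes the keywords are plain words with no regex metacharacters. Unlike the answer's code, it matches whole words only and drops duplicate keywords first.

```r
library(dplyr)
library(purrr)
library(stringr)

# Two rows copied from the question's dput() output
df <- tibble(
  Q1 = c(
    "Assessing the effects of strategic deterrence messaging in the cognitive dimension",
    "Determine Strategic Implications of Climate Change to USG/DoD"
  ),
  keywords = c(
    "Deterrence messaging effects perception assessment",
    "Climate Change Strategic Global Warming Strategic Climate Change Policy Global Warming Policy"
  )
)

# Hypothetical helper: the distinct keywords that occur in a question as whole words
# (assumes keywords contain no regex metacharacters)
match_keywords <- function(question, keywords) {
  words <- unique(str_split(str_to_lower(keywords), "\\s+")[[1]])
  words[str_detect(str_to_lower(question), str_c("\\b", words, "\\b"))]
}

df_matches <- df %>%
  mutate(
    matches = map2(Q1, keywords, match_keywords),
    match   = map_chr(matches, paste, collapse = ", "),
    count   = map_int(matches, length)
  )

# Share of questions with at least one keyword match
share_matched <- mean(df_matches$count > 0)
```

With these two rows, `count` is 3 for both, because the repeated keywords in the second row are no longer counted twice.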

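Answer 1 (score: 0)

Probably not optimal, but it may help. I added tolower() because I assume you don't care about the difference between Deterrence and deterrence.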

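# all distinct keywords, lower-cased (lower-case first so that case variants collapse)
a <- unique(tolower(unlist(strsplit(df$keywords, " "))))

# frequency of every word across all questions
dfcounter <- data.frame(table(tolower(unlist(strsplit(df$Q1, " ")))), stringsAsFactors = FALSE)

# keep only the rows whose word is one of the keywords
dfcounter[match(a, dfcounter$Var1, nomatch = 0), ]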
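
One thing to watch with splitting on a single space is that punctuation stays attached, so "messaging?" would not match the keyword "messaging". Below is a rough base R sketch of the same pooled count that splits on any non-alphanumeric character instead. The helper `tokenize` is a hypothetical name, and the sample strings are copied from the question's dput() output (the row pairing does not matter here, since this approach pools all rows).

```r
# Sample strings copied from the question's dput() output
questions <- c(
  "How do you assess effects of strategic deterrence messaging?",
  "Determine Strategic Implications of Climate Change to USG/DoD"
)
keywords <- c(
  "Deterrence messaging effects perception assessment",
  "Climate Change Strategic Global Warming Strategic Climate Change Policy Global Warming Policy"
)

# Hypothetical helper: lower-case, then split on anything that is not a letter or digit
tokenize <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  tokens[nzchar(tokens)]
}

kw <- unique(tokenize(keywords))
q_tokens <- tokenize(questions)

# Frequency of each keyword across all questions (0 if it never occurs)
counts <- table(factor(q_tokens[q_tokens %in% kw], levels = kw))
counts
```

Using a factor with `levels = kw` keeps keywords that never occur in the table with a count of 0, which makes it easier to turn the counts into percentages afterwards.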