Extract multiple keywords from text and print them in a data frame

Time: 2018-09-25 13:51:54

Tags: r regex text dplyr

I have a data frame (called all_data) that looks like this:

Title         Text 
Title_1       Very interesting word_1 and also keyword_2
Title_2       hello keyword_1, and keyword_3. 

I also have a second data frame (called keywords) that looks like this:

keywords
word_1
word_2
word_3
word_4a word_4b word_4c

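I want to create an extra column in the all_data data frame. In this column, if one of the keywords (from the keywords data frame) occurs in the all_data$Text or all_data$Title column, I want to print the matching keyword(s). For example: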

Title         Text                                               Keywords
Title_1       Very interesting word_1 and also word_2, word_1.   word_1, word_2
Title_2       hello word_1, and word_3.                          word_1, word_3
Title_3       difficult! word_4b, and word_4a also word_4c       word_4a word_4b word_4c

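Note: each word should be printed only once in the all_data$Words column, not multiple times. The harder part for me is printing a multi-word keyword, e.g. "keyword_A Keyword_A1 Keyword_A3": it should appear only when all parts of the keyword occur in the relevant text.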

This question was answered here (Recognize patterns in column, and add them to column in Data frame), where I use DJack's solution:

ls <- strsplit(tolower(paste(all_data$Title, all_data$Text)),"(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)    

all_data$Keywords <- do.call("rbind",lapply(ls,function(x) paste(unique(x[x %in% tolower(keywords)]), collapse = ", ")))

However, this fails when a keyword consists of several words (if the text is "Hey, your grandma is nice, and old", the keyword "old grandma" should be found).
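A minimal reproduction of this failure, using a made-up sentence and keyword vector (both are assumptions for illustration, not the real data): matching single tokens with %in% can never return a keyword that contains a space.

```r
# Illustrative data only: one sentence and two keywords, one of them multi-word.
text <- "Hey, your grandma is nice, and old"
kw   <- c("grandma", "old grandma")

# Same tokenisation pattern as in the solution above.
tokens <- strsplit(tolower(text), "(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)[[1]]

# No single token can equal "old grandma", so only "grandma" is returned.
found <- unique(tokens[tokens %in% tolower(kw)])
print(found)
```

Both "old" and "grandma" are present as tokens, but the comparison is done token by token, so the two-word keyword is never tested as a whole.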

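Update

@Nicolas2 helped me with a solution (thanks). Unfortunately, it fails. Does anyone have an idea how to fix this? As the example below shows, the keyword "feyenoord skin", for instance, should not appear (because "skin" does not occur in the text). I only want a keyword to appear if it occurs in the text (or, for a keyword of several words such as "Hello World", if all of its words, i.e. Hello and World, occur in the text). Many thanks!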

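library(dplyr)    # mutate, group_by, filter, inner_join
library(tidyr)    # unnest, nest
library(stringr)  # str_split

df <- data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5", "Title_6"), 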
                 Text=c("Very interesting word_1 and also word_2, word_1.", 
                        "hello word_1, and word_3.", 
                        "difficult! word_4b, and word_4a also word_4c", 
                        "A bit of word_1, some word_4a, and mostly word_3", 
                        "nothing interesting here", 
                        "Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one help(430) to measure feyenoord or feyenoord components and to determine a feyenoord sampling bmw. The word car is rstudio, at least in part, using the feyenoord sampling bmw. The feyenoord sampling bmw may be rstudio, at least in part, using a feyenoord volume (640) and/or a feyenoord generation bmw, both of which may be python or prerstudio."), 
                 stringsAsFactors=F) 


keywords<-data.frame(Keyword=c("word_1","word_2","word_3","word_4a word_4b word_4c", 
                               "a feyenoord sense", 
                               "feyenoord", "feyenoord feyenoord", "feyenoord skin", "feyenoord collection", 
                               "skin feyenoord", "feyenoord collector", "feyenoord bmw", 
                               "collection feyenoord", "concentration feyenoord", "feyenoord sample",
                               "feyenoord stimulation", "analyte feyenoord", "collect feyenoord", 
                               "feyenoord collect", "pathway feyenoord feyenoord sandboxs", 
                               "feyenoord bmw mouses", "sandbox", "bmw", 
                               "pulse bmw three levels"),stringsAsFactors=F) 

# split the keywords into words, but remember keyword length 
k <- keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest %>% 
  group_by(Keyword) %>% mutate(n=n()) %>% ungroup 
# split the title into words 
# compare with words from keywords 
# keep only possibly multiple, but full matches 
# collate all results and merge back to the original data 
test <- df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>% 
  inner_join(k,by="l") %>% 
  group_by(Title,Keyword) %>% filter(n()%%n==0) %>% 
  distinct(Keyword) %>% ungroup %>% nest(Keyword) %>% 
  rowwise %>% mutate(keywords=paste(data[[1]],collapse=", ")) %>% select(-data) %>% 
  inner_join(df,.,by="Title") 

View(test)

4 answers:

Answer 0 (score: 4)

If each keyword consists of a single word, so that "old grandma" would be the two keywords "old" and "grandma", how about a solution using a very nice text-analysis package such as tidytext:

library(dplyr)     
library(tidytext)  # text manipulation

First, since we need one word per row, we split all_data and keywords like this:

all_data_un <- all_data %>% unnest_tokens(word,Text)

> all_data_un
      Title        word
1   Title_1        very
1.1 Title_1 interesting
1.2 Title_1      word_1
1.3 Title_1         and
1.4 Title_1        also
1.5 Title_1      word_2
1.6 Title_1      word_1
2   Title_2       hello
2.1 Title_2      word_1
2.2 Title_2         and
2.3 Title_2      word_3
3   Title_3   difficult
3.1 Title_3     word_4b
3.2 Title_3         and
3.3 Title_3     word_4a
3.4 Title_3        also
....

all_keyword_un <- keywords %>% unnest_tokens(word,keywords)
colnames(all_keyword_un) <-'word' # rename the column

all_keyword_un
           word
1        word_1
2        word_2
3        word_3
4       word_4a
4.1     word_4b
4.2     word_4c
5             a
5.1   feyenoord
5.2       sense
6     feyenoord
7     feyenoord
7.1   feyenoord
8     feyenoord
8.1        skin
9     feyenoord
9.1  collection
10         skin
10.1  feyenoord
11    feyenoord
11.1  collector
12    feyenoord
12.1        bmw
13   collection
13.1  feyenoord
....

As you can see, unnest_tokens() removes punctuation and converts everything to lower case.

Now we can keep only the words that appear among the keywords:

all_data_un_fi <- all_data_un[all_data_un$word %in% all_keyword_un$word,]

> all_data_un_fi
       Title      word
1.2  Title_1    word_1
1.5  Title_1    word_2
1.6  Title_1    word_1
2.1  Title_2    word_1
2.3  Title_2    word_3
3.1  Title_3   word_4b
3.3  Title_3   word_4a
3.5  Title_3   word_4c
4    Title_4         a
4.3  Title_4    word_1
4.5  Title_4   word_4a
4.8  Title_4    word_3
6.2  Title_6     sense 
....

Last step: merge the data set with the keywords found in each sentence:

all_data %>%                                      # starting data
left_join(all_data_un_fi) %>%                     # join without dropping any sentence
group_by(Title,Text) %>%                          # group by title and text
summarise(keywords = paste(word, collapse =','))  # put all the keywords found into one cell


   Joining, by = "Title"
# A tibble: 6 x 3
# Groups:   Title [?]
  Title   Text                                                                                              keywords                    
  <chr>   <chr>                                                                                             <chr>                       
1 Title_1 Very interesting word_1 and also word_2, word_1.                                                  word_1,word_2,word_1        
2 Title_2 hello word_1, and word_3.                                                                         word_1,word_3               
3 Title_3 difficult! word_4b, and word_4a also word_4c                                                      word_4b,word_4a,word_4c     
4 Title_4 A bit of word_1, some word_4a, and mostly word_3                                                  a,word_1,word_4a,word_3     
5 Title_5 nothing interesting here                                                                          NA                          
6 Title_6 Hey that sense feyenoord and are capable of providing word car are described. The text (800) use~ sense,feyenoord,feyenoord,f~

With keywords made up of one or more words, so that the keyword for "old grandma" is "old grandma", you can do something like this:
library(stringr)
library(dplyr)

First, an empty list:

mylist <- list()

Then fill it with a loop: for each keyword, find the sentences that contain it:

for (i in keywords$keywords) {
  keyworded <- all_data %>% filter(str_detect(Text, i)) %>% mutate(keyword = i)
  mylist[[i]] <- keyworded
}

Put it into a data.frame:

df <- do.call("rbind",mylist)%>%data.frame()

Then group, pasting together the keywords of each sentence:

df %>% group_by(Title,Text) %>% summarise(keywords = paste(keyword,collapse=','))

# A tibble: 4 x 3
# Groups:   Title [?]
  Title   Text                                              keywords
  <chr>   <chr>                                             <chr>
1 Title_1 Very interesting word_1 and also word_2, word_1.  word_1,word_2
2 Title_2 hello word_1, and word_3.                         word_1,word_3
3 Title_4 A bit of word_1, some word_4a, and mostly word_3  word_1,word_3
4 Title_6 Hey that sense feyenoord and are capable of pro~  feyenoord,bmw,sense feye~

Note: duplicates are removed, as in the first sentence, and word_4a is not there because in keywords it only occurs inside a string together with other words.

With this data (note that I modified keywords, adding the key "sense feyenoord" at the end to test a two-word keyword):

all_data <- data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5", "Title_6"), 
                 Text=c("Very interesting word_1 and also word_2, word_1.", 
                        "hello word_1, and word_3.", 
                        "difficult! word_4b, and word_4a also word_4c", 
                        "A bit of word_1, some word_4a, and mostly word_3", 
                        "nothing interesting here", 
                        "Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one help(430) to measure feyenoord or feyenoord components and to determine a feyenoord sampling bmw. The word car is rstudio, at least in part, using the feyenoord sampling bmw. The feyenoord sampling bmw may be rstudio, at least in part, using a feyenoord volume (640) and/or a feyenoord generation bmw, both of which may be python or prerstudio."), 
                 stringsAsFactors=F) 

keywords<-data.frame(keywords = c("word_1","word_2","word_3","word_4a word_4b word_4c", 
                               "a feyenoord sense", 
                               "feyenoord", "feyenoord feyenoord", "feyenoord skin", "feyenoord collection", 
                               "skin feyenoord", "feyenoord collector", "feyenoord bmw", 
                               "collection feyenoord", "concentration feyenoord", "feyenoord sample",
                               "feyenoord stimulation", "analyte feyenoord", "collect feyenoord", 
                               "feyenoord collect", "pathway feyenoord feyenoord sandboxs", 
                               "feyenoord bmw mouses", "sandbox", "bmw", 
                               "pulse bmw three levels","sense feyenoord"), stringsAsFactors=F)

You can also mix the two approaches: get both results, then merge them or build a combination of them.

EDIT
There are several ways to merge them together; a simple one is to combine both results and keep only the unique keywords.

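As a rough sketch of one way to merge the two results and keep each keyword once per title: the two input data frames below are hypothetical stand-ins for the outputs of the two approaches, and the names single_word, multi_word and merged are made up for this example.

```r
# Hypothetical outputs of the two approaches for the same titles.
single_word <- data.frame(Title = c("Title_1", "Title_4"),
                          keywords = c("word_1,word_2,word_1", "a,word_1,word_4a,word_3"),
                          stringsAsFactors = FALSE)
multi_word  <- data.frame(Title = c("Title_1", "Title_4"),
                          keywords = c("word_1,word_2", "word_1,word_3"),
                          stringsAsFactors = FALSE)

# Stack both results, then per title: split on commas, de-duplicate, re-join.
both <- rbind(single_word, multi_word)
merged <- vapply(split(both$keywords, both$Title),
                 function(x) paste(unique(unlist(strsplit(x, ",", fixed = TRUE))),
                                   collapse = ","),
                 character(1))
print(merged)
```

The result is a named character vector with one entry per title. This only works if the keywords themselves never contain a comma.
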
Answer 1 (score: 1)

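library(dplyr)    # mutate, inner_join, distinct
library(tidyr)    # unnest, nest
library(stringr)  # str_split

df <- data.frame(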
   Title=c("Title_1","Title_2","Title_3","Title_4"),
   Text=c("Very interesting word_1 and also word_2, word_1.",
          "hello word_1, and word_3.",                     
          "difficult! word_4b, and word_4a also word_4c",
          "nothing interesting here"),stringsAsFactors=FALSE)

keywords<-data.frame(Keyword=c("word_1","word_2","word_3","word_4a word_4b word_4c"),stringsAsFactors=F)

df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>%
  inner_join(keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest, by="l") %>%
  select(-Keyword) %>% distinct %>% nest(l)
#    Title                                             Text                      data
#1 Title_1 Very interesting word_1 and also word_2, word_1.            word_1, word_2
#2 Title_2                        hello word_1, and word_3.            word_1, word_3
#3 Title_3     difficult! word_4b, and word_4a also word_4c word_4b, word_4a, word_4c

So the result is stored in a list column. To convert it to a string:

df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>%
  inner_join(keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest,by="l") %>%
  select(-Keyword) %>% distinct %>% arrange(l) %>% nest(l) %>%
  rowwise %>% mutate(keywords=paste(data[[1]],collapse=" ")) %>% select(-data)
## A tibble: 3 x 3
#  Title   Text                                             keywords               
#  <chr>   <chr>                                            <chr>                  
#1 Title_1 Very interesting word_1 and also word_2, word_1. word_1 word_2          
#2 Title_2 hello word_1, and word_3.                        word_1 word_3          
#3 Title_3 difficult! word_4b, and word_4a also word_4c     word_4a word_4b word_4c

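An upgraded version that removes partial matches when a keyword consists of several words and treats such a keyword as a single entity: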

df <- data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5"),
Text=c("Very interesting word_1 and also word_2, word_1.",
       "hello word_1, and word_3.",                     
       "difficult! word_4b, and word_4a also word_4c",
       "A bit of word_1, some word_4a, and mostly word_3",
       "nothing interesting here"),
  stringsAsFactors=F)
  keywords<-data.frame(Keyword=c("word_1","word_2","word_3","word_4a word_4b word_4c"),stringsAsFactors=F)

# split the keywords into words, but remember keyword length
k <- keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest %>%
   group_by(Keyword) %>% mutate(n=n()) %>% ungroup
# split the title into words
# compare with words from keywords
# keep only possibly multiple, but full matches
# collate all results and merge back to the original data
df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>%
   inner_join(k,by="l") %>%
   group_by(Title,Keyword) %>% filter(n()%%n==0) %>%
   distinct(Keyword) %>% ungroup %>% nest(Keyword) %>%
   rowwise %>% mutate(keywords=paste(data[[1]],collapse=", ")) %>% select(-data) %>%
   inner_join(df,.,by="Title")
#    Title                                             Text                keywords
#1 Title_1 Very interesting word_1 and also word_2, word_1.          word_1, word_2
#2 Title_2                        hello word_1, and word_3.          word_1, word_3
#3 Title_3     difficult! word_4b, and word_4a also word_4c word_4a word_4b word_4c
#4 Title_4    A bit word_1, some word_4a, and mostly word_3          word_1, word_3
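The check n() %% n == 0 counts matched rows rather than distinct words, so a single word that occurs an even number of times in a text can apparently satisfy a two-word keyword on its own (for example "feyenoord" occurring twice for "feyenoord skin"). A stricter rule, sketched here in base R as an assumption rather than as part of the answer above, requires every word of the keyword to be present. The inputs and the helper names tokenise and match_keywords are made up for illustration.

```r
# Illustrative inputs, modelled on the question's data.
texts <- c(Title_3 = "difficult! word_4b, and word_4a also word_4c",
           Title_6 = "feyenoord sampling bmw, using a feyenoord volume")
keywords <- c("word_4a word_4b word_4c", "feyenoord skin", "feyenoord bmw", "bmw")

# Split a text into lower-case words made of letters, digits and underscores.
tokenise <- function(x) {
  words <- strsplit(tolower(x), "[^[:alnum:]_]+")[[1]]
  words[nzchar(words)]
}

# A keyword matches only if every one of its words is a token of the text.
match_keywords <- function(text, keywords) {
  tokens <- tokenise(text)
  hit <- vapply(strsplit(tolower(keywords), " ", fixed = TRUE),
                function(parts) all(parts %in% tokens), logical(1))
  paste(keywords[hit], collapse = ", ")
}

result <- vapply(texts, match_keywords, character(1), keywords = keywords)
print(result)
```

Here "feyenoord skin" is rejected for Title_6 even though "feyenoord" occurs twice, because "skin" is not among the tokens.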

Answer 2 (score: 1)

I didn't bother optimizing anything and just did the simplest thing:

library(data.table)

setDT(df)
setDT(keywords)

keywords[, strsplit(Keyword, ' '), by = Keyword
       ][, c(.SD[, .(row = seq_len(nrow(df)), found = grepl(V1, df$Text)), by = V1],
             N = .N), by = Keyword
       ][, sum(found) == N[1], by = .(Keyword, row)
       ][, paste(Keyword[V1], collapse = ","), by = row]
#   row                                            V1
#1:   1                                 word_1,word_2
#2:   2                                 word_1,word_3
#3:   3                       word_4a word_4b word_4c
#4:   4                                 word_1,word_3
#5:   5                                              
#6:   6 a feyenoord sense,feyenoord,feyenoord bmw,bmw
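Note that grepl() matches substrings, so a keyword part such as "a" or "sandbox" is also found inside longer words. If whole-word matching is wanted, one option (an assumption here, not something the code above does) is to wrap each part in \\b word boundaries. The text and parts below are made up for illustration.

```r
text  <- "The feyenoord sampling bmw uses sandboxs"
parts <- c("bmw", "sandbox", "a")

# Plain substring matching: "sandbox" is found in "sandboxs", "a" in "sampling".
substring_hit <- vapply(parts, function(p) grepl(p, text, fixed = TRUE), logical(1))

# Whole-word matching: the part must be delimited by word boundaries.
word_hit <- vapply(parts,
                   function(p) grepl(paste0("\\b", p, "\\b"), text, perl = TRUE),
                   logical(1))

print(rbind(substring_hit, word_hit))
```

Parts that contain regular-expression metacharacters would need escaping before being pasted into the pattern.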

Answer 3 (score: 0)

Title <- c("A","B","C","A","A","B","A","A","B","C")
Text <- c("A",11,12,13,14,15,14,13,12,"hi")
df <- data.frame(Title,Text, stringsAsFactors=FALSE)

keywords <- c("A","B","hi")
keys <- data.frame(keywords,stringsAsFactors=FALSE)

This is one long chain of logic and will be hard to read, but it is really just a single mutate, clean and quick.

require(dplyr)
require(stringr)
df %>% mutate(Keywords = paste(str_c(keys$keywords[which(keys$keywords %in% 
df$Title)],collapse = ","),str_c(keys$keywords[which(!keywords %in% 
df$Title)] 
[which(keywords[which(!keywords %in% df$Title)] %in% df$Text)], 
collapse=","), 
sep=",")) -> df

Let me break it down: inside paste we have two terms. The first is

str_c(keys$keywords[which(keys$keywords %in% df$Title)],collapse = ",")

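This finds the keywords that occur in the $Title column. str_c is needed to join the keywords found into a single string; without it the result is a vector rather than a string, which leads to messy repetition. The next term is: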

str_c(keys$keywords[which(!keywords %in% df$Title)][which(keywords[which(!keywords 
%in% df$Title)] %in% df$Text)], collapse=",")

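This looks awful, but it picks the keywords that are not in $Title and are in $Text. The rather long logic is needed so that we do not repeat the keywords already seen in $Title. For the same reason as before, we use str_c to get a string as output. Pasting the two strings together then gives the desired output. Feel free to change to collapse = ", " and sep = ", " to add spaces.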
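Because %in% is evaluated against the whole $Title and $Text columns, the expression above yields one string that is recycled to every row. If one result per row is wanted, a per-row variant could look like the following sketch, which assumes that a keyword must equal the cell value exactly, as in the example data. The small df and keywords here are illustrative.

```r
df <- data.frame(Title = c("A", "B", "C"),
                 Text  = c("A", "11", "hi"),
                 stringsAsFactors = FALSE)
keywords <- c("A", "B", "hi")

# For each row, keep the keywords equal to that row's Title or Text, each once.
df$Keywords <- mapply(function(title, text) {
  paste(keywords[keywords %in% c(title, text)], collapse = ",")
}, df$Title, df$Text, USE.NAMES = FALSE)

print(df)
```

Indexing keywords directly means a keyword that appears in both columns of a row is still listed only once.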