R: get all word co-occurrences in a string - bigram - dplyr

Date: 2018-12-13 14:51:51

Tags: r dataframe dplyr nlp

I have a data frame in this format:

df <- data.frame(names = c('perform data cleansing', 'information categorisation', 'write batch record documentation'), stringsAsFactors = FALSE)
                             names
1           perform data cleansing
2       information categorisation
3 write batch record documentation

I would like to obtain one with all the co-occurrences:

                             names           tokens1              tokens2
1           perform data cleansing           perform                 data
1           perform data cleansing              data            cleansing 
1           perform data cleansing         cleansing              perform
2       information categorisation       information       categorisation
3 write batch record documentation             write                batch
3 write batch record documentation             write               record
3 write batch record documentation             write        documentation 
3 write batch record documentation             batch               record 
3 write batch record documentation             batch        documentation 
3 write batch record documentation            record        documentation 

So, for a string of n words, you will have n x (n-1) / 2 co-occurrences.
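The count can be checked with a minimal base R sketch. `count_pairs()` is a hypothetical helper introduced only for this illustration; it assumes words are separated by single spaces and compares the number of pairs produced by `combn()` with the formula.

```r
# Count unordered word pairs in a string and compare with n * (n - 1) / 2.
# count_pairs() is a hypothetical helper, not part of the original question.
count_pairs <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  n <- length(words)
  # combn() returns a matrix with one column per pair; it errors when n < 2,
  # so that case is handled separately.
  from_combn <- if (n >= 2) ncol(combn(words, 2)) else 0L
  c(n = n, from_combn = from_combn, from_formula = n * (n - 1) / 2)
}

count_pairs("perform data cleansing")            # 3 words, 3 pairs
count_pairs("information categorisation")        # 2 words, 1 pair
count_pairs("write batch record documentation")  # 4 words, 6 pairs
```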

1 Answer:

Answer 0: (score: 0)

We can split 'names' on spaces, loop through the resulting list elements, get the pairs with combn, and unnest:

library(tidyverse)

df %>%
  mutate(tokens = strsplit(names, " ") %>%
           map(~ .x %>% combn(m = 2, simplify = FALSE))) %>%
  unnest(tokens)

If we need two separate 'tokens' columns, paste the words of each combn pair together, unnest, and then separate 'tokens' into two columns by splitting at the delimiter that was used when pasting:

df %>%
  mutate(tokens = strsplit(names, " ") %>%
           map(~ .x %>%
                 combn(m = 2,
                       FUN = function(x) paste(x[1], x[2], sep = "-"),
                       simplify = FALSE))) %>%
  unnest(tokens) %>%   # one row per pair, still a list column
  unnest(tokens) %>%   # flatten to a character column
  separate(tokens, into = c('tokens1', 'tokens2'))
#                               names     tokens1        tokens2
#1             perform data cleansing     perform           data
#2             perform data cleansing     perform      cleansing
#3             perform data cleansing        data      cleansing
#4         information categorisation information categorisation
#5   write batch record documentation       write          batch
#6   write batch record documentation       write         record
#7   write batch record documentation       write  documentation
#8   write batch record documentation       batch         record
#9   write batch record documentation       batch  documentation
#10  write batch record documentation      record  documentation

Data

df <- structure(list(names = c("perform data cleansing",
    "information categorisation",
    "write batch record documentation")), class = "data.frame",
    row.names = c("1", "2", "3"))
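For comparison, the same table of pairs can be built without the tidyverse. This is only a sketch of an alternative, not the method used in the answer: `pairs_for()` is a hypothetical helper, and it assumes single-space separators and a character (not factor) 'names' column.

```r
# Base R sketch: one row per unordered word pair, per input string.
df <- data.frame(
  names = c("perform data cleansing",
            "information categorisation",
            "write batch record documentation"),
  stringsAsFactors = FALSE
)

# pairs_for() is a hypothetical helper, not part of the answer above.
pairs_for <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) return(NULL)  # combn() errors when n < m
  p <- combn(words, 2)                 # 2 x choose(n, 2) character matrix
  data.frame(names = s, tokens1 = p[1, ], tokens2 = p[2, ],
             stringsAsFactors = FALSE)
}

# Build the pairs for every string and stack them into one data frame.
result <- do.call(rbind, lapply(df$names, pairs_for))
rownames(result) <- NULL
print(result)
```

Keeping the two words as separate columns from the start avoids the paste and separate round trip, so words that themselves contain a hyphen would not be split by accident.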