I have a data frame in this format:
df <- data.frame(names = c('perform data cleansing', 'information categorisation', 'write batch record documentation'))
names
1 perform data cleansing
2 information categorisation
3 write batch record documentation
I would like to get a data frame with all the co-occurrences:
  names                             tokens1      tokens2
1 perform data cleansing            perform      data
1 perform data cleansing            data         cleansing
1 perform data cleansing            cleansing    perform
2 information categorisation        information  categorisation
3 write batch record documentation  write        batch
3 write batch record documentation  write        record
3 write batch record documentation  write        documentation
3 write batch record documentation  batch        record
3 write batch record documentation  batch        documentation
3 write batch record documentation  record       documentation
So, for a string of n words, you get n x (n-1) / 2 co-occurrences.
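As a quick check of that count, here is a minimal base R sketch. The helper name count_pairs is made up for illustration, and it assumes the string contains at least two space-separated words:

```r
# Hypothetical helper: compare the enumerated pair count with n * (n - 1) / 2
count_pairs <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  n <- length(words)
  # combn() returns a 2 x choose(n, 2) matrix, one column per unordered pair
  pairs <- combn(words, m = 2)
  c(n = n, enumerated = ncol(pairs), formula = n * (n - 1) / 2)
}

count_pairs("write batch record documentation")
```

For the four-word example, both the enumerated count and the formula give 6.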
Answer 0 (score: 0)
We can split 'names' on spaces, loop over the split elements, get the pairwise combinations with combn as a list, and then unnest:
library(tidyverse)
df %>%
  # split each name into words, then build every 2-word combination
  mutate(tokens = strsplit(names, " ") %>%
           map(~ .x %>%
                 combn(m = 2, simplify = FALSE))) %>%
  unnest(tokens)
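To make the intermediate step concrete, this small base R sketch shows what combn produces for a single string before any unnesting. The variable names words and pairs are only illustrative:

```r
# Split one name into its words
words <- strsplit("perform data cleansing", " ")[[1]]

# With simplify = FALSE, combn() returns a list with one
# character vector of length 2 per unordered pair
pairs <- combn(words, m = 2, simplify = FALSE)
str(pairs)
```

Each list element holds one pair, which is why the result has to be unnested (and, for two columns, split) afterwards.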
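If we need two separate 'tokens' columns, paste each combn pair of words together, unnest, and then separate the 'tokens' column into two by splitting at the delimiter used in paste:
df %>%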
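  mutate(tokens = strsplit(names, " ") %>%
           map(~ .x %>%
                 combn(m = 2, FUN = function(x)
                   paste(x[1], x[2], sep = "-"), simplify = FALSE))) %>%
  # tokens is a list of lists here, so it is unnested twice
  unnest(tokens) %>%
  unnest(tokens) %>%
  # split on the same "-" delimiter that paste() inserted
  separate(tokens, into = c('tokens1', 'tokens2'), sep = "-")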
#    names                             tokens1      tokens2
#1   perform data cleansing            perform      data
#2   perform data cleansing            perform      cleansing
#3   perform data cleansing            data         cleansing
#4   information categorisation        information  categorisation
#5   write batch record documentation  write        batch
#6   write batch record documentation  write        record
#7   write batch record documentation  write        documentation
#8   write batch record documentation  batch        record
#9   write batch record documentation  batch        documentation
#10  write batch record documentation  record       documentation
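For comparison, the same table can be sketched in base R without the tidyverse. This is an alternative illustration, not the approach of the answer above. It assumes every name contains at least two words, and the names pair_rows and result are made up here:

```r
df <- data.frame(names = c("perform data cleansing",
                           "information categorisation",
                           "write batch record documentation"),
                 stringsAsFactors = FALSE)

# Build one small data frame of pairs per name
pair_rows <- lapply(df$names, function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  p <- combn(words, m = 2)  # 2 x choose(n, 2) character matrix
  data.frame(names = s, tokens1 = p[1, ], tokens2 = p[2, ],
             stringsAsFactors = FALSE)
})

# Stack the per-name data frames and reset the row names
result <- do.call(rbind, pair_rows)
rownames(result) <- NULL
result
```

Using the matrix form of combn avoids the paste-then-separate round trip, since the first and second rows already hold the two tokens.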
Data
df <- structure(list(names = c("perform data cleansing",
"information categorisation",
"write batch record documentation")), class = "data.frame",
row.names = c("1", "2", "3"))