I have a data frame in this format:
df <- data.frame(names= c('perform data cleansing','information categorisation'))
names
1 perform data cleansing
2 information categorisation
I am trying to get it into the following format, with one row per word:
names tokens
1 perform data cleansing perform
1 perform data cleansing data
1 perform data cleansing cleansing
2 information categorisation information
2 information categorisation categorisation
Answer 0 (score: 2)
I like tidyr::unnest:
library(dplyr)
library(tidyr)
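# Split each name on spaces into a list column, then expand it to one row per token.
# Naming the column explicitly avoids the warning that newer tidyr versions give for a bare unnest().
df %>% mutate(tokens = strsplit(as.character(names), split = " ")) %>%
  unnest(tokens)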
# names tokens
# 1 perform data cleansing perform
# 2 perform data cleansing data
# 3 perform data cleansing cleansing
# 4 information categorisation information
# 5 information categorisation categorisation
But you can also do it all in base R:
tokens = strsplit(as.character(df$names), split = " ")
result = data.frame(names = rep(df$names, lengths(tokens)),
                    tokens = unlist(tokens),
                    stringsAsFactors = FALSE)
# names tokens
# 1 perform data cleansing perform
# 2 perform data cleansing data
# 3 perform data cleansing cleansing
# 4 information categorisation information
# 5 information categorisation categorisation
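To make the alignment step in the base R approach explicit, here is a minimal sketch of the same idea. The `row` column and the helper variables `n_tokens` and `row_id` are my own additions for illustration, not part of the code above; the sample data simply mirrors the question.

```r
# Sample data mirroring the question (stringsAsFactors = FALSE keeps names as character)
df <- data.frame(names = c("perform data cleansing", "information categorisation"),
                 stringsAsFactors = FALSE)

# One character vector of words per row
tokens <- strsplit(df$names, split = " ")

# Number of words in each row
n_tokens <- lengths(tokens)

# Repeat each row index once per word so names and tokens stay aligned
row_id <- rep(seq_len(nrow(df)), times = n_tokens)

result <- data.frame(row = row_id,
                     names = df$names[row_id],
                     tokens = unlist(tokens),
                     stringsAsFactors = FALSE)
```

Keeping `row` as a column is one way to preserve the original row number (1, 1, 1, 2, 2) that the question shows in its desired output.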
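tidytext::unnest_tokens is a version with extra features for text analysis (by default it also lowercases the tokens and strips punctuation):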
df$names = as.character(df$names)
tidytext::unnest_tokens(df, output = tokens, input = names, drop = FALSE)
# names tokens
# 1 perform data cleansing perform
# 1.1 perform data cleansing data
# 1.2 perform data cleansing cleansing
# 2 information categorisation information
# 2.1 information categorisation categorisation
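One of those extra features matters if your text has mixed case: unnest_tokens lowercases tokens unless told otherwise. A small sketch, assuming the tidytext package is installed; the mixed-case data frame `df2` is a hypothetical example and not from the question.

```r
library(tidytext)

# Hypothetical mixed-case input
df2 <- data.frame(names = "Perform Data Cleansing", stringsAsFactors = FALSE)

# Default behaviour: tokens are lowercased
lowered <- unnest_tokens(df2, output = tokens, input = names, drop = FALSE)

# to_lower = FALSE keeps the original capitalisation
kept <- unnest_tokens(df2, output = tokens, input = names,
                      drop = FALSE, to_lower = FALSE)
```

If you only need a plain split on spaces, the tidyr or base R versions leave the text untouched.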