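Using R, I need to extract tokens with ngram = 2 (bigrams) from a file.
The problem is that it merges the lines, so some tokens have one word from the end of a line and the other word from the start of the next line.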
Req_tok <- jobs %>% unnest_tokens(ngram, POSITION, token = "ngrams", n = 2)
The first two lines of the jobs file are:
it architect
it helpdesk support agents
The tokens I get are:
it architect
architect it
it helpdesk
and so on ....
What should I do to avoid getting tokens like "architect it"?
I want to tokenize each line separately.
Answer 0 (score: 0)
Just add collapse = FALSE to unnest_tokens:
library(tidytext)
library(dplyr)
jobs %>%
unnest_tokens(ngram, POSITION, token = "ngrams", n = 2, collapse = FALSE)
Result:
ngram
1 it architect
2 it helpdesk
2.1 helpdesk support
2.2 support agents
If POSITION is a factor variable, remember to convert it to a character vector first; otherwise unnest_tokens will throw an error.
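A minimal sketch of that conversion, using only base R. The factor column is created on purpose here with stringsAsFactors = TRUE to illustrate the situation; your own data may already arrive as a factor for other reasons.

```r
# Illustrative data: POSITION is deliberately stored as a factor here.
jobs <- data.frame(
  POSITION = c("it architect", "it helpdesk support agents"),
  stringsAsFactors = TRUE
)

# Check the column type, then convert the factor to plain character strings.
was_factor <- is.factor(jobs$POSITION)
jobs$POSITION <- as.character(jobs$POSITION)

# The column is now a character vector and can be passed to a tokenizer.
str(jobs)
```

as.character() returns the factor's labels, not its underlying integer codes, so the original text is preserved.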
Data:
jobs <- data.frame(POSITION = c("it architect", "it helpdesk support agents"), stringsAsFactors = FALSE)
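If the collapse argument behaves differently in your installed tidytext version (worth checking against its documentation), the same per-line result can be sketched in base R without it. The helper bigrams_per_line below is hypothetical, not part of tidytext, and it assumes words are separated by whitespace only, so punctuation is not stripped as a real tokenizer would do.

```r
# Hypothetical helper (not part of tidytext): word bigrams for ONE string.
# Assumes whitespace-separated words; punctuation is left untouched.
bigrams_per_line <- function(x) {
  words <- strsplit(tolower(trimws(x)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))  # too short for a bigram
  # Pair each word with its successor within the same line only.
  paste(head(words, -1), tail(words, -1))
}

jobs <- data.frame(
  POSITION = c("it architect", "it helpdesk support agents"),
  stringsAsFactors = FALSE
)

# Tokenize each row on its own, so no bigram can span two lines.
tokens <- lapply(jobs$POSITION, bigrams_per_line)

# Flatten into a data frame, keeping the source line number of each bigram.
result <- data.frame(
  line = rep(seq_along(tokens), lengths(tokens)),
  ngram = unlist(tokens),
  stringsAsFactors = FALSE
)
print(result)
```

Because each row is split independently, a cross-line pair such as "architect it" cannot be produced.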