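How do I split text into a vector in which each entry is the index value assigned to each unique word?

Time: 2019-02-07 14:42:48

Tags: r dplyr word stringi

Suppose I have a document containing SO-like text: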

doc <- 'Questions with similar titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'

Then I can create a data frame with one row per word:

library(stringi)
dfall <- data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc))))

We'll add a column holding each word's unique ID. To get the IDs, remove the duplicates:

library(dplyr)
uniquedf <- distinct(data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(doc)))))

I'm struggling to match rows between the two data frames so that the row index from uniquedf becomes the value of a new column in dfall:

alldf <- dfall %>% mutate(id = which(uniquedf$words == words))

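A dplyr approach like this doesn't work.

Is there a more efficient way?

To give a simpler example showing the expected output, I want a data frame that looks like this: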

  words id
1    to  1
2   row  2
3   zip  3
4   zip  3

My starting word vector is doc <- c('to', 'row', 'zip', 'zip') or doc <- c('to row zip zip'). The id column adds a unique ID for each unique word.

1 Answer:

Answer 0: (score: 2)

A cheap way using sapply
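The per-word loop matters because comparing the two columns directly does not line up one word with one ID. A minimal sketch on the question's simplified words, in base R only (the names uniq, bad and good are illustrative, not from the question):

```r
# Toy input from the question's simplified example
words <- c("to", "row", "zip", "zip")
uniq  <- unique(words)   # "to" "row" "zip"

# Whole-vector comparison: the lengths differ (3 vs 4), so R recycles the
# shorter vector and warns; which() then returns positions of TRUE values,
# not one ID per word, so the result has the wrong length for a new column.
bad <- suppressWarnings(which(uniq == words))
length(bad)   # 3, although there are 4 words

# Per-word comparison: one lookup per element, so the result has one ID per word.
good <- sapply(words, function(w) which(uniq == w), USE.NAMES = FALSE)
good          # 1 2 3 3
```

Inside mutate() the same length mismatch is presumably what makes the question's attempt fail, since a new column must have one value per row.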

Data

doc <- 'Questions with with titles have frequently been downvoted and/or closed. Consider using a title that more accurately describes your question.'

Function

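# Assumes dfall and uniquedf were rebuilt from this doc as in the question.
# For each row of dfall, look up the row index of its word in uniquedf.
alldf <- cbind(dfall, sapply(seq_len(nrow(dfall)), function(x) which(uniquedf$words == dfall$words[x])))

colnames(alldf) <- c("words", "id")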
> alldf
        words id
1   questions  1
2        with  2
3        with  2
4      titles  3
5        have  4
6  frequently  5
7        been  6
8   downvoted  7
9         and  8
10         or  9
11     closed 10
12   consider 11
13      using 12
14          a 13
15      title 14
16       that 15
17       more 16
18 accurately 17
19  describes 18
20       your 19
21   question 20
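
As a possible vectorized alternative to the sapply loop, base R's match() returns, for each element, the position of its first match in a lookup vector, so matching the words against their de-duplicated version yields the IDs in one call. This is a sketch on the question's simplified example, not part of the original answer:

```r
# Toy input from the question's simplified example
words <- c("to", "row", "zip", "zip")

# unique() keeps first occurrences in order; match() gives each word's
# position in that de-duplicated vector, which serves as its ID.
ids <- match(words, unique(words))

alldf <- data.frame(words = words, id = ids)
print(alldf)
#   words id
# 1    to  1
# 2   row  2
# 3   zip  3
# 4   zip  3
```

The same expression should also work inside dplyr, for example dfall %>% mutate(id = match(words, unique(words))), which avoids building uniquedf separately.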