Question

I have a column in my data frame (df) that looks like this:

> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"     
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"                
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

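The column contains more than 4,000 unique first/last/nick names, stored as a comma-separated list of full names in each row, as shown above. I would like to create a DocumentTermMatrix for this column in which full names are matched as whole terms and only the most frequently occurring names are used as columns. I tried the following code: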

> people_list = strsplit(people, ", ")

> corp = Corpus(VectorSource(people_list))

> dtm = DocumentTermMatrix(corp, people_dict)

where people_dict is a list of the most frequently occurring people (about 150 full names) from people_list, like this:

> people_dict[1:3]
[[1]]
[1] "Christian Slater"

[[2]]
[1] "Tara Reid"

[[3]]
[1] "Stephen Dorff"

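However, the DocumentTermMatrix function does not seem to use people_dict at all, because the result has far more columns than people_dict has entries. I also think DocumentTermMatrix is splitting each name string into several strings. For example, "Danny Devito" becomes one column for "Danny" and another for "Devito".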

> inspect(actors_dtm[1:5,1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity           : 100%
Maximal term length: 9
Weighting          : term frequency (tf)

    Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
   1   0      0     0        0         0       0        0    0    0       0
   2   0      0     0        0         0       0        0    0    0       0
   3   0      0     0        0         0       0        0    0    0       0
   4   0      0     0        0         0       0        0    0    0       0
   5   0      0     0        0         0       0        0    0    0       0

I have read all of the tm documentation I could find, and I have spent hours searching Stack Overflow for a solution. Please help!

Answer 1

The default tokenizer splits the text into individual words. You need to supply a custom tokenizer function:

commasplit_tokenizer <- function(x)
unlist(strsplit(as.character(x), ", "))
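
As a quick check of what this tokenizer returns, it can be called directly on one of the strings from the question. This is a minimal sketch in base R only; the variable name sample_row is made up for illustration.

```r
# Tokenizer from the answer: split on comma + space so each full name stays whole
commasplit_tokenizer <- function(x)
  unlist(strsplit(as.character(x), ", "))

# sample_row is a hypothetical name; the string is row 2 from the question
sample_row <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"

tokens <- commasplit_tokenizer(sample_row)
print(tokens)
```

Each element of the returned vector is one full name, which is what lets a dictionary of full names match.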

Note that you should not split the actors apart before creating the corpus.

people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"     
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"                
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")
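
With the full data, the dictionary would probably be derived rather than typed by hand. The following is one possible sketch, assuming "most frequent" means the highest counts after splitting on ", ". The names all_names, name_counts and top_n are illustrative, and rows 4 and 5 are hypothetical additions so that the counts differ (the question uses roughly 150 names; top_n = 2 keeps the example small).

```r
people <- c(
  "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
  "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
  "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer",
  "Uma Thurman, Nia Long",   # hypothetical row, not from the question
  "Uma Thurman"              # hypothetical row, not from the question
)

# Flatten every row into one vector of full names
all_names <- unlist(strsplit(people, ", "))

# Count how often each full name occurs, most frequent first
name_counts <- sort(table(all_names), decreasing = TRUE)

# Keep the top_n names as a plain character vector
top_n <- 2
people_dict <- names(name_counts)[seq_len(top_n)]
print(people_dict)
```

A plain character vector is used here because that is the form the dictionary takes in this answer, rather than the list shown in the question.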

The control options did not work with Corpus, so I used VCorpus:

corp = VCorpus(VectorSource(people))
dtm = DocumentTermMatrix(corp, control = list(tokenize = 
commasplit_tokenizer, dictionary = people_dict, tolower = FALSE))

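All options are passed inside control, including:

tokenize - the custom tokenizer function
dictionary - the vector of full names to keep as terms
tolower = FALSE - keeps the original capitalization so terms match the dictionary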

Result:

as.matrix(dtm)
Terms
Docs Nia Long Stephen Dorff Uma Thurman
   1        0             1           0
   2        1             0           0
   3        0             0           1
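
The counts can also be cross-checked without tm. The sketch below uses base R only and is not how DocumentTermMatrix works internally; it simply counts dictionary names per row. The names terms and counts are illustrative.

```r
people <- c(
  "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
  "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
  "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
)
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")

# Use sorted terms so the columns come out in alphabetical order
terms <- sort(people_dict)

# For each row, count occurrences of every dictionary name (zero if absent)
counts <- t(vapply(
  strsplit(people, ", "),
  function(tokens) as.integer(table(factor(tokens, levels = terms))),
  integer(length(terms))
))
dimnames(counts) <- list(Docs = as.character(seq_along(people)), Terms = terms)

print(counts)
```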

I hope this helps.

Using DocumentTermMatrix on a vector of first and last names
