Question

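I'm having some trouble working with text files and their associated metadata. I can read the files in, pre-process them, and then convert them into the format required by the lda package I'm using (following this guide by Sievert). An example follows: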

#Reading the files
corpus <- file.path("Folder/Fiction/texts")
corpus <- list.files(corpus)
corpus <- lapply(corpus, readLines)
***pre-processing functions removed for space***
corp.list <- strsplit(corpus, "[[:space:]]+")
# compute the table of terms:
corpterm.table <- table(unlist(corp.list))
corpterm.table <- sort(corpterm.table, decreasing = TRUE)
***removing stopwords, again removed for space***
# now put the corpus into the format required by the lda package:
getCorp.terms <- function(x) {
  index <- match(x, vocabCorp)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
corpus <- lapply(corp.list, getCorp.terms)

At this point, the corpus variable is a list of document tokens, with a separate vector for each document, but each vector has been detached from its file path and file name. This is where my problem begins: I have a csv with metadata for the texts (their file name, title, author, year, genre, etc.) that I would like to associate with each vector of tokens, so that I can easily model my information over time, by gender, and so on.

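I'm not sure how to do this, but I'm guessing it needs to be done when the files are read in, rather than merged in after I've manipulated the document texts. I imagine it would look something like: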

corpus

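From there, a merge or match function would be used to associate each document (or vector of document tokens) with its correct row of metadata.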

Answer 1

Try changing it to:

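pth <- file.path("Folder/Fiction/texts")
# full.names = TRUE keeps the directory so readLines() can find each file
fi <- list.files(pth, full.names = TRUE)
corpus <- lapply(fi, readLines)
# (pre-processing from the question goes here)
corp.list <- strsplit(corpus, "[[:space:]]+")
# name each token vector after its file so it can be matched to the csv later
corp.list <- setNames(object = corp.list, nm = basename(fi))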
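Once the list is named by file, the metadata can be lined up with match(). The following is only a minimal sketch under some assumptions: the metadata has a column called filename (a hypothetical name, use whatever your csv actually calls it), and the small corpus and metadata table are made up here so that the example runs on its own. In practice meta would come from read.csv() on your own file.

```r
# Self-contained sketch: build a tiny corpus in a temporary directory.
pth <- file.path(tempdir(), "texts")
dir.create(pth, showWarnings = FALSE)
writeLines("call me ishmael", file.path(pth, "moby.txt"))
writeLines("it was the best of times", file.path(pth, "tale.txt"))

# Hypothetical metadata; the column names are assumptions, not from the question.
meta <- data.frame(
  filename = c("tale.txt", "moby.txt", "absent.txt"),
  author   = c("Dickens", "Melville", "Nobody"),
  year     = c(1859, 1851, 1900),
  stringsAsFactors = FALSE
)

# Read each file into one string, tokenize, and name the vectors by file name.
fi <- list.files(pth, full.names = TRUE)
docs <- vapply(fi, function(f) paste(readLines(f), collapse = " "), character(1))
corp.list <- setNames(strsplit(docs, "[[:space:]]+"), basename(fi))

# Reorder the metadata so that row i describes corp.list[[i]].
meta.aligned <- meta[match(names(corp.list), meta$filename), ]
stopifnot(identical(meta.aligned$filename, names(corp.list)))

# Subset documents by a metadata field, e.g. everything published before 1855.
early <- corp.list[meta.aligned$year < 1855]
```

Because match() returns NA for a file name that is missing from the csv, the stopifnot() line fails loudly if any document has no metadata row. The same aligned index can be applied to the lda-formatted list, as long as it is built with lapply() from corp.list and therefore keeps the same order.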

R: Combining a list with csv metadata
