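I am new to R and am currently using the quanteda package for text analysis. To build topic models over time, I need metadata for the txt files I am working with. The first line of each document contains a date, and I would like to extract it in a way that keeps it linked to its document. I was able to extract the date from the first line of a single document, which consists of the label "date:" followed by the date (April 23, 1980 in this example), using the following code: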
# path to a single document
fileName <- "C:/Users/fischer/project/_Los_Angeles_Times_The_New_York_Times_The_Was2018-01-14_01-01.txt"
# read the whole file into one character string
mytxt <- readChar(fileName, file.info(fileName)$size)
# locate the "date:" label
regmatches(mytxt, regexec("date:", mytxt))
# extract the label together with the numeric date that follows it
date <- regmatches(mytxt, gregexpr(
  "date:[0-9]{2}/[0-9]{2}/[0-9]{4}", mytxt))
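R returns the matched string: the "date:" label followed by the date (April 23, 1980).
What I have not managed to do is apply this to multiple documents in a directory and save the output as a variable/vector that I can use as metadata for the functions of the quanteda package.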
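Answer 0 (score: 0)
Below is one option for doing what you want. Depending on the number and size of your documents, you may have to process them sequentially instead of holding them all in a list at once. If the approach below does not work for you, I cannot offer an alternative.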
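The sequential route mentioned above could look roughly like this. It is only a sketch, assuming the date always sits on the first line: it reads just that one line from each file, so the full texts never have to be in memory together. The helper name `first_lines_of` and the temporary sample files are made up for illustration.

```r
# hypothetical sample files in a temporary directory (stand-ins for your txt files)
tmp_dir <- file.path(tempdir(), "seq_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("date:23/04/1980", "some text"), file.path(tmp_dir, "a.txt"))
writeLines(c("date:01/12/1985", "more text"), file.path(tmp_dir, "b.txt"))

# hypothetical helper: read only the first line of each file, one file at a time
# (assumes every file has at least one line; an empty file makes vapply fail)
first_lines_of <- function(paths) {
  vapply(paths, function(p) readLines(p, n = 1), character(1))
}

files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
first_lines <- first_lines_of(files)
# name the result by file so each line stays linked to its document
names(first_lines) <- basename(files)
first_lines
```

Your own extraction function can then be applied to `first_lines` instead of to the full documents.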
# create some sample docs and write them in the working directory as txt files
docs = list(
  doc1 = c("abc2000", "def"),
  doc2 = c("ghi2001", "jkl"),
  doc3 = c("mno2002", "pqr")
)
for (i in names(docs)) {
  writeLines(docs[[i]], paste0(i, ".txt"))
}
# read the documents into a list named after the files
# use list.files with a pattern for this, maybe with full.names = TRUE in your case
# you can use the side effect of sapply/USE.NAMES to generate a named list
# you might set n = 1 in readLines to read only the first line,
# but since you probably read the full document at some point in your code anyway,
# the full text is read in here
docs = sapply(list.files(pattern = "^doc\\d\\.txt$"), readLines, USE.NAMES = TRUE, simplify = FALSE)
# then access the first element of each list and apply your extraction function
# as an example I simply extract digits
lapply(docs, function(x) {
  first_line = x[1]
  gsub("\\D", "", first_line)
})
# $doc1.txt
# [1] "2000"
#
# $doc2.txt
# [1] "2001"
#
# $doc3.txt
# [1] "2002"
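To tie this back to the question, the digit extraction can be swapped for the date pattern from the question. The following is a sketch, assuming each first line looks like `date:23/04/1980` with the day before the month. That order is a guess; change the `format` string if your files put the month first. The sample files and the names `extract_date`, `doc_dates` and `meta` are illustrative only.

```r
# hypothetical sample files in a temporary directory
tmp_dir <- file.path(tempdir(), "date_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("date:23/04/1980", "first article"), file.path(tmp_dir, "doc1.txt"))
writeLines(c("date:01/12/1985", "second article"), file.path(tmp_dir, "doc2.txt"))

# read every file into a list named after the file
files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
docs <- sapply(files, readLines, USE.NAMES = TRUE, simplify = FALSE)
names(docs) <- basename(files)

# hypothetical helper: pull the numeric date out of the first line, NA if absent
extract_date <- function(x) {
  m <- regmatches(x[1], regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", x[1]))
  if (length(m) == 0) NA_character_ else m
}

date_strings <- vapply(docs, extract_date, character(1))
# assumption: day/month/year order
doc_dates <- as.Date(date_strings, format = "%d/%m/%Y")

# one row per document: file name, parsed date, and the text without the date line
meta <- data.frame(
  doc  = names(docs),
  date = doc_dates,
  text = vapply(docs, function(x) paste(x[-1], collapse = "\n"), character(1)),
  stringsAsFactors = FALSE
)
meta
```

With quanteda installed, a data frame like `meta` should be usable as `quanteda::corpus(meta, docid_field = "doc", text_field = "text")`, which keeps the remaining columns (here `date`) as document variables.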