Extracting the first line of multiple txt documents and saving it as metadata/a vector in R

Time: 2018-08-04 13:48:33

Tags: r metadata text-analysis

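I am new to R and am currently using the quanteda package for text analysis. For a topic model over time, I need metadata for the txt files I am working with. The first line of each of my documents contains a date, and I would like to extract that date in a way that keeps it linked to its document. I am able to extract the date from the first line of a single document, which is structured like "Date: April 23, 1980", by using the following code: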

fileName <- "C:/Users/fischer/project/_Los_Angeles_Times_The_New_York_Times_The_Was2018-01-14_01-01.txt"
# read the whole file into a single character string
mytxt <- readChar(fileName, file.info(fileName)$size)
# check that the "date:" label occurs in the text
regmatches(mytxt, regexec("date:", mytxt))
# extract every "date:" label followed by a two-digit/two-digit/four-digit date
date <- regmatches(mytxt, gregexpr(
  "date:[0-9]{2}/[0-9]{2}/[0-9]{4}", mytxt))

R returns "Date: April 23, 1980".

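What I have not managed to do is apply this to multiple documents in a directory and save the output as a variable/vector that I can use as metadata with the functions of the quanteda package.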

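1 answer:

Answer 0: (score: 0)

Below is one option for doing what you want. Depending on the number and size of your documents, you may have to process them sequentially rather than storing them all in a list at once. If the approach below does not work for you, I cannot offer an alternative.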

# create some sample docs and write them in the working directory as txt files
docs = list(
  doc1 = c("abc2000", "def")
  ,doc2 = c("ghi2001", "jkl")
  ,doc3 = c("mno2002", "pqr")
)
for (i in names(docs)) {
  writeLines(docs[[i]], paste0(i, ".txt"))
}
# read the documents with the specified doc name into a list
# use list.files with a pattern for this, maybe with full.names = TRUE in your case
# sapply with USE.NAMES = TRUE and simplify = FALSE returns a list named after the files
# you could set n = 1 in readLines to read only the first line
# since you will probably read in the full document at some point in your code anyway,
# the full text is read in here
docs = sapply(list.files(pattern = "^doc\\d\\.txt$"), readLines, USE.NAMES = TRUE, simplify = FALSE)
# then access the first element of each list and apply your extraction function
# as an example I simply extract digits
lapply(docs, function(x) {
  first_line = x[1]
  gsub("\\D", "", first_line)
})
# $doc1.txt
# [1] "2000"
# 
# $doc2.txt
# [1] "2001"
# 
# $doc3.txt
# [1] "2002"
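
If only the dates are needed, the same idea can be sketched so that just the first line of each file is read and the result is a named vector. This is a minimal sketch under assumptions, not the asker's actual setup: the sample files, the directory name `date_docs`, and the `date:dd/mm/yyyy` first-line format (inferred from the regular expression in the question, including the day/month order) are all hypothetical.

```r
# Hypothetical setup: write three small sample files into a temporary directory.
# The "date:dd/mm/yyyy" first line is an assumption based on the question's regex.
doc_dir <- file.path(tempdir(), "date_docs")
dir.create(doc_dir, showWarnings = FALSE)
samples <- list(
  a.txt = c("date:23/04/1980", "first body line"),
  b.txt = c("date:01/12/1995", "second body line"),
  c.txt = c("no date here", "third body line")
)
for (nm in names(samples)) {
  writeLines(samples[[nm]], file.path(doc_dir, nm))
}

# Collect the files; full.names = TRUE keeps the directory in each path.
files <- list.files(doc_dir, pattern = "\\.txt$", full.names = TRUE)

# Read only the first line of each file (n = 1); an empty file yields "".
read_first_line <- function(f) {
  x <- readLines(f, n = 1, warn = FALSE)
  if (length(x) == 0) "" else x
}
first_lines <- vapply(files, read_first_line, character(1))
names(first_lines) <- basename(files)

# Pull out the two-digit/two-digit/four-digit part; lines without a match stay NA.
date_pattern <- "[0-9]{2}/[0-9]{2}/[0-9]{4}"
hits <- regexpr(date_pattern, first_lines)
date_strings <- rep(NA_character_, length(first_lines))
date_strings[hits > 0] <- regmatches(first_lines, hits)
names(date_strings) <- names(first_lines)

# Convert to Date objects; the day/month order is an assumption.
doc_dates <- as.Date(date_strings, format = "%d/%m/%Y")
names(doc_dates) <- names(date_strings)
print(doc_dates)
```

Because `doc_dates` is named after the files, it stays linked to the documents. With quanteda, a vector like this could then presumably be attached as a document variable, for example via `docvars(corp, "date") <- doc_dates`, provided the corpus `corp` (a hypothetical name) holds the documents in the same order.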