Question

I have a tm corpus object like this:

> summary(corp.eng)
A corpus with 154 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID

The metadata for each document in the corpus looks like this:

> meta(corp.eng[[1]])
Available meta data pairs are:
  Author       : 
  DateTimeStamp: 2013-04-18 14:37:24
  Description  : 
  Heading      : 
  ID           : Smith-John_e.txt
  Language     : en_CA
  Origin       :

I know I can set the author for one document at a time:

meta(corp.eng[[1]],tag="Author") <-  
  paste(
    rev(
      unlist(
        strsplit(meta(corp.eng[[1]],tag="ID"), c("[-_]"))
      )[1:2]
    ), collapse=' ')

which gives me this result:

> meta(corp.eng[[1]],tag="Author")
[1] "John Smith"

How can I do this as a batch operation for every document in the corpus?

Answer 1

Note: this should probably still be a comment, but it has some working parts, so here is an example:

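library(tm)   # provides the crude example corpus and meta()
data(crude)
# Read the document-level "Places" tag from every document
extracted.values <- meta(crude,tag="Places",type="local")
# Write a transformed value back, one document at a time
for (i in seq_along(extracted.values)) {
     meta(crude[[i]],tag="Places") <- substr(extracted.values[[i]],1,3)
}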

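It should also be possible to do this with lapply, but since I'm not familiar with the inner workings of tm, I'll stick with the loop. Replace the substr function with whatever function you need, and of course change the tag on the left-hand side as well. Hope this helps.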
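Applied to the question's case, the same loop might look like the sketch below. The helper names author_from_id and set_authors are made up for illustration, and the tag names "ID" and "Author" are taken from the question's output; newer tm releases appear to use lowercase tags such as "id" and "author", so adjust them to whatever meta() reports for your version.

```r
# Hypothetical helper: turn an ID like "Smith-John_e.txt" into "John Smith".
# Uses the same split-and-reverse logic as the one-document example in the question.
author_from_id <- function(id) {
  parts <- unlist(strsplit(id, "[-_]"))[1:2]  # first two pieces: last name, first name
  paste(rev(parts), collapse = " ")
}

# Hypothetical wrapper: set the Author tag of every document from its ID.
# Assumes tm is loaded and that the tag names match the installed tm version.
set_authors <- function(corpus) {
  for (i in seq_along(corpus)) {
    doc.id <- meta(corpus[[i]], tag = "ID")
    meta(corpus[[i]], tag = "Author") <- author_from_id(doc.id)
  }
  corpus
}
```

With the corpus from the question this would be called as corp.eng <- set_authors(corp.eng). Keeping the string parsing in its own function makes it easy to check on a few IDs before touching the corpus.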
