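I want to write tags to documents inside a corpus. The tags are stored in a data frame outside the corpus, keyed by a unique document ID.
The challenge: (1) take each ID from the data frame, (2) find the corresponding document in the corpus, (3) set the tag from the data frame on the corpus document with that ID.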
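To make the intent of the three steps concrete, here is a minimal sketch in base R that does not use tm at all. The `docs` list and the `tags` data frame are hypothetical stand-ins (a plain list of documents, each with its own metadata list), so this only illustrates the lookup logic, not how a tm corpus stores metadata.

```r
# Hypothetical stand-in for a corpus: each document carries its own metadata list.
docs <- list(
  list(text = "first",  meta = list(someID = "a16")),
  list(text = "second", meta = list(someID = "b17")),
  list(text = "third",  meta = list(someID = "c18"))
)

# External tags, deliberately in a different order than the documents.
tags <- data.frame(someID  = c("c18", "a16"),
                   someTag = c("x", "h"),
                   stringsAsFactors = FALSE)

# IDs of the documents, in corpus order.
doc_ids <- vapply(docs, function(d) d$meta$someID, character(1))

for (i in seq_len(nrow(tags))) {          # step 1: take each ID from the data frame
  hit <- which(doc_ids == tags$someID[i]) # step 2: locate the matching document(s)
  for (j in hit) {
    docs[[j]]$meta$someTag <- tags$someTag[i]  # step 3: write the tag
  }
}
```

Documents whose ID does not appear in `tags` (here the second one) are simply left without a `someTag` entry.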
library("tm")
someID <- paste(letters[1:15], 16:30, sep="")
someTag <- sample(c("a","x","g","h","e"), 15, replace=TRUE)
data(crude) # a corpus with 20 docs
meta(crude, type="local", tag="someID") <- someID # adding some additional IDs to the corpus
mydf <- data.frame(cbind(someTag, someID)) # Creating a dataframe with similar IDs
mydf <- mydf[sample(nrow(mydf)),] # permutation of elements (rows)
rownames(mydf) <- 1:15 # overwriting the rownames
############################################
# doesn't work - my try - pseudocode
for (i in 1:nrow(mydf)){
meta(crude[which(crude$someID==mydf$someID[i])], tag="someTag", type="local") <- mydf$someTag[i]
}
############################################
# What the data looks like:
mydf
# R output:
> mydf
   someTag someID
1        h    l27
2        x    g22
3        h    d19
4        a    e20
5        h    i24
6        x    j25
7        h    o30
8        x    n29
9        e    h23
10       x    m28
11       h    k26
12       e    c18
13       a    a16
14       e    b17
15       x    f21
meta(crude[1], type="local")
# R output:
> meta(crude[1], type="local")
Available meta data pairs are:
Author :
DateTimeStamp: 1987-02-26 17:00:56
Description :
Heading : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
ID : 127
Language : en
Origin : Reuters-21578 XML
User-defined local meta data pairs are:
$TOPICS
[1] "YES"
$LEWISSPLIT
[1] "TRAIN"
$CGISPLIT
[1] "TRAINING-SET"
$OLDID
[1] "5670"
$Topics
[1] "crude"
$Places
[1] "usa"
$People
character(0)
$Orgs
character(0)
$Exchanges
character(0)
$someID
[1] "a16"
Thanks for your help (;
Answer 0 (score: 3)
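According to ?meta,
meta(crude, type="local", tag="someID") <- someID
assigns the metadata tag someID at the level of the individual documents. What you want is to create the tag at the collection level. To do that, you manipulate the DMetaData
attribute of the corpus. You can do it like this:
meta(crude, type="indexed", tag="someID") <- someID
but I find it easier to use the accessor directly:
DMetaData(crude)$someID <- someID
(this works at least for corpora of type VCorpus). With this adjustment: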
library("tm")
someID <- paste(letters[1:15], 16:30, sep="")
someTag <- sample(c("a","x","g","h","e"), 15, replace=TRUE)
data(crude) # a corpus with 20 docs
# Need to be sure to allocate full tag and id set.
DMetaData(crude)$someID <- c(someID,rep(NA,5))
DMetaData(crude)$someTag <- rep(NA,20)
mydf <- data.frame(cbind(someTag, someID), stringsAsFactors=FALSE) # Creating a dataframe with similar IDs
mydf <- mydf[sample(nrow(mydf)),] # permutation of elements (rows)
rownames(mydf) <- 1:15 # overwriting the rownames
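# For each row of mydf, write its tag into the metadata row with the same ID.
# which() drops the NA comparisons produced by the five documents without an ID.
for (i in seq_len(nrow(mydf))){
  DMetaData(crude)$someTag[which(DMetaData(crude)$someID==mydf$someID[i])] <- mydf$someTag[i]
}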
Result:
> DMetaData(crude)
   MetaID someID someTag
1       0    a16       a
2       0    b17       h
3       0    c18       g
4       0    d19       a
5       0    e20       e
6       0    f21       a
7       0    g22       x
8       0    h23       g
9       0    i24       h
10      0    j25       e
11      0    k26       x
12      0    l27       a
13      0    m28       a
14      0    n29       h
15      0    o30       a
16      0   <NA>    <NA>
17      0   <NA>    <NA>
18      0   <NA>    <NA>
19      0   <NA>    <NA>
20      0   <NA>    <NA>
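As a side note, the loop can probably be replaced by a single vectorised lookup with base R's match(). The sketch below is only an illustration under assumptions: `dmeta` is a hypothetical plain data frame standing in for DMetaData(crude) (20 rows, the last five without an ID), and set.seed() is added just to make the random tags reproducible.

```r
# Hypothetical stand-in for DMetaData(crude): one row per document,
# the last five documents have no ID.
ids   <- paste0(letters[1:15], 16:30)
dmeta <- data.frame(someID = c(ids, rep(NA, 5)), stringsAsFactors = FALSE)

# External tag table in shuffled row order, as in the question.
set.seed(1)  # assumption: only here for reproducibility
mydf <- data.frame(someTag = sample(c("a", "x", "g", "h", "e"), 15, replace = TRUE),
                   someID  = ids,
                   stringsAsFactors = FALSE)
mydf <- mydf[sample(nrow(mydf)), ]

# For every document ID, match() gives the row of mydf holding that ID
# (NA where the document has no ID or no matching row).
idx <- match(dmeta$someID, mydf$someID)

# Indexing with NA yields NA, so untagged documents end up with an NA tag.
dmeta$someTag <- mydf$someTag[idx]
```

With the accessor used above, the same idea would amount to assigning `mydf$someTag[match(DMetaData(crude)$someID, mydf$someID)]` to `DMetaData(crude)$someTag`, which also removes the need to pre-allocate the column with NAs.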