I have a corpus of 600 text files. I want to extract every number combination that follows the term mim, then build a document term matrix to find the frequencies per file. The extraction returns all the wanted terms, but the document term matrix built from them contains only 0. My corpus is a simple corpus of plain text files, and the code below is all I run on it.
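This is the code I run. When the documents are created as separate text files (1.txt, 2.txt, 3.txt), the extraction fails, as the all-zero result shows: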
library("tm")
library("stringr")
mim<-stringr::str_extract_all(DBcorp,"(mim)[[:blank:]]*[[:digit:]]+")
#extract numbers
mim<-stringr::str_extract_all(mim,"[[:digit:]]+")
#set the result as list + delete duplicated extracted terms
mim<-unique(unlist(mim[[1]]))
mim
[1] "608106" "606843" "103600" "231550"
class(mim)
[1] "character"
#document term matrix
dtm_mim <- DocumentTermMatrix(DBcorp, control=list(dictionary=mim))
# turn document term matrix into data.frame
df_mim <- data.frame(DOC = dtm_mim$dimnames$Docs, as.matrix(dtm_mim), row.names = NULL , check.names = FALSE)
df_mim
608106 606843 103600 231550
1.txt 0 0 0 0
2.txt 0 0 0 0
3.txt 0 0 0 0
By contrast, here is a sample of my data; when I build the corpus this way, directly in R, it works well:
docs = c(doc1 = "mim 608106 letters 123 mim 606843 letters 1 letters 123456789 ",
doc2 = "letters letters 1 mim 231550 123 letters",
doc3 = "mim 103600 letters 123456")
docs<-Corpus(VectorSource(docs))
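Answer 0 (score: 0)
Try the following code. If you want to apply functions to a tm corpus, it is best to use lapply (or tm_map) so that each document is processed separately. The resulting dtm then contains only the terms that appear in mim.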
# note the use of simplify = TRUE. This makes sure you do not get a warning in the line after this one.
mim <- lapply(DBcorp, stringr::str_extract_all, "(mim)[[:blank:]]*[[:digit:]]+", simplify = TRUE)
mim <- lapply(mim, stringr::str_extract_all, "[[:digit:]]+")
mim <- unique(unlist(mim))
dtm_mim <- DocumentTermMatrix(DBcorp, control = list(dictionary = mim))
df_mim <- data.frame(DOC = dtm_mim$dimnames$Docs, as.matrix(dtm_mim), row.names = NULL , check.names = FALSE)
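As a cross-check that does not depend on tm, the per-file frequencies can also be counted directly from the regex matches. The following is a minimal sketch in base R, assuming the documents are available as a named character vector like the sample in the question; the names `hits`, `ids`, `terms`, and `counts` are illustrative and do not come from the original code. It counts only numbers that directly follow `mim`, so its result can differ from a dictionary-based document term matrix, which counts every occurrence of a dictionary term.

```r
# Sample documents from the question, kept as a plain named character vector
docs <- c(
  doc1 = "mim 608106 letters 123 mim 606843 letters 1 letters 123456789 ",
  doc2 = "letters letters 1 mim 231550 123 letters",
  doc3 = "mim 103600 letters 123456"
)

# Step 1: for each document, collect every "mim <digits>" match
hits <- regmatches(docs, gregexpr("mim[[:blank:]]*[[:digit:]]+", docs))

# Step 2: strip the leading "mim" so only the digits remain
ids <- lapply(hits, function(x) sub("^mim[[:blank:]]*", "", x))
names(ids) <- names(docs)

# Step 3: the dictionary is the set of unique extracted numbers
terms <- unique(unlist(ids, use.names = FALSE))

# Step 4: count each dictionary term per document
# (tabulate() yields zeros for documents without any match)
counts <- t(vapply(
  ids,
  function(x) tabulate(match(x, terms), nbins = length(terms)),
  integer(length(terms))
))
colnames(counts) <- terms

print(counts)
```

If this matrix shows non-zero counts for the file-based corpus while the tm matrix does not, the problem probably lies in how the dictionary or the corpus is passed to DocumentTermMatrix rather than in the regular expression.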