Changing data types in an R corpus extraction

Asked: 2018-07-28 10:27:10

Tags: data-extraction

I have a collection of files containing plain text only. I want to extract ngrams from each text and save each file, under its original file name, as a three-column matrix.

library(tokenizers)

myTokenizer <- function(x, n, n_min) {
  tok <- unlist(tokenize_ngrams(as.character(x), n = n, n_min = n_min))
  M <- matrix(nrow = length(tok), ncol = 3,
              dimnames = list(NULL, c("gram", "num.words", "words")))
}

corp <- "this is a full text "
corp <- tm_map(corp, content_transformer(function(x) myTokenizer(x, n = 3, n_min = 1)))

writeCorpus(corp)

2 Answers:

Answer 0 (score: 2)

Since I do not have your corpus, I created my own using the crude dataset that ships with tm. There is no need to use tm_map, since that keeps the data in corpus format; the tokenizers package can handle this on its own.

What I do is store all the needed matrices in a list object via lapply, and then use sapply to write them out as separate files in the crude directory.

Be aware that the matrix specified in the function will be a character matrix. This means that the first and second columns will be characters, not numbers.

library(tm)
data("crude")
crude <- as.VCorpus(crude)

myTokenizer <- function(x, n, n_min) {
  # tokenize the document into ngrams of length n_min up to n
  tok <- unlist(tokenizers::tokenize_ngrams(as.character(x), n = n, n_min = n_min))
  # character matrix: running index, number of words in the ngram, the ngram itself
  M <- matrix(nrow = length(tok), ncol = 3,
              dimnames = list(NULL, c("gram", "num.words", "words")))
  M[, 3] <- tok
  M[, 2] <- lengths(strsplit(M[, 3], "\\W+"))  # counts the words
  M[, 1] <- 1:length(tok)
  return(M)
}

my_matrices <- lapply(crude, myTokenizer, n = 3, n_min = 1)

#  make sure directory crude exists as a subfolder in working directory
sapply(names(my_matrices), 
       function (x) write.table(my_matrices[[x]], file=paste("crude/", x, ".txt", sep=""), row.names = FALSE))

Result for the first file:

"gram" "num.words" "words"
"1" "1" "diamond"
"2" "2" "diamond shamrock"
"3" "3" "diamond shamrock corp"
"4" "1" "shamrock"
"5" "2" "shamrock corp"
"6" "3" "shamrock corp said"

Answer 1 (score: 0)

I would suggest creating a document-term matrix (DTM). You will probably need it for downstream tasks anyway, and you can extract the required information from it as well. The assumption that a term (including ngrams) belongs to only one document is probably not reasonable (at least that is what I understand from your question; please correct me if I am wrong). So in practice a term will usually have several documents associated with it, and that kind of information is normally stored in a DTM.

Below is an example with text2vec. If you can describe in more detail how you want to use the terms, etc., I can adapt the code to your needs.

library(text2vec)
# I have set up two texts that do not overlap in any term, just as an example
# in practice, this probably never happens
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
as.matrix(dtm)
#    a a_text and and_another and_another_one another another_one here here_a here_a_text one text
# d1 1      1   0           0               0       0           0    1      1           1   0    1
# d2 0      0   1           1               1       1           1    0      0           0   1    0

To get the per-document, per-term output you described, the DTM can be split back into one file per document:

library(stringi)
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
for (d in rownames(dtm)) {
  v = dtm[d, ]
  v = v[v != 0]  # keep only terms that occur in document d
  v = data.frame(number = 1:length(v)
                 ,term = names(v))
  v$n = stri_count_fixed(v$term, "_") + 1  # words per term = underscores + 1
  write.csv(v, file = paste0("v_", d, ".csv"), row.names = F)
}
read.csv("v_d1.csv")
#   number        term n
# 1      1           a 1
# 2      2      a_text 2
# 3      3        here 1
# 4      4      here_a 2
# 5      5 here_a_text 3
# 6      6        text 1
read.csv("v_d2.csv")
#   number            term n
# 1      1             and 1
# 2      2     and_another 2
# 3      3 and_another_one 3
# 4      4         another 1
# 5      5     another_one 2
# 6      6             one 1
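
If you also want the original file name next to each term in a single combined table, here is a minimal sketch, not part of the original answer; it assumes the v_*.csv files written by the loop above are in the working directory:

# combine the per-document files, keeping the document name as a column
files <- list.files(pattern = "^v_.*\\.csv$")
all_terms <- do.call(rbind, lapply(files, function(f) {
  cbind(document = sub("^v_(.*)\\.csv$", "\\1", f),
        read.csv(f, stringsAsFactors = FALSE))
}))
head(all_terms)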