Chunking a txt file in R

Asked: 2017-03-06 21:58:21

Tags: r text topic-modeling

Hi all,

I am working through Matthew Jockers' code from his book "Text Analysis with R for Students of Literature".

In it, he provides code that extracts all the <p> tags from an XML document, splits that content into 1000-word chunks, and applies a bunch of data-massaging techniques. Once that's done, he plugs the chunking function into a loop that builds a data matrix that can be used in MALLET. See the code below.

My question is: how do I do the same thing with .txt files? Obviously, plain text files have no tags like <p> to work from. I'm not an experienced programmer, so please go easy on me!

library(XML)  # for xmlTreeParse(), getNodeSet(), xmlValue()

chunk.size <- 1000 # number of words per chunk
makeFlexTextChunks <- function(doc.object, chunk.size = 1000, percentage = TRUE) {
  paras <- getNodeSet(doc.object,
                      "/d:TEI/d:text/d:body//d:p",
                      c(d = "http://www.tei-c.org/ns/1.0"))
  words <- paste(sapply(paras, xmlValue), collapse = " ")
  words.lower <- tolower(words)
  words.lower <- gsub("[^[:alnum:][:space:]']", " ", words.lower)
  words.l <- strsplit(words.lower, "\\s+")
  word.v <- unlist(words.l)
  x <- seq_along(word.v)
  if (percentage) {
    max.length <- length(word.v) / chunk.size
    chunks.l <- split(word.v, ceiling(x / max.length))
  } else {
    chunks.l <- split(word.v, ceiling(x / chunk.size))
    # merge a trailing chunk shorter than half the chunk size into the previous one
    if (length(chunks.l[[length(chunks.l)]]) <= chunk.size / 2) {
      chunks.l[[length(chunks.l) - 1]] <-
        c(chunks.l[[length(chunks.l) - 1]],
          chunks.l[[length(chunks.l)]])
      chunks.l[[length(chunks.l)]] <- NULL
    }
  }
  chunks.l <- lapply(chunks.l, paste, collapse = " ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}


topic.m <- NULL
for (i in 1:length(files.v)) {
  doc.object <- xmlTreeParse(file.path(input.dir, files.v[i]),
                             useInternalNodes = TRUE)
  chunk.m <- makeFlexTextChunks(doc.object, chunk.size,
                                percentage = FALSE)
  textname <- gsub("\\..*", "", files.v[i])
  segments.m <- cbind(paste(textname,
                            segment = 1:nrow(chunk.m), sep = "_"), chunk.m)
  topic.m <- rbind(topic.m, segments.m)
}

2 Answers:

Answer 0 (score: 1):

Thank you everyone for your help. I think I found the answer after much trial and error! The key was to use scan(paste(input.dir, files.v[i], sep="/")) in the loop rather than inside the function to pull in the txt files. See my code here:

input.dir <- "data/plainText"
files.v <- dir(input.dir, ".*txt")
chunk.size <- 100 # number of words per chunk

makeFlexTextChunks <- function(doc.object, chunk.size = 100, percentage = TRUE) {
  words.lower <- tolower(paste(doc.object, collapse = " "))
  words.lower <- gsub("[^[:alnum:][:space:]']", " ", words.lower)
  words.l <- strsplit(words.lower, "\\s+")
  word.v <- unlist(words.l)
  x <- seq_along(word.v)

  if (percentage) {
    max.length <- length(word.v) / chunk.size
    chunks.l <- split(word.v, ceiling(x / max.length))
  } else {
    chunks.l <- split(word.v, ceiling(x / chunk.size))
    # merge a trailing chunk shorter than half the chunk size into the previous one
    if (length(chunks.l[[length(chunks.l)]]) <= chunk.size / 2) {
      chunks.l[[length(chunks.l) - 1]] <-
        c(chunks.l[[length(chunks.l) - 1]],
          chunks.l[[length(chunks.l)]])
      chunks.l[[length(chunks.l)]] <- NULL
    }
  }
  chunks.l <- lapply(chunks.l, paste, collapse = " ")
  chunks.df <- do.call(rbind, chunks.l)
  return(chunks.df)
}

topic.m <- NULL
for (i in 1:length(files.v)) {
  doc.object <- scan(paste(input.dir, files.v[i], sep = "/"),
                     what = "character", sep = "\n")
  chunk.m <- makeFlexTextChunks(doc.object, chunk.size, percentage = FALSE)
  textname <- gsub("\\..*", "", files.v[i])
  segments.m <- cbind(paste(textname, segment = 1:nrow(chunk.m), sep = "_"), chunk.m)
  topic.m <- rbind(topic.m, segments.m)
}
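
If the goal is the same topic-modeling workflow as in Jockers' book, the two columns of topic.m (segment id and chunk text) can then be handed to the mallet package. Here is a minimal sketch, assuming the mallet package is installed and that "data/stoplist.csv" (a hypothetical path) points to a one-word-per-line stoplist; the topic and iteration counts are only illustrative:

library(mallet)

# topic.m has two columns: the segment id and the chunk text
documents <- as.data.frame(topic.m, stringsAsFactors = FALSE)
colnames(documents) <- c("id", "text")

# "data/stoplist.csv" is an assumed stoplist path (one word per line)
mallet.instances <- mallet.import(documents$id, documents$text,
                                  "data/stoplist.csv",
                                  token.regexp = "[\\p{L}']+")
topic.model <- MalletLDA(num.topics = 20)   # number of topics is illustrative
topic.model$loadDocuments(mallet.instances)
topic.model$train(400)                      # iteration count is illustrative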

Answer 1 (score: 0):

Maybe this can point you in the right direction. The following code reads in a txt file and splits the words into the elements of a vector.

library(readr)
library(stringr)

url <- "http://www.gutenberg.org/files/98/98-0.txt"
mystring <- read_file(url)
res <- str_split(mystring, "\\s+")

Then you could split it into 1000-word chunks and work your magic?
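
For that last step, here is a minimal sketch using the same split/ceiling idiom as the code above (word.v, chunks.l, and chunk.v are just illustrative names):

word.v <- unlist(res)  # str_split() returns a list, so flatten it into a plain word vector
chunk.size <- 1000
chunks.l <- split(word.v, ceiling(seq_along(word.v) / chunk.size))
chunk.v <- sapply(chunks.l, paste, collapse = " ")  # one string per 1000-word chunk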