How to read a text file in R as a single line

Asked: 2014-12-07 03:05:49

Tags: regex r text text-processing text-mining

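I am trying to process a text file. Overall, I have a corpus that I want to analyze. To create a Corpus object with the tm package (the text mining package in R), I need to turn this paragraph into one giant vector so that it is read correctly.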

I have a paragraph:

          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.

I have tried scan and readLines, and they split the text like this:

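[28] "Commercial exploitation over the past two hundred years drove"
[29] "the great Mysticete whales to near extinction.  Variation in"
[30] "the sizes of populations prior to exploitation, minimal"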

Is there a way to get rid of the line breaks? Or to read the text file as one giant vector?

All the solutions posted so far are great, thank you.

3 Answers:

Answer 0: (score: 4)

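If you do too much processing while reading the file, the read may take a long time. Consider reading it in unchanged and making the changes afterwards. The stringi package has a function for exactly this operation. Its authors wrote it in C, so its functions are fast.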

Assuming you have already read in the file and named it txt:

library(stringi)
stri_flatten(txt)
# [1] "          Commercial exploitation over the past two hundred years drove                  \n          the great Mysticete whales to near extinction.  Variation in                   \n          the sizes of populations prior to exploitation, minimal                        \n          population size during exploitation and current population                     \n          sizes permit analyses of the effects of differing levels of                    \n          exploitation on species with different biogeographical                         \n          distributions and life-history characteristics."

The string keeps the same formatting; it has only been flattened. To check, we can look at it with cat:

cat(stri_flatten(txt))
          Commercial exploitation over the past two hundred years drove                  
          the great Mysticete whales to near extinction.  Variation in                   
          the sizes of populations prior to exploitation, minimal                        
          population size during exploitation and current population                     
          sizes permit analyses of the effects of differing levels of                    
          exploitation on species with different biogeographical                         
          distributions and life-history characteristics.
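
If the goal is a single line with no padding, one possible variation is to trim each line before flattening. This is only a sketch: it assumes txt holds one element per line (as readLines would return it), the sample lines below are copied from the question, and the name one_line is made up for illustration.

```r
library(stringi)

# Assumed input: one element per line, padded with spaces on both sides
txt <- c("          Commercial exploitation over the past two hundred years drove        ",
         "          the great Mysticete whales to near extinction.  Variation in         ")

# Strip the padding from each line, then join the lines with a single space
one_line <- stri_flatten(stri_trim_both(txt), collapse = " ")

cat(one_line, "\n")
```

Here collapse = " " replaces the line boundaries with spaces, so the result is a character vector of length one without any "\n".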

Answer 1: (score: 3)

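I ran into the same problem a while ago and found a workaround: read the individual lines, then paste them together, which removes the "\n" line breaks: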

filename <- "tmp.txt"
paste0(readLines(filename),collapse=" ")

If you need to keep the line breaks, you can read the file as a single string:

readChar(filename,1e5)

Specify a sufficiently large number of characters (100000 in this case); anything beyond that limit is not read.
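
The readLines approach can be extended a little to drop the padding around each line as well. The following is a minimal base R sketch, not part of the original answer: it writes three sample lines from the question to a temporary file (the names tmp, lines and one_line are placeholders) and then collapses them.

```r
# Write a few sample lines to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("          Commercial exploitation over the past two hundred years drove   ",
             "          the great Mysticete whales to near extinction.  Variation in    ",
             "          the sizes of populations prior to exploitation, minimal"), tmp)

lines <- readLines(tmp)                               # one element per line
one_line <- paste(trimws(lines), collapse = " ")      # trim padding, join with spaces
one_line <- gsub("[[:space:]]+", " ", one_line)       # squeeze repeated whitespace

unlink(tmp)                                           # remove the temporary file
cat(one_line, "\n")
```

The final gsub call is optional; it also turns the double space after "extinction." into a single one, which may or may not be wanted for text mining.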

Answer 2: (score: 3)

This reads the whole file into a character vector of length one.

x <- readChar(file, file.info(file)$size)
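
The string returned by readChar still contains the line breaks. If they should be removed afterwards, something along the following lines might work. It is a sketch under assumptions: the file contents and the names path, raw_text and flat are invented for illustration.

```r
# Create a small sample file (hypothetical contents)
path <- tempfile(fileext = ".txt")
writeLines(c("   first line of text   ", "   second line of text"), path)

# Read the whole file as one string; the size in bytes is a safe upper
# bound on the number of characters to request
raw_text <- readChar(path, file.info(path)$size)

# Replace each line break, plus any whitespace around it, with one space,
# then trim what is left at both ends
flat <- trimws(gsub("[[:space:]]*\n[[:space:]]*", " ", raw_text))

unlink(path)
cat(flat, "\n")
```

Either raw_text or flat is a character vector of length one, so it should be usable as a single document, for example via tm's VectorSource, which treats each element of a vector as one document.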