Reading multiple text files into R for text mining

Time: 2017-01-27 04:08:11

Tags: r text-mining

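I have a batch of text files that I need to read into R in order to do text mining.

So far I have tried read.table, read.line, lapply, and mcsv_r from the qdap package, all to no avail. I also tried writing a loop to read the files, but I have to specify the file name, which changes on every iteration.

Here is what I have tried: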

# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"

# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")

for(i in 1:length(speeches))
  {

    text_df <- do.call(rbind,lapply(speeches[i],read.csv))

}

I have also tried the following:

library(data.table)  
files <- list.files(path = folder.path,pattern = ".csv")
temp <- lapply(files, fread, sep=",")
data <- rbindlist( temp )

It gives me this error even though inaugAbrahamLincoln-1.csv clearly exists in the folder:

files <- list.files(path = folder.path,pattern = ".csv")
> temp <- lapply(files, fread, sep=",")
Error in FUN(X[[i]], ...) : 
  File 'inaugAbrahamLincoln-1.csv' does not exist. Include one or more spaces to consider the input a system command.
> data <- rbindlist( temp )
Error in rbindlist(temp) : object 'temp' not found
> 

Besides, that approach only handles .csv files, not .txt files.

Is there an easier way to do text mining from multiple source files? If so, how?

Thanks

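3 Answers:

Answer 0 (score: 3)

I run into this same problem often. The textreadr package, which I maintain, is designed to make it easy to read .csv, .pdf, .doc, and .docx documents, as well as directories of such documents. It would reduce what you are doing to: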

textreadr::read_dir("../data/InauguralSpeeches/")

Your example is not reproducible, so I build one below (please make your examples reproducible in the future).

library(textreadr)

## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')

## the read in of a directory
read_dir('delete_me') 

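Output

The output below is a tibble in which the document column records which document each row came from. Each document gets one row per line of its content. Depending on what is in the csv files, this may not be fine-grained enough.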

##    document                                  content
## 1       0_9 Bromwell High is a cartoon comedy. It ra
## 2     00_00                                     test
## 3     00_00                                         
## 4     00_00                                  testing
## 5     00_00                                         
## 6     00_00                                   tester
## 7       1_7 If you like adult comedy cartoons, like 
## 8      10_9 I'm a male, not given to women's movies,
## 9      11_9 Liked Stanley & Iris very much. Acting w
## 10     12_9 Liked Stanley & Iris very much. Acting w
## ..      ...                                      ... 
## 141   mtcars "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142   mtcars "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143   mtcars "Volvo 142E",21.4,4,121,109,4.11,2.78,18
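For text mining, one row per document may be more convenient than one row per line. The following is a minimal base R sketch of that reshaping, not part of textreadr. It assumes only the document and content column names shown in the output above; the toy values and the names lines_df, non_empty, collapsed, and docs_df are hypothetical.

```r
# Toy data frame mimicking the document/content structure shown above
# (values are made up for illustration)
lines_df <- data.frame(
  document = c("00_00", "00_00", "00_00", "1_7"),
  content  = c("test", "", "testing", "If you like adult comedy cartoons"),
  stringsAsFactors = FALSE
)

# Drop empty lines, then paste each document's lines into one string
non_empty <- lines_df[nzchar(lines_df$content), ]
collapsed <- tapply(non_empty$content, non_empty$document, paste, collapse = " ")

# Rebuild a plain data frame with one row per document
docs_df <- data.frame(
  document = names(collapsed),
  content  = as.vector(collapsed),
  stringsAsFactors = FALSE
)
print(docs_df)
```

Whether to join lines with a space or a newline depends on the tokenizer used afterwards; a space is used here only to keep the example short.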

Answer 1 (score: 2)

The following code reads all *.csv files in a directory into a single data.frame:

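dir <- '~/Desktop/testcsv/'
# pattern is a regular expression, so escape the dot and anchor the extension
files <- list.files(dir, pattern = '\\.csv$', full.names = TRUE)
data <- lapply(files, read.csv)
# stack the per-file data frames (they must share the same columns)
df <- do.call(rbind, data)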

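Note that I added the argument full.names = TRUE. It prepends the directory path to each file name. Without it, list.files returns bare file names, which are then looked up in the current working directory rather than in your data folder. That is why you got the error for 'inaugAbrahamLincoln-1.csv' even though the file exists.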
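The difference is easy to see on a throwaway directory. This is a small sketch under assumptions: the directory and file names (full_names_demo, example.csv) are hypothetical and created only for the demonstration.

```r
# Create a temporary directory holding one small CSV file
demo_dir <- file.path(tempdir(), "full_names_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("a,b\n1,2", file.path(demo_dir, "example.csv"))

# Same listing, with and without the directory prepended
bare  <- list.files(demo_dir, pattern = "\\.csv$")
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

print(bare)                # file name only
print(file.exists(bare))   # looked up relative to getwd()
print(file.exists(paths))  # includes the directory, so the file is found

# Reading works with the full path regardless of the working directory
df_demo <- read.csv(paths[1])
```

An alternative with the same effect is to keep the bare names and build the paths yourself with file.path(demo_dir, bare).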

Answer 2 (score: 1)

Here is one way to do it.

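library(data.table)
# work from the folder that holds the CSV files so bare file names resolve
setwd("C:/Users/Excel/Desktop/CSV Files/")

WD="C:/Users/Excel/Desktop/CSV Files/"
# read headers: start from an empty table with the expected column names
data<-data.table(read.csv(text="CashFlow,Cusip,Period"))

# bare file names; read.csv finds them because of setwd() above
csv.list<- list.files(WD)
k=1

for (i in csv.list){
  temp.data<-read.csv(i)
  data<-data.table(rbind(data,temp.data))

  # after every 100 files, print the fraction of files processed so far
  if (k %% 100 == 0)
    print(k/length(csv.list))

  k<-k+1
}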
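Since the files in the question are plain .txt rather than CSV, a variant for text files may be closer to what is needed. The following is a minimal base R sketch, not the asker's or any package's method: it reads each file with readLines and keeps one row per file. The folder, file names, sample text, and the names txt_dir, txt_files, texts, and speeches_df are all hypothetical.

```r
# Throwaway folder with two small text files (made-up names and content)
txt_dir <- file.path(tempdir(), "txt_demo")
dir.create(txt_dir, showWarnings = FALSE)
writeLines(c("Fellow citizens,", "I appear before you."),
           file.path(txt_dir, "speech-1.txt"))
writeLines("My countrymen.", file.path(txt_dir, "speech-2.txt"))

# full.names = TRUE so the paths resolve from any working directory
txt_files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file and collapse its lines into a single string
texts <- vapply(
  txt_files,
  function(f) paste(readLines(f, warn = FALSE), collapse = " "),
  character(1)
)

# One row per file, labelled by the file name without its extension
speeches_df <- data.frame(
  document = tools::file_path_sans_ext(basename(txt_files)),
  text     = unname(texts),
  stringsAsFactors = FALSE
)
print(speeches_df)
```

Using vapply instead of growing a table inside a loop avoids repeated copying, and the resulting data frame can be handed to whichever text-mining package is used next.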