如何使用r从多个文本文件创建数据框?

时间:2014-12-06 20:44:12

标签: r nlp rstudio

我是R编程的新手......

我有多个文本文件,每行都有一个单词:

sample file

我想导入所有文本文件并创建数据框..这样的事情:

library(rTextTools)
data(USCongress)
View(USCongress)

data frame

我想将单词放在一行中,然后创建一个带有变量'text'的data.frame,就像在参考数据中一样(USCongress) 请帮忙

会话信息:

R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5     RTextTools_1.4.2 SparseM_1.05    

loaded via a namespace (and not attached):
 [1] bitops_1.0-6        BradleyTerry2_1.0-5 brglm_0.5-9         car_2.0-21          caret_6.0-37        caTools_1.17.1     
 [7] class_7.3-9         codetools_0.2-8     coin_1.0-24         colorspace_1.2-4    digest_0.6.4        e1071_1.6-4        
[13] foreach_1.4.2       ggplot2_1.0.0       glmnet_1.9-8        grid_3.0.3          gtable_0.1.2        gtools_3.4.1       
[19] ipred_0.9-3         iterators_1.0.7     kernlab_0.9-19      lattice_0.20-27     lava_1.2.6          lme4_1.1-7         
[25] MASS_7.3-35         Matrix_1.1-2        maxent_1.3.3.1      minqa_1.2.4         modeltools_0.2-21   munsell_0.4.2      
[31] mvtnorm_1.0-1       nlme_3.1-113        nloptr_1.0.4        nnet_7.3-8          parallel_3.0.3      party_1.0-18       
[37] plyr_1.8.1          prodlim_1.4.5       proto_0.3-10        randomForest_4.6-10 Rcpp_0.11.3         reshape2_1.4       
[43] rpart_4.1-5         sandwich_2.3-2      scales_0.2.4        slam_0.1-32         splines_3.0.3       stats4_3.0.3       
[49] stringr_0.6.2       strucchange_1.5-0   survival_2.37-7     tau_0.0-18          tm_0.5-10           tools_3.0.3        
[55] tree_1.0-35         zoo_1.7-11

我试过了:

Data <- paste0("h:/desktop/datasci/new",list.files("~/new/")) %>%
+     sapply(.,read.table) %>%
+     do.call(rbind,.) %>%
+     apply(.,1,paste0,collapse=" ") %>%
+     data.frame(text=.,row.names=NULL) 

但这给了我一个错误:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'h:/desktop/datasci/new': Permission denied

由于

1 个答案:

答案 0 :(得分:1)

这是一种方法,其中.txt文件位于"~/tempfiles/"目录中:

library(magrittr)
##
Df <- paste0("~/tempfiles/",list.files("~/tempfiles/")) %>%
  sapply(.,read.table) %>%
  do.call(rbind,.) %>%
  apply(.,1,paste0,collapse=" ") %>%
  data.frame(text=.,row.names=NULL)
R> Df
                                                         text
1 file1_line1 file1_line2 file1_line3 file1_line4 file1_line5
2 file2_line1 file2_line2 file2_line3 file2_line4 file2_line5
3 file3_line1 file3_line2 file3_line3 file3_line4 file3_line5

你可以在不使用magrittr的情况下做到这一点,但它比看一堆嵌套函数调用更清晰。

对于上面的示例,我刚刚在一个丢弃的目录.txt中制作了三个无意义的~/tempfiles/文件:

R> list.files("~/tempfiles/")
[1] "file1.txt" "file2.txt" "file3.txt"

看起来像这样:

R> read.table("~/tempfiles/file1.txt")
           V1
1 file1_line1
2 file1_line2
3 file1_line3
4 file1_line4
5 file1_line5
R> read.table("~/tempfiles/file2.txt")
           V1
1 file2_line1
2 file2_line2
3 file2_line3
4 file2_line4
5 file2_line5

etc ... sapply用于迭代目标目录中的所有文件,并使用do.call(rbind(...))将结果合并到[3 x 5]数组中。这通过管道输入apply,我们将5列中的每一列折叠成一个向量(长度为3),最后形成data.frame