我是R编程的新手......
我有多个文本文件,每行都有一个单词:
我想导入所有文本文件并创建数据框..这样的事情:
library(rTextTools)
data(USCongress)
View(USCongress)
我想将单词放在一行中,然后创建一个带有变量'text'的data.frame,就像在参考数据中一样(USCongress) 请帮忙
会话信息:
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_1.5 RTextTools_1.4.2 SparseM_1.05
loaded via a namespace (and not attached):
[1] bitops_1.0-6 BradleyTerry2_1.0-5 brglm_0.5-9 car_2.0-21 caret_6.0-37 caTools_1.17.1
[7] class_7.3-9 codetools_0.2-8 coin_1.0-24 colorspace_1.2-4 digest_0.6.4 e1071_1.6-4
[13] foreach_1.4.2 ggplot2_1.0.0 glmnet_1.9-8 grid_3.0.3 gtable_0.1.2 gtools_3.4.1
[19] ipred_0.9-3 iterators_1.0.7 kernlab_0.9-19 lattice_0.20-27 lava_1.2.6 lme4_1.1-7
[25] MASS_7.3-35 Matrix_1.1-2 maxent_1.3.3.1 minqa_1.2.4 modeltools_0.2-21 munsell_0.4.2
[31] mvtnorm_1.0-1 nlme_3.1-113 nloptr_1.0.4 nnet_7.3-8 parallel_3.0.3 party_1.0-18
[37] plyr_1.8.1 prodlim_1.4.5 proto_0.3-10 randomForest_4.6-10 Rcpp_0.11.3 reshape2_1.4
[43] rpart_4.1-5 sandwich_2.3-2 scales_0.2.4 slam_0.1-32 splines_3.0.3 stats4_3.0.3
[49] stringr_0.6.2 strucchange_1.5-0 survival_2.37-7 tau_0.0-18 tm_0.5-10 tools_3.0.3
[55] tree_1.0-35 zoo_1.7-11
我试过了:
Data <- paste0("h:/desktop/datasci/new",list.files("~/new/")) %>%
+ sapply(.,read.table) %>%
+ do.call(rbind,.) %>%
+ apply(.,1,paste0,collapse=" ") %>%
+ data.frame(text=.,row.names=NULL)
但这给了我一个错误:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'h:/desktop/datasci/new': Permission denied
由于
答案 0 :(得分:1)
这是一种方法,其中.txt
文件位于"~/tempfiles/"
目录中:
library(magrittr)
##
Df <- paste0("~/tempfiles/",list.files("~/tempfiles/")) %>%
sapply(.,read.table) %>%
do.call(rbind,.) %>%
apply(.,1,paste0,collapse=" ") %>%
data.frame(text=.,row.names=NULL)
R> Df
text
1 file1_line1 file1_line2 file1_line3 file1_line4 file1_line5
2 file2_line1 file2_line2 file2_line3 file2_line4 file2_line5
3 file3_line1 file3_line2 file3_line3 file3_line4 file3_line5
你可以在不使用magrittr
的情况下做到这一点,但它比看一堆嵌套函数调用更清晰。
对于上面的示例,我刚刚在一个丢弃的目录.txt
中制作了三个无意义的~/tempfiles/
文件:
R> list.files("~/tempfiles/")
[1] "file1.txt" "file2.txt" "file3.txt"
看起来像这样:
R> read.table("~/tempfiles/file1.txt")
V1
1 file1_line1
2 file1_line2
3 file1_line3
4 file1_line4
5 file1_line5
R> read.table("~/tempfiles/file2.txt")
V1
1 file2_line1
2 file2_line2
3 file2_line3
4 file2_line4
5 file2_line5
etc ... sapply
用于迭代目标目录中的所有文件,并使用do.call(rbind(...))
将结果合并到[3 x 5]数组中。这通过管道输入apply
,我们将5列中的每一列折叠成一个向量(长度为3),最后形成data.frame
。