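Unable to read a text file as CSV?

Time: 2012-09-27 07:21:45

Tags: r csv

I have text data separated by commas (","). A sample of the data is given below (the first line holds the column names):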

userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2

I am using the following syntax:

appsession <- read.table("C:/.../AppSession.txt", sep = ",", 
  col.names = c("userID","appName","startTime","endTime","endResult"), 
  fill = FALSE, strip.white = TRUE)

I get this error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 5 elements

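3 answers:

Answer 0 (score: 3)

If you have a blank line and plan to use 'col.names' without skip = 2, I think you need to use header=TRUE. As it stands, your code works (after a fashion) on a simple text version of the data, although the header line is read in as the first data row: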

> txt <- "userID,appName,startTime,endTime,endResult
+ chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
+ chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
+ chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
+ chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
+ chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
+ chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
+ chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
+ chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
+ chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
+ chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2
+ "
> appsession <- read.table(text=txt, sep = ",", 
+   col.names = c("userID","appName","startTime","endTime","endResult"), 
+   fill = FALSE, strip.white = TRUE)
> 
> appsession
    userID      appName           startTime             endTime endResult
1   userID      appName           startTime             endTime endResult
2  chhieut gms.mos.test 2012-07-01 02:47:16 2012-07-01 02:47:46         1
3  chhieut gms.mos.test 2012-07-01 03:11:46 2012-07-01 03:12:25         2
4  chhieut gms.mos.test 2012-07-01 03:13:36 2012-07-01 03:14:03         2
5  chhieut gms.mos.test 2012-07-01 03:18:26 2012-07-01 03:18:58         2
6  chhieut gms.mos.test 2012-07-01 04:10:36 2012-07-01 04:10:54         2
7  chhieut gms.mos.test 2012-07-01 04:38:26 2012-07-01 04:38:48         2
8  chhieut gms.mos.test 2012-07-01 04:48:56 2012-07-01 04:49:04         3
9  chhieut gms.mos.test 2012-07-01 05:49:46 2012-07-01 05:50:14         2
10 chhieut gms.mos.test 2012-07-01 06:19:07 2012-07-01 06:19:25         2
11 chhieut gms.mos.test 2012-07-01 07:09:17 2012-07-01 07:09:47         2
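The repeated header in row 1 of the output above comes from read.table's default of header = FALSE, which treats the first line as data. A minimal sketch of the two usual ways around it, using two of the sample rows; the variable names (sample_lines, cols, with_header, with_skip) are illustrative only and do not come from the original post:

```r
# Two rows of the sample data plus the header line, held in memory.
sample_lines <- c(
  "userID,appName,startTime,endTime,endResult",
  "chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1",
  "chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2"
)
cols <- c("userID", "appName", "startTime", "endTime", "endResult")

# Option 1: take the column names from the header line.
with_header <- read.table(text = sample_lines, sep = ",",
                          header = TRUE, strip.white = TRUE)

# Option 2: skip the header line and supply the names explicitly.
with_skip <- read.table(text = sample_lines, sep = ",", skip = 1,
                        col.names = cols, strip.white = TRUE)

nrow(with_header)  # number of data rows, header excluded
nrow(with_skip)
```

Either form should give two data rows here; with a real file, the text argument would be replaced by the file path.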

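You should either use the header or skip the header line (and also skip any blank lines). One way to see how many blank lines there are is to look at the output of count.fields( ..., sep=","). Another way to see what the read.* and scan functions are "seeing" is to run this code (substituting appropriately for the ellipsis):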

appLines <- readLines("C:/.../AppSession.txt")
appLines[1:5] # will display the first 5 lines from that file 
              # with no attempt to deal with any separators.
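The count.fields() check mentioned above could look roughly like the sketch below. It works on an in-memory copy of the data with one blank line deliberately inserted; the vector file_lines stands in for the result of readLines() on the real file and is an assumption for illustration:

```r
# Stand-in for readLines("C:/.../AppSession.txt"); the blank second
# line is added on purpose to show what the checks report.
file_lines <- c(
  "userID,appName,startTime,endTime,endResult",
  "",
  "chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1",
  "chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2"
)

# Positions of lines that are empty or contain only whitespace.
blank <- grepl("^\\s*$", file_lines)
which(blank)

# Number of comma-separated fields on each non-blank line
# (count.fields skips blank lines by default).
con <- textConnection(file_lines)
n_fields <- count.fields(con, sep = ",")
close(con)
n_fields
```

Any line whose field count differs from the others is a candidate for the line that scan() is complaining about.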

Answer 1 (score: 2)

You need to provide a link to your actual data set, because the data you posted reads in fine:

d = read.csv(textConnection("userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2"), header=TRUE)

A quick check:

R> head(d, 1)
   userID      appName           startTime             endTime endResult
1 chhieut gms.mos.test 2012-07-01 02:47:16 2012-07-01 02:47:46         1
R> dim(d)
[1] 10  5

Make sure there are no blank lines in your actual file; they really can mess things up.
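If editing the file by hand is not convenient, one option is to drop empty or whitespace-only lines before parsing. This is only a sketch under assumptions: raw_lines stands in for readLines() on the real file, and the names raw_lines, clean_lines and d2 are made up for illustration:

```r
# Stand-in for readLines("C:/.../AppSession.txt"), with blank and
# whitespace-only lines mixed in.
raw_lines <- c(
  "",
  "userID,appName,startTime,endTime,endResult",
  "   ",
  "chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1",
  "chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2"
)

# Keep only lines that contain something other than whitespace.
clean_lines <- raw_lines[!grepl("^\\s*$", raw_lines)]

# Parse the cleaned lines; the first remaining line is the header.
d2 <- read.csv(text = clean_lines, header = TRUE)
dim(d2)
```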

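Answer 2 (score: 2)

With a suitably edited version of your data (i.e. with all the blank lines removed!), it loads easily into R via read.csv(). Note that I am using a text connection containing the data so that I do not have to write the data to a file. Just replace con with your file name in the call to read.csv().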

con <- textConnection("userID,appName,startTime,endTime,endResult
chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1
chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2
chhieut,gms.mos.test,2012-07-01 03:13:36,2012-07-01 03:14:03,2
chhieut,gms.mos.test,2012-07-01 03:18:26,2012-07-01 03:18:58,2
chhieut,gms.mos.test,2012-07-01 04:10:36,2012-07-01 04:10:54,2
chhieut,gms.mos.test,2012-07-01 04:38:26,2012-07-01 04:38:48,2
chhieut,gms.mos.test,2012-07-01 04:48:56,2012-07-01 04:49:04,3
chhieut,gms.mos.test,2012-07-01 05:49:46,2012-07-01 05:50:14,2
chhieut,gms.mos.test,2012-07-01 06:19:07,2012-07-01 06:19:25,2
chhieut,gms.mos.test,2012-07-01 07:09:17,2012-07-01 07:09:47,2
")

dat <- read.csv(con,
                colClasses = c(rep("character", 2), rep("POSIXct", 2),
                               "numeric"))
close(con) ## closing connection, not needed with a file

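Also note that by specifying the colClasses argument we tell R what type each column is before it reads the data, which saves some formatting work later, especially for the date-time columns. We can do this here because you stored the date-time variables in the correct format.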

R> head(dat)
   userID      appName           startTime             endTime endResult
1 chhieut gms.mos.test 2012-07-01 02:47:16 2012-07-01 02:47:46         1
2 chhieut gms.mos.test 2012-07-01 03:11:46 2012-07-01 03:12:25         2
3 chhieut gms.mos.test 2012-07-01 03:13:36 2012-07-01 03:14:03         2
4 chhieut gms.mos.test 2012-07-01 03:18:26 2012-07-01 03:18:58         2
5 chhieut gms.mos.test 2012-07-01 04:10:36 2012-07-01 04:10:54         2
6 chhieut gms.mos.test 2012-07-01 04:38:26 2012-07-01 04:38:48         2
R> str(dat)
'data.frame':   10 obs. of  5 variables:
 $ userID   : chr  "chhieut" "chhieut" "chhieut" "chhieut" ...
 $ appName  : chr  "gms.mos.test" "gms.mos.test" "gms.mos.test" "gms.mos.test" ...
 $ startTime: POSIXct, format: "2012-07-01 02:47:16" "2012-07-01 03:11:46" ...
 $ endTime  : POSIXct, format: "2012-07-01 02:47:46" "2012-07-01 03:12:25" ...
 $ endResult: num  1 2 2 2 2 2 3 2 2 2
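As an illustration of the formatting work this saves, the POSIXct columns can be used in arithmetic straight away. The sketch below re-reads two of the sample rows with the same colClasses and computes each session's length in seconds; the names session_lines, sessions and duration_secs are hypothetical, and the times are interpreted in the session's local time zone:

```r
# Two rows of the sample data plus the header line.
session_lines <- c(
  "userID,appName,startTime,endTime,endResult",
  "chhieut,gms.mos.test,2012-07-01 02:47:16,2012-07-01 02:47:46,1",
  "chhieut,gms.mos.test,2012-07-01 03:11:46,2012-07-01 03:12:25,2"
)

# Same column classes as in the answer above.
sessions <- read.csv(text = session_lines,
                     colClasses = c(rep("character", 2),
                                    rep("POSIXct", 2),
                                    "numeric"))

# Date-time columns are already POSIXct, so durations need no parsing.
duration_secs <- as.numeric(difftime(sessions$endTime,
                                     sessions$startTime,
                                     units = "secs"))
duration_secs
```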