Error when using colClasses in read.table

Posted: 2015-03-06 14:31:13

Tags: r read.table

I want to read a text file (with read.table) that has three columns, including strings such as "000000", but I get 0 instead of the full string. I tried:

X<-read.table(ouvrefic, header=TRUE, row.names=1, sep="",colClasses=c("integer","character","factor"))

I got:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
scan() expected 'an integer', got '"1"' (problem comes from row.names, I suppose)

What should I do?

Thanks a lot.

The beginning of my text file:

"" "dates" "Atscan2" "pqrPQR"
"1" "18369" "0000000000000" "1110"
"2" "18369" "0000000000000" "1220,0"
"3" "18369" "0000000000000" "2220"
"4" "18369" "0000000000000" "1230,0,0"
"5" "18369" "0000000000000" "1330,0"
"6" "18369" "0000000000000" "2330,0"
"7" "18369" "0000000000000" "3330"

2 answers:

Answer 0: (score: 1)

The problem is with the colClasses argument:

First, even though you use the first column as row.names, the file still has 4 columns, so the colClasses vector needs four elements.

If you need all the zeros to be kept, that column has to be read as character.

The following works:

df <- read.table(header=TRUE, text='"" "dates" "Atscan2" "pqrPQR"
"1" "18369" "0000000000000" "1110"
"2" "18369" "0000000000000" "1220,0"
"3" "18369" "0000000000000" "2220"
"4" "18369" "0000000000000" "1230,0,0"
"5" "18369" "0000000000000" "1330,0"
"6" "18369" "0000000000000" "2330,0"
"7" "18369" "0000000000000" "3330"', 
row.names=1, 
colClasses=c('character', 'character',"character","factor"))

Output:

> df
  dates       Atscan2   pqrPQR
1 18369 0000000000000     1110
2 18369 0000000000000   1220,0
3 18369 0000000000000     2220
4 18369 0000000000000 1230,0,0
5 18369 0000000000000   1330,0
6 18369 0000000000000   2330,0
7 18369 0000000000000     3330

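As shown above, the problem is that when a column's values are quoted (like the dates column), specifying integer for it in colClasses does not work, so I read that column as character as well. You can always convert it to integer afterwards with as.integer.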
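
The conversion step mentioned above can be sketched as follows. This is only an illustration on a shortened copy of the question's data; the variable name txt is an assumption made for the example.

```r
# Shortened copy of the question's data; every field is double-quoted.
txt <- '"" "dates" "Atscan2" "pqrPQR"
"1" "18369" "0000000000000" "1110"
"2" "18369" "0000000000000" "1220,0"
"3" "18369" "0000000000000" "2220"'

# Read the quoted columns as character so the leading zeros survive.
# The first colClasses entry belongs to the row-names column.
df <- read.table(text = txt, header = TRUE, row.names = 1,
                 colClasses = c("character", "character", "character", "factor"))

# Convert only the column that really holds numbers.
df$dates <- as.integer(df$dates)

str(df)
```

Atscan2 stays a character column, so its zeros are preserved, while dates becomes an ordinary integer column.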

Akrun gave a more direct solution in the comments: first strip the double quotes from the lines read with readLines, then apply colClasses to the columns.

 df <- read.table(text=gsub('[\\"]', '', readLines('ouvrefic.txt')),
                  row.names=1, 
                  colClasses=c('character', 'integer', 'character', 'factor'))
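
For anyone who wants to try this without the original file, here is a self-contained sketch of the same idea. The temporary file and the names path and clean are assumptions made for the example, and it uses fixed = TRUE instead of a regular expression.

```r
# Hypothetical temporary file standing in for the question's data file.
path <- tempfile(fileext = ".txt")
writeLines(c('"" "dates" "Atscan2" "pqrPQR"',
             '"1" "18369" "0000000000000" "1110"',
             '"2" "18369" "0000000000000" "1220,0"'), path)

# Strip every double quote; fixed = TRUE treats '"' as a literal character.
clean <- gsub('"', '', readLines(path), fixed = TRUE)

# With the quotes gone, "integer" can be requested directly for dates.
df <- read.table(text = clean, header = TRUE, row.names = 1,
                 colClasses = c("character", "integer", "character", "factor"))

sapply(df, class)
```

Note that removing the quotes is only safe when no quoted field contains whitespace, since whitespace is the separator here.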

Answer 1: (score: 1)

You can also use NA in colClasses when row.names = 1. An NA entry leaves the type of that column to be detected automatically.

writeLines('"" "dates" "Atscan2" "pqrPQR"
"1" "18369" "0000000000000" "1110"
"2" "18369" "0000000000000" "1220,0"
"3" "18369" "0000000000000" "2220"
"4" "18369" "0000000000000" "1230,0,0"
"5" "18369" "0000000000000" "1330,0"
"6" "18369" "0000000000000" "2330,0"
"7" "18369" "0000000000000" "3330"', "x.txt")

df <- read.table("x.txt", header = TRUE, 
     row.names = 1, colClasses = c(NA, NA, "character", NA))

sapply(df, class)
#      dates     Atscan2      pqrPQR 
#  "integer" "character"    "factor" 
df
#   dates       Atscan2   pqrPQR
# 1 18369 0000000000000     1110
# 2 18369 0000000000000   1220,0
# 3 18369 0000000000000     2220
# 4 18369 0000000000000 1230,0,0
# 5 18369 0000000000000   1330,0
# 6 18369 0000000000000   2330,0
# 7 18369 0000000000000     3330
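
One caveat about the classes shown above: since R 4.0.0 the default of stringsAsFactors is FALSE, so on a current R the pqrPQR column would come back as character rather than factor. A minimal sketch that asks for factors explicitly, on a shortened copy of the data (the name txt is an assumption made for the example):

```r
# Shortened copy of the question's data.
txt <- '"" "dates" "Atscan2" "pqrPQR"
"1" "18369" "0000000000000" "1110"
"2" "18369" "0000000000000" "1220,0"
"3" "18369" "0000000000000" "2220"'

# NA entries let read.table detect the type of those columns;
# only Atscan2 is forced to character to keep its zeros.
# stringsAsFactors = TRUE restores the pre-4.0.0 default.
df <- read.table(text = txt, header = TRUE, row.names = 1,
                 colClasses = c(NA, NA, "character", NA),
                 stringsAsFactors = TRUE)

sapply(df, class)
```

Because the NA columns are read as text and converted afterwards, the quoted dates values still end up as integers.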

Also, if you are on a Linux-based system, you can use system() to strip all the quotes, which makes things easier:

read.table(
    text = system("cat x.txt | tr -d \\\"", intern = TRUE), 
    colClasses = c(Atscan2 = "character")
)
#   dates       Atscan2   pqrPQR
# 1 18369 0000000000000     1110
# 2 18369 0000000000000   1220,0
# 3 18369 0000000000000     2220
# 4 18369 0000000000000 1230,0,0
# 5 18369 0000000000000   1330,0
# 6 18369 0000000000000   2330,0
# 7 18369 0000000000000     3330