When I try to read this dataset with read.csv in R, the output has more rows than the actual dataset:
setwd("D:/yelp_dataset")
data1=read.csv("star3650000c.csv",sep=",",header=TRUE,fill=TRUE,quote="
",na.strings=c("NA","?"),dec=".",comment.char="
",stringsAsFactors=FALSE)
What should I do?
Answer 0 (score: 1)
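I think the main reason read.csv is not working is that your quote and comment characters are defined as strings containing a line break (at least as far as the things you can control are concerned; if the data itself is corrupted, you are usually out of luck). You can set them to sensible values as shown below. Note that I have set header = FALSE to make it easier to inspect the final output.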
character_with_line_break = "
"
# note that the line break is actually included in your character as "\n"
character_with_line_break
# [1] " \n"
# try read with different values for quote and comment characters
df = read.csv("yelp.csv"
,sep=","
,header=FALSE
,fill=TRUE
,quote = "\""
,na.strings=c("NA","?")
,dec=".",comment.char=""
,stringsAsFactors=FALSE)
# there is still something wrong with the last line,
# would have to investigate this further (probably missing EOL marker)
# but the final output looks good (see further down)
# Warning message:
# In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on 'yelp.csv'
dim(df)
# [1] 4 10
data.frame(lapply(df, function(x) substr(x, 1, 10)))
#   V1         V2 V3       V4 V5         V6 V7         V8 V9        V10
# 1  0 uQJ5RNygSe  2 8/4/2011  1 afEfPToTLj  5  I took my  2 uiZMpQSqJ4
# 2  1 VcGyezSNtk  4 1/4/2011  1 lGLLA08Ql4  5 Delicious!  5 uiZMpQSqJ4
# 3  2 39YKi45Pet  1 8/9/2013  0     #NAME?  5 After many  1 uiZMpQSqJ4
# 4  3 UTTTKI61dC  4 3/9/2012  1 Ly5ky2bAoJ  5  Love this 10 uiZMpQSqJ4
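To see why the quote setting can change the number of rows, here is a minimal sketch on a made-up file. The temporary file, its contents, and the names tmp, no_quote and with_quote are assumptions for illustration only, not taken from the Yelp data. It compares reading with quoting disabled against reading with the double quote as the quote character, when a text field contains an embedded line break.

```r
# Hypothetical file: a header line plus ONE record whose quoted text field
# spans two physical lines.
tmp <- tempfile(fileext = ".csv")
writeLines(c('id,text', '1,"first line', 'second line"'), tmp)

# Quoting disabled: the embedded line break is treated as a record separator,
# so the single record is split across two rows.
no_quote <- read.csv(tmp, header = TRUE, quote = "", fill = TRUE,
                     comment.char = "", stringsAsFactors = FALSE)

# Double quote as the quote character: the line break stays inside the field
# and the record is read as one row.
with_quote <- read.csv(tmp, header = TRUE, quote = "\"", fill = TRUE,
                       comment.char = "", stringsAsFactors = FALSE)

nrow(no_quote)    # more rows than there are records
nrow(with_quote)  # one row per record
```

If review text in the real file contains line breaks inside quoted fields, a similar effect could explain the extra rows, though that depends on how the file was actually written.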