scan() fails on a field containing a backslash

Date: 2012-11-20 01:45:21

Tags: r

I am running the following code and getting this error.

> rat <- scan("sortedratings.csv",nlines=760,sep=",",what=rat.cols,multi.line=FALSE);                                                       
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :                                                                         
  line 755 did not have 8 elements                                                                                                                    
>    

Here are the lines around the one causing all the trouble:

ubuntu@ip-10-28-6-239:/data/csv$ sed -n "750,760p" sortedratings.csv                                                                                  
"281656475","2.5.0","Jul 17, 2011","","","KK9876",4,0                                                                                                 
"281656475","2.5.0","Jul 17, 2011","","","Lyteskin45",4,0                                                                                             
"281656475","2.5.0","Jul 17, 2011","","","Mrs. Felton",5,0                                                                                            
"281656475","2.5.0","Jul 17, 2011","","","Nick Bartoszek",4,0                                                                                         
"281656475","2.5.0","Jul 17,2011","","","SANFRANPSYCHO",5,0                                                                                          
"281656475","2.5.0","Jul 17, 2011","","","Wxcgfduytrewjgf@!?$(:@&amp;&amp;$&amp;@\"",5,0                                                              
"281656475","2.5.0","Jul 18, 2011","","","Downs58",5,0                                                                                                
"281656475","2.5.0","Jul 18, 2011","","","kitty1019",5,0                                                                                              
"281656475","2.5.0","Jul 18, 2011","","","Rj&amp;e",4,0                                                                                               
"281656475","2.5.0","Jul 18, 2011","","","Robin Kinzer",5,0                                                                                           
"281656475","2.5.0","Jul 18, 2011","","","Roderick Palmer",5,0                                                                                        
ubuntu@ip-10-28-6-239:/data/csv$ s

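I have tried several fixes, but I can't find one that works. Any ideas?

When I delete the backslash (without putting anything in its place), the file reads without a problem.

Oh, I forgot to add: the file is 1.4GB, so I can't read the whole thing into memory or just run sed over it, because it is too big for my system.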

1 answer:

Answer 0 (score: 4)

From the "Details" section of ?scan (which is used by read.table, read.csv, etc.):

 If ‘sep’ is non-default, the fields may be quoted in the style of
 ‘.csv’ files where separators inside quotes (‘''’ or ‘""’) are
 ignored and quotes may be put inside strings by doubling them.
 However, if ‘sep = "\n"’ it is assumed by default that one wants
 to read entire lines verbatim.

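So it looks like your problem is the "escaped" quote \" in that line: R expects an escaped quote in a CSV file to be a doubled quote "", not a backslash-escaped quote \". As far as I can tell, scan treats the backslash as an ordinary character, so the "" that follows it is read as a doubled quote and the field does not end where you expect, which is why line 755 comes up short of 8 elements.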
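To see the doubling rule at work, here is a minimal sketch on a made-up three-field record (the values are illustrative, not taken from the question's file). It assumes an R version whose scan accepts the `text` argument:

```r
# Hypothetical record: the second field ends with a literal quote,
# written CSV-style as a doubled quote ("").
line_ok <- '"281656475","name""",5'

fields <- scan(text = line_ok, sep = ",",
               what = list("", "", 0), quiet = TRUE)

# The doubled quote comes back as a single literal quote character.
print(fields[[2]])
```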

I think your best bet is to replace the backslash-escaped quotes with doubled quotes, either with Linux tools or in R (R example below):

txt <- readLines("tmp.txt")
# The regex is \\" (a literal backslash followed by a quote); each backslash
# has to be doubled again inside the R string literal, hence '\\\\"'.
txt <- gsub('\\\\"', '""', txt)
# if you `cat(txt, sep='\n')` you will see that the `\"` is now `""`

You can then use read.csv or scan as before (note that textConnection(txt) turns the character vector into a file-like object for scan to use):

read.csv(textConnection(txt), ...)
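Putting the replacement and the read together, a rough end-to-end sketch might look like the following. The two sample lines are invented in the question's 8-column layout, `header = FALSE` is an assumption about the file, and `fixed = TRUE` is used here as an alternative to the regex form so the pattern needs less escaping:

```r
# Two made-up lines; the second contains the backslash-escaped quote
# that trips up scan().
raw <- c(
  '"281656475","2.5.0","Jul 17, 2011","","","SomeUser",5,0',
  '"281656475","2.5.0","Jul 17, 2011","","","Odd@\\"",5,0'
)

# Replace every backslash-quote pair with a doubled quote.
clean <- gsub('\\"', '""', raw, fixed = TRUE)

# Read the cleaned lines through a text connection, then close it.
con <- textConnection(clean)
df <- read.csv(con, header = FALSE, stringsAsFactors = FALSE)
close(con)

print(dim(df))    # rows and columns that were parsed
print(df[2, 6])   # the user-name field of the repaired line
```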

Edit/addition

Re the OP's comment: the file is 1.4GB and hard to read into R in one go, so how do you sanitize it?

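Option 1

You appear to be on Linux, so you can use sed, which processes the file line by line rather than loading it all into memory: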

sed -i 's!\\"!""!g' myfile.txt

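(Depending on where your data comes from, you may be able to adjust the program that produces it so that it writes the format you need in the first place, but that isn't always possible.)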

Option 2

If you'd rather not use Linux, or want an R-only solution, use the n argument of readLines to read just a few lines at a time:

# create the file connection and open it for reading, see ?file
f <- file('tmp.txt', open = 'r')

# now read in 100 lines at a time, say
repeat {
    txt <- readLines(f, n = 100)
    if (length(txt) == 0) break  # nothing left to read
    # now do the sanitizing/coercing into a data frame, store.
    # ...
}
close(f)
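One way to fill in the loop body for Option 2 is to write each sanitized chunk straight to a second file, so that only one chunk is in memory at a time. The sketch below is only an illustration: `sanitize_file`, its arguments, and the default chunk size are hypothetical names and values, not base R functions or anything from the question.

```r
# Hypothetical helper (not part of base R): stream `infile` to `outfile`,
# replacing backslash-escaped quotes with doubled quotes chunk by chunk.
sanitize_file <- function(infile, outfile, chunk_size = 10000) {
  fin <- file(infile, open = "r")
  on.exit(close(fin), add = TRUE)
  fout <- file(outfile, open = "w")
  on.exit(close(fout), add = TRUE)

  repeat {
    txt <- readLines(fin, n = chunk_size)
    if (length(txt) == 0) break            # end of input
    writeLines(gsub('\\"', '""', txt, fixed = TRUE), fout)
  }
  invisible(outfile)
}

# Small self-contained demonstration on temporary files.
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
writeLines(c('"a","x\\"",1', '"b","y",2'), infile)

sanitize_file(infile, outfile, chunk_size = 1)
cleaned <- readLines(outfile)
print(cleaned)
```

The cleaned copy can then be handed to scan or read.csv in the usual way. Writing to a separate file rather than overwriting the input keeps the original intact in case the replacement turns out to be too aggressive.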