I have a .csv file in which every field is text. However, some of the text fields contain unescaped double quote characters, for example:
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
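Lines 1 and 2 are fine, but line 3 is not read in correctly. At the moment I am going through the file by hand in Notepad++ to try to remove these quotes. Ideally I would like R to handle this, but I think the unescaped nature of the unmatched double quotes makes that an unreasonable expectation.
In Notepad++ I have tried to build a regular expression that identifies double quotes that are not next to a comma. The logic is that a valid double quote occurs at the start or end of a field and is therefore adjacent to a comma. That might identify the majority of my cases, which I could then deal with.
Suffice to say I have about 3.4 million records, and about 0.1% of them appear to be problematic.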
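The comma-adjacency rule described above can also be checked outside Notepad++. The following is only a minimal sketch in Python (not R, and not part of the original workflow) that flags double quotes which are neither at the start or end of a line nor next to a comma. The names stray_quote_positions and SAMPLE are made up for illustration, and the check looks at one physical line at a time, so it assumes fields contain no embedded line breaks.

```python
def stray_quote_positions(line):
    """Return the indexes of double quotes that do not sit on a field boundary."""
    last = len(line) - 1
    stray = []
    for i, ch in enumerate(line):
        if ch != '"':
            continue
        # A valid opening quote is at the start of the line or right after a comma.
        opens_field = i == 0 or line[i - 1] == ","
        # A valid closing quote is at the end of the line or right before a comma.
        closes_field = i == last or line[i + 1] == ","
        if not (opens_field or closes_field):
            stray.append(i)
    return stray


# The four sample lines from the question (triple single quotes avoid escaping).
SAMPLE = '''"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
'''.splitlines()

for number, line in enumerate(SAMPLE, start=1):
    hits = stray_quote_positions(line)
    if hits:
        print(f"line {number}: suspicious quote(s) at {hits}")
```

Only listing the suspicious lines, rather than editing them, keeps this safe to run on the full file before deciding how to repair the roughly 0.1% of records that are affected.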
Edit: fread from data.table was suggested as an alternative, but I had even less success with fread:
1: In fread(paste(infilename, "1", ".csv", sep = "")) :
Stopped early on line 21. Expected 18 fields but found 9. Consider fill=TRUE and comment.char=. First discarded non-empty line
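None of the suggested options worked. I think this is because the "Text" field can also contain CRLF characters. read.csv seems to simply ignore these (good), whereas fread fails on them. Sorry that I cannot provide the actual text, but here are some more comprehensive test data, with both unmatched double quotes (a problem for read.csv) and CRLFs (a problem for fread).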
"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
Help with a regular expression in Notepad++ would be great.
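Because both problems are present at once, one possible pre-processing step is to glue physical lines back into logical records before fixing the quotes. This is only a sketch in Python, resting on an assumption that is true of the test data above but may not hold for the real file: every record ends with a double quote as the last character of a physical line, and no embedded line break falls directly after a double quote. logical_records and DATA are illustrative names.

```python
def logical_records(physical_lines):
    """Yield records, joining physical lines until one ends with a double quote."""
    pending = []
    for line in physical_lines:
        pending.append(line.rstrip("\r\n"))
        # ASSUMPTION: a record is complete once a physical line ends with a quote.
        if pending[-1].endswith('"'):
            yield "\n".join(pending)
            pending = []
    if pending:  # trailing fragment that never saw a closing quote
        yield "\n".join(pending)


# The fuller test data from the question; record 3 spans three physical lines.
DATA = '''"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
'''

records = list(logical_records(DATA.splitlines()))
print(len(records), "logical records from", len(DATA.splitlines()), "physical lines")
```

A record whose text happens to break right after a double quote would be split too early, so the record count should be compared with the expected number of IDs before trusting the result.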
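Answer 0 (score: 2)
One option might be to use a conditional replacement in Notepad++.
You can first match every properly quoted field: a string that starts with a double quote preceded by a comma or by the start of the line.
Then match up to the next double quote that is followed by a comma or the end of the line. Those fields are fine and are left alone, so the other branch of the alternation captures what should be replaced: a double quote with no comma directly before or after it.
Find what: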
(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)
Replace with:
A conditional replacement: if group 1 matched, replace with nothing; otherwise replace with the whole match.
(?{1}:$0)
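For anyone who wants to try the find/replace pair above outside Notepad++, here is a rough Python equivalent. It is a sketch under assumptions: Python's re module has no (?{1}:$0) replacement syntax, so a callback stands in for the conditional; re.MULTILINE is needed so that ^ and $ work per line; and the input is assumed to use plain \n line endings (with \r\n, the $ lookahead would not match right after a closing quote). The names PATTERN, drop_stray_quotes and DATA are illustrative only.

```python
import re

# Same pattern as in "Find what", compiled so ^ and $ anchor at every line.
PATTERN = re.compile(r'(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)', re.MULTILINE)


def drop_stray_quotes(text):
    # Group 1 is set only by the second branch (a stray quote): replace with nothing.
    # Otherwise the first branch matched a well-formed field: put the match back.
    return PATTERN.sub(lambda m: "" if m.group(1) else m.group(0), text)


# The fuller test data from the question, with \n line endings.
DATA = '''"ID","Text","Optional text","Date"
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"
'''

print(drop_stray_quotes(DATA))
```

Note that this removes the stray quotes rather than escaping them, which matches the replacement shown above; the multi-line record is left untouched because its quotes all sit next to a comma.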
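Explanation
(?:^|,)
Match a comma or assert the start of the line
"[^"\n]*"
Match a double-quoted run with no double quote or newline in between
(?=$|,)
Assert that what is on the right is the end of the line or a comma
|
Or
(?<!,)(")(?!,)
Capture a double quote in group 1 while asserting that what is directly on the left and on the right is not a comma
Answer 1 (score: 1)
It seems to work well with data.table::fread: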
fread("E:/temp/test.txt")
# ID Text Optional text "Date"
#1: 1 Today is going to be a good day 2013-02-03
#2: 2 And I am inspired by the quote "every dog must have it's day" Hi 2013-01-01
#3: 3 Did not the bard say "All the World's a stage" this quote is so true Terrible 2013-05-05
#Warning message:
#In fread("E:/temp/test.txt") :
# Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.