R: .csv file not read in correctly because of double quotes in the text

Time: 2019-06-03 10:18:12

Tags: r regex notepad++

I have a .csv file in which every field is text. However, some of the text fields contain unescaped double-quote characters, for example:

"ID","Text","Optional text","Date" 
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"

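Rows 1 and 2 are fine, but row 3 is not read in correctly. At the moment I am going through the file by hand in Notepad++ to try to remove such quotes. Ideally I would like R to handle this, but I think the unescaped nature of the unmatched double quotes makes that an unreasonable expectation.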

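In Notepad++ I have been trying to build a regex that identifies double quotes that are not next to a comma. The logic is that a valid double quote sits at the start or end of a field, which is indicated by an adjacent comma. This should catch most of my problem cases, which I could then deal with.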

Suffice it to say that I have about 3.4 million records, and about 0.1% of them appear to be problematic.

Edit: fread from data.table was suggested as an alternative, but I have even less success with fread:

1: In fread(paste(infilename, "1", ".csv", sep = "")) :
  Stopped early on line 21. Expected 18 fields but found 9. Consider fill=TRUE and comment.char=. First discarded non-empty line

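None of the suggested options worked. I think this is because the "Text" fields can also contain CRLF characters. read.csv seems to simply ignore these (good), whereas fread takes exception to them. Sorry that I cannot provide the actual text, but here is some more comprehensive test data that has both unmatched double quotes (a problem for read.csv) and CRLFs (a problem for fread).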

"ID","Text","Optional text","Date" 
"1","Today is going to be a good day","","2013-02-03"
"2","And I am inspired by the quote "every dog must have it's day"","Hi","2013-01-01"
"3","An issue with this line is that it contains a CRLF here 
which is not usual.","Again an unusual CRLF
is present in these data","2013-02-02"
"4","Did not the bard say All the World's a stage" this quote is so true","Terrible","2013-05-05"

Help with a regex in Notepad++ would be great.

2 answers:

Answer 0: (score: 2)

Perhaps one option is to use a conditional replacement in Notepad++.

You can first match every quoted string whose opening double quote is preceded by a comma or by the start of the line.

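Then match from that double quote up to the next double quote that is followed by a comma or the end of the line. Those fields are fine and should be left alone. For the other side of the alternation, the part you want to capture and replace, match a double quote that is not next to a comma.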

Find what:

(?:^|,)"[^"\n]*"(?=$|,)|(?<!,)(")(?!,)

Replace with:

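This is a conditional replacement. If group 1 took part in the match, replace with nothing; otherwise replace with the whole match.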

(?{1}:$0)


Explanation

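  • (?:^|,) Match either the start of the line or a comma
  • "[^"\n]*" Match a double quote, then any characters other than a double quote or a newline, then the closing double quote
  • (?=$|,) Assert that what is on the right is either the end of the line or a comma
  • | Or
  • (?<!,)(")(?!,) Capture a double quote in group 1 while asserting that neither the character on its left nor the one on its right is a comma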
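
For anyone who would rather do the clean-up in R itself, the lookaround idea above can be sketched with gsub(..., perl = TRUE). This is only a sketch under assumptions: the sample lines are copied from the question, the names raw_lines, stray_quote and cleaned are made up for illustration, and the conditional replacement (which is Notepad++ replace syntax) is swapped for a pattern that matches only the stray quotes, that is, quotes that are not at the start or end of a line and not next to a comma. A quote inside the text that happens to sit next to a comma would still be missed.

```r
# Sample lines from the question, as readLines() would return them.
raw_lines <- c(
  '"ID","Text","Optional text","Date"',
  '"1","Today is going to be a good day","","2013-02-03"',
  '"2","And I am inspired by the quote "every dog must have it\'s day"","Hi","2013-01-01"',
  '"3","Did not the bard say All the World\'s a stage" this quote is so true","Terrible","2013-05-05"'
)

# A double quote that is NOT at the start of the line, NOT preceded by a comma,
# NOT followed by a comma, and NOT at the end of the line (trailing blanks allowed).
stray_quote <- '(?<!^)(?<!,)"(?!,)(?!\\s*$)'

# Drop the stray quotes; field-delimiting quotes are left untouched.
cleaned <- gsub(stray_quote, "", raw_lines, perl = TRUE)

# The cleaned lines can then be parsed as ordinary CSV.
df <- read.csv(text = cleaned, colClasses = "character")
print(df$Text)
```

Removing the quotes loses the quotation marks in the text. If they matter, replacing each match with a doubled quote ('""') instead of an empty string should keep them in a form that read.csv understands.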

Answer 1: (score: 1)

It seems to work well with data.table::fread:

fread("E:/temp/test.txt")
#   ID                                                                 Text Optional text     "Date"
#1:  1                                      Today is going to be a good day               2013-02-03
#2:  2        And I am inspired by the quote "every dog must have it's day"            Hi 2013-01-01
#3:  3 Did not the bard say "All the World's a stage" this quote is so true      Terrible 2013-05-05
#Warning message:
#In fread("E:/temp/test.txt") :
#  Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
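
The edit to the question reports that fread stops early once fields contain CRLFs. One possible pre-processing step, sketched here in base R, is to glue the physical lines back into one logical record each before parsing. It rests on a loudly stated assumption that is not from the original posts: a record is taken to be complete only when its physical line ends with a double quote. The names lines, ends_record, record_id and records are illustrative, and the sample lines are taken from the question's fuller test data with trailing spaces omitted.

```r
# Physical lines as readLines() would return them; record "3" spans three lines.
lines <- c(
  '"ID","Text","Optional text","Date"',
  '"1","Today is going to be a good day","","2013-02-03"',
  '"3","An issue with this line is that it contains a CRLF here',
  'which is not usual.","Again an unusual CRLF',
  'is present in these data","2013-02-02"',
  '"4","Did not the bard say All the World\'s a stage" this quote is so true","Terrible","2013-05-05"'
)

# Assumption: a record ends where a physical line ends with a double quote
# (optionally followed by blanks).
ends_record <- grepl('"\\s*$', lines, perl = TRUE)

# A line starts a new record when the previous line ended one.
record_id <- cumsum(c(TRUE, head(ends_record, -1)))

# Join the pieces of each record, putting a space where the line break was.
records <- unname(vapply(split(lines, record_id), paste, character(1), collapse = " "))

print(length(records))
```

A line break that falls directly after a quotation mark inside the text would split a record in the wrong place, so with 3.4 million rows it is worth checking the field count of the joined records before trusting them.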