Question

I have a csv file generated from MS SQL Server that I am trying to read into R. It contains data like the following:

# reproduce file
possibilities <- c('this is good','"this has, a comma"','here is a " quotation','')
newstrings <- expand.grid(possibilities,possibilities,possibilities,stringsAsFactors = F)
xwrite <- apply(newstrings,1,paste,collapse = ",")
xwrite <- c('v1,v2,v3',xwrite)
writeLines(xwrite,con = 'test.csv')

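I would normally open it in Excel, which reads it and writes it back out in a cleaner format for R, but this file exceeds Excel's row limit. If I can't solve this, I will have to go back and export the data in another format. I have tried many variations of what I have read about.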

# a few things I've tried
(rl <- readLines('test.csv'))
read.csv('test.csv',header = T,quote = "",stringsAsFactors = F)
read.csv('test.csv',header = F,quote = "",stringsAsFactors = F,skip = 1)
read.csv('test.csv',header = T,stringsAsFactors = F)
read.csv('test.csv',header = F,stringsAsFactors = F,skip = 1)
read.table('test.csv',header = F)
read.table('test.csv',header = F,quote = "\"")
read.table('test.csv',header = T,sep = ",")
scan('test.csv',what = 'character')
scan('test.csv',what = 'character',sep = ",")
scan('test.csv',what = 'character',sep = ",",quote = "")
scan('test.csv',what = 'character',sep = ",",quote = "\"")

unlist(strsplit(rl,split = ','))

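The following seems to work for the data I have, but I am uneasy about reusing it: it does not work on the sixth line, which illustrates data that could occur in another file.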

# works if only comma OR unpaired quotation but not both
rl[grep('^[^\"]*\"[^\"]*$',rl)] <- sub('^([^\"]*)(\")([^\"]*)$','\\1\\3',rl[grep('^[^\"]*\"[^\"]*$',rl)])
writeLines(rl,'testfixed.csv')
read.csv('testfixed.csv')

I found a similar problem, but in my case the quotes are isolated occurrences in the data rather than a consistent formatting issue.

Is it possible to get a correct data.frame out of this?

Answer 1

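I don't think there is a direct way to do this. Here I basically use strsplit with a comma as the separator. But first I handle the special separators ," and ", (a comma followed by a quote, or a quote followed by a comma).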

lines <- readLines('test.csv')
## separate the quotation case
lines_spe <- strsplit(lines,',\"|\",')
nn <- sapply(lines_spe,length)==1
## the normal case
lines[nn] <- strsplit(lines[nn],',',perl=TRUE)
## aggregate the results
lines[!nn] <- lines_spe[!nn]
## bind to create a data.frame
dat <-
setNames(as.data.frame(do.call(rbind,lines[-1]),stringsAsFactors =F),
         lines[[1]])
## treat the special case of strsplit('some text without second part,',',')
dat[dat$v1==dat$v2,"v2"] <- ""
dat
#                         v1                      v2
# 1             this is good            this is fine
# 2       this has no commas      this has, a comma"
# 3   this has no quotations  this has a " quotation
# 4 this field has something                        
# 5                          now the other side does
# 6       "this has, a comma  this has a " quotation
# 7         and a final line     that should be fine

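The result is almost right, except when there is no second part: strsplit does not return the empty second field. In your data this happens with 'this field has something,'. Here is an example that shows the problem: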

> strsplit('aaa,',',')
[[1]]
[1] "aaa"

> strsplit(',aaa',',')
[[1]]
[1] ""    "aaa"
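
One possible way around the dropped trailing field, as a minimal sketch: append one extra separator before splitting, so that the separator at the very end of the input is the added one and the real empty field is kept. The helper name split_keep_trailing is hypothetical and is not part of base R.

```r
# Hypothetical helper: split on a separator while keeping a trailing empty field.
# strsplit() does not return an empty string after a match at the very end of
# the input, so we append one extra separator and let that one be swallowed.
split_keep_trailing <- function(x, sep = ",") {
  strsplit(paste0(x, sep), sep, fixed = TRUE)
}

split_keep_trailing("aaa,")[[1]]     # two fields: "aaa" and ""
split_keep_trailing(",aaa")[[1]]     # two fields: "" and "aaa"
split_keep_trailing("aaa,bbb")[[1]]  # two fields: "aaa" and "bbb"
```

If something like this replaced the plain strsplit call for the normal case, the dat$v1==dat$v2 patch would probably no longer be needed for those rows, but I have not checked that against your full file.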

Answer 2

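This gets closer and may be good enough. It will fail if a lone quote sits right next to a comma, because I assume that quotes next to a comma are the start or end of a string that genuinely needs quoting.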

rl <- readLines('test.csv')
rl <- gsub('([^,])(\")([^,])','\\1\\3',rl,perl = T)
writeLines(rl,'testfixed.csv')
read.csv('testfixed.csv')
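
To see what the substitution does without touching any files, here is a small sketch on made-up lines. The sample data and the name fix_lone_quotes are assumptions for illustration only. A quote is removed only when the characters on both sides of it are not commas, so quotes that open or close a field are left alone.

```r
# Hypothetical wrapper around the same regular expression as above.
fix_lone_quotes <- function(x) {
  gsub('([^,])(")([^,])', '\\1\\3', x, perl = TRUE)
}

# Made-up sample: one properly quoted field and one field with a lone quote.
raw <- c('v1,v2',
         'this is good,"this has, a comma"',
         'here is a " quotation,this is good')

fixed <- fix_lone_quotes(raw)
fixed[3]  # the lone quote is gone, the surrounding spaces remain

# Parse the repaired lines directly from memory.
dat <- read.csv(text = fixed, stringsAsFactors = FALSE)
dat$v2[1]  # the quoted field keeps its embedded comma
```

Note that the quote is dropped rather than escaped, so the cleaned value is not identical to the original text.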

Read csv with paired and unpaired quotes
