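I have several hundred large CSV files (ranging from 10k to 100k lines each), and some of them are malformed, with quotes inside quoted fields, so they can look like this: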
ID,Description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486
I need to be able to parse all of these lines cleanly as CSV in R. dput()'ing it and reading it back in...
txt <- c("ID,Description,x",
"3434,\"abc\"def\",988",
"2344,\"fred\",3484",
"2345,\"fr\"\"ed\",3485",
"2346,\"joe,fred\",3486")
read.csv(text=txt[1:4], colClasses='character')
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'text'
If we change the quote setting and leave out the last line with the embedded comma, it runs fine:
read.csv(text=txt[1:4], colClasses='character', quote='')
However, if we change the quote setting and include the last line with the embedded comma...
read.csv(text=txt[1:5], colClasses='character', quote='')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 4 elements
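EDIT x2: I should have said that, unfortunately, some of the descriptions contain commas. The code above has been edited accordingly.
Answer 0 (score: 5)
Change the quote setting. Note that this only works when no description contains a comma; the output below shows a last row whose description is "joe" rather than "joe,fred":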
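One way to see why quote='' stops working once a description contains a comma is to count the comma-separated fields on each line with count.fields() from base R. This is only a diagnostic sketch; n_fields and bad are illustrative names that do not appear in the original question.

```r
txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# Count fields per line with quote handling switched off,
# so every comma is treated as a separator.
con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)

# Lines whose field count differs from the header's.
# Here only line 5 (the "joe,fred" row) has an extra field.
bad <- which(n_fields != n_fields[1])
print(n_fields)
print(bad)
```

Running the same check over each file could give a quick idea of how many rows are affected before choosing a parsing strategy.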
read.csv(text=txt, colClasses='character',quote = "")
ID Description x
1 3434 "abc"def" 988
2 2344 "fred" 3484
3 2345 "fr""ed" 3485
4 2346 "joe" 3486
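With quote = "" the parser no longer interprets quotes at all, so the Description values in the output above still carry their surrounding quotes. A minimal sketch of removing them afterwards, assuming every description starts and ends with a double quote and contains no comma:

```r
txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485")

# Read with quote processing disabled, as in the answer above.
df <- read.csv(text = txt, colClasses = "character", quote = "")

# Strip one leading and one trailing double quote from each description;
# quotes inside the value are left untouched.
df$Description <- sub('^"(.*)"$', "\\1", df$Description)
print(df)
```

Values that are not wrapped in quotes are returned unchanged, because the pattern only matches when both outer quotes are present.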
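If the descriptions can contain commas, split every line on commas and paste the middle pieces back together:
txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")
txt2 <- readLines(textConnection(txt))
# split on every comma, including the ones inside the description
txt2 <- strsplit(txt2, ",")
# keep the first and last pieces and re-join everything in between
txt2 <- lapply(txt2, function(x) c(x[1], paste(x[2:(length(x)-1)], collapse=","), x[length(x)]))
m <- do.call("rbind", txt2)
df <- as.data.frame(m, stringsAsFactors = FALSE)
# the first row holds the header
names(df) <- unlist(df[1,])
df <- df[-1,]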
# ID Description x
# 2 3434 "abc"def" 988
# 3 2344 "fred" 3484
# 4 2345 "fr""ed" 3485
# 5 2346 "joe,fred" 3486
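I don't know whether this is efficient enough for your use case.
Answer 1 (score: 2)
Since there is only one quoted column in this set of nasty files, I can run read.csv() on each side of it to handle the unquoted columns to the left and right of the quoted column. My current solution is therefore based on input from @agstudy and @roland: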
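The split-and-rejoin code above assumes exactly one column on each side of the description. A sketch of the same idea with the column counts as parameters might look like the following; split_around_text_column, n_left and n_right are hypothetical names introduced here for illustration. One caveat: strsplit() drops a trailing empty string, so a row whose last field is empty would need separate handling.

```r
# Hypothetical helper: split each line on every comma, then glue the middle
# pieces back into one field. n_left and n_right are the numbers of unquoted
# columns to the left and right of the free-text column.
split_around_text_column <- function(lines, n_left = 1, n_right = 1) {
  parts <- strsplit(lines, ",", fixed = TRUE)
  rows <- lapply(parts, function(p) {
    n <- length(p)
    stopifnot(n >= n_left + n_right + 1)
    middle <- paste(p[(n_left + 1):(n - n_right)], collapse = ",")
    c(p[seq_len(n_left)], middle, p[seq(n - n_right + 1, length.out = n_right)])
  })
  m <- do.call(rbind, rows)
  # the first line is the header; the remaining lines are data
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  rownames(df) <- NULL
  df
}

txt4 <- c("ID,Description,x,y",
          "3434,\"abc\"def\",988,344",
          "2344,\"fred\",3484,3434",
          "2345,\"fr\"\"ed\",3485,7347",
          "2346,\"joe,fred\",3486,484")
res <- split_around_text_column(txt4, n_left = 1, n_right = 2)
print(res)
```

As in the code above, the surrounding quotes stay in the Description values; all columns come back as character.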
csv.parser <- function(txt) {
  df <- do.call('rbind', regmatches(txt, gregexpr(',"|",', txt), invert=TRUE))
  # remove the header
  df <- df[-1,]
  # parse the left csv
  df1 <- read.csv(text=df[,1], colClasses='character', comment.char='', header=FALSE)
  # parse the right csv
  df3 <- read.csv(text=df[,3], colClasses='character', comment.char='', header=FALSE)
  # put them back together
  dfa <- cbind(df1, df[,2], df3)
  # put the header back in
  names(dfa) <- names(read.csv(text=txt[1], header=TRUE))
  dfa
}
# debug(csv.parser)
csv.parser(txt)
So, fingers crossed, running it on a broader data set:
txt <- c("ID,Description,x,y",
"3434,\"abc\"def\",988,344",
"2344,\"fred\",3484,3434",
"2345,\"fr\"\"ed\",3485,7347",
"2346,\"joe,fred\",3486,484")
csv.parser(txt)
ID Description x y
1 3434 abc"def 988 344
2 2344 fred 3484 3434
3 2345 fr""ed 3485 7347
4 2346 joe,fred 3486 484
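In the output above the third description still reads fr""ed. If the doubled quote in the source files is meant as the usual CSV escape for a single quote (an assumption; the question does not say so), it can be collapsed in a follow-up step. The vector desc below simply repeats the Description values shown above.

```r
# Description values as shown in the csv.parser() output above
desc <- c('abc"def', 'fred', 'fr""ed', 'joe,fred')

# Collapse each doubled quote into a single literal quote.
desc_clean <- gsub('""', '"', desc, fixed = TRUE)
print(desc_clean)
```

Using fixed = TRUE keeps the replacement a plain string match, so no regex escaping is involved.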
Answer 2 (score: 1)
You can use readLines and then use regmatches to extract the elements between ," and ",
ll <- readLines(textConnection(object='ID,Description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486'))
ll<- ll[-1] ## remove the header
ll <- regmatches(ll,gregexpr(',"|",',ll),invert=TRUE)
do.call(rbind,ll)
     [,1]   [,2]       [,3]
[1,] "3434" "abc\"def" "988"
[2,] "2344" "fred"     "3484"
[3,] "2345" "fr\"\"ed" "3485"
[4,] "2346" "joe,fred" "3486"
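Splitting on ," and ", relies on those two character pairs never occurring inside a description. A small sanity check, sketched here with a made-up third row that breaks the assumption, is to count the pieces per line and flag anything that does not yield exactly three; lines, pieces, n_pieces and suspect are illustrative names.

```r
lines <- c('2346,"joe,fred",3486',
           '3434,"abc"def",988',
           '9999,"he said ",hi",1')   # made-up row that defeats the pattern

# Same split as in the answer above: everything between the matches is kept.
pieces <- regmatches(lines, gregexpr(',"|",', lines), invert = TRUE)

# Each well-behaved line should produce exactly 3 pieces.
n_pieces <- lengths(pieces)
suspect <- lines[n_pieces != 3]
print(n_pieces)
print(suspect)
```

Rows flagged this way could be set aside for manual inspection before calling do.call(rbind, ...), which would otherwise recycle values silently or warn when the piece counts differ.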