Parsing a CSV with irregular quoting rules using readr

Time: 2019-02-11 20:06:37

Tags: r regex tidyverse readr

I have a strange CSV that I can't parse with readr. Let's call it data.csv. It looks like this:

name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451

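If every row looked like the first row below the header (two character columns followed by an integer column), this would be easy to parse with read_csv: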

df <- read_csv("data.csv")

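However, some rows are formatted like the second row, where the second column ("info") contains a string that is partly wrapped in double quotes and partly not. As a result, read_csv does not treat the comma after the word cool as a delimiter, and the entire following row gets appended to the offending cell.

One solution to this kind of problem is to pass FALSE to the escape_double argument of read_delim(), like so: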

df <- read_delim("data.csv", delim = ",", escape_double = FALSE)

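This works for the second row but fails on the third, where the second column is a double-quoted string that itself contains nested double quotes and a comma.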
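A minimal base R sketch of why the third row is the hard case: if every comma is treated as a delimiter, the first two rows split into three fields, but the third splits into four because of the comma inside the quoted text. The variable names `rows` and `n_fields` are made up for illustration.

```r
# The sample data rows from data.csv (header omitted)
rows <- c(
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
)

# Count the fields produced when every comma is treated as a delimiter
n_fields <- lengths(strsplit(rows, ",", fixed = TRUE))
print(n_fields)
```

So ignoring quotes entirely handles the second row but over-splits the third, while honoring quotes strictly breaks on the second.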

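I have read through the readr documentation, but I have not found a solution that parses both kinds of row.

2 answers:

Answer 0: (score: 0)

This worked for me with the example you gave.

Use read.csv instead of read_csv. This means I am working with a data frame rather than a tibble.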

library(dplyr)    # %>%, as_tibble(), glimpse()
library(stringr)  # str_replace_all()
library(tidyr)    # separate()

# Read the CSV. I just turned the example table into a CSV,
# which resulted in a file with a single column.
a <- read.csv(file = "Book1.csv", header = TRUE)

# Replace the comma (plus space) in the third(!) row with a single space
a[, 1] <- str_replace_all(as.vector(a[, 1]), ", ", " ")

# Use separate() from the tidyr package to split the column into three columns,
# then convert to a tibble
a <- a %>%
  separate(name.info.amount_spent, c("name", "info", "spent"), ",") %>%
  as_tibble()
glimpse(a)
 $name  <chr> "John Doe", "Jane Doe", "Senator Sally Doe"
 $info  <chr> "Is a good guy", "\"Jan Doe\" is cool", "\"Sally \"Sal\" Doe is from New York NY\""
 $spent <chr> "5412030", "3159", "4451"

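Answer 1: (score: 0)

You can use a regular expression that splits only on the relevant commas, using (*SKIP)(*FAIL) to skip over double-quoted segments: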

input <- c('John Doe,Is a good guy,5412030', 'Jane Doe,"Jan Doe" is cool,3159',
           'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')

lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name","info","amount_spent")))

This yields

               name                                   info amount_spent
1          John Doe                          Is a good guy      5412030
2          Jane Doe                      "Jan Doe" is cool         3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY"         4451

See a demo of the expression on regex101.com.
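A possible way to apply the same split to the file itself, sketched in base R. This is an illustration rather than part of the original answer: it writes the question's sample rows to a temporary file so that it is self-contained, and the names `path`, `parts`, and `body` are hypothetical. It also assumes every line splits into exactly three fields, which holds for the sample but may not hold for other quoting patterns.

```r
# Write the sample rows to a temporary file (stand-in for data.csv)
path <- tempfile(fileext = ".csv")
writeLines(c(
  'name,info,amount_spent',
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
), path)

# Split each line on commas that are not inside a "..." segment
parts <- strsplit(readLines(path), '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

# Assumption: every line yields three fields; stop early if not
stopifnot(all(lengths(parts) == 3))

# First line is the header, the rest are data rows
body <- do.call(rbind, parts[-1])
df <- setNames(as.data.frame(body, stringsAsFactors = FALSE), parts[[1]])

# Everything comes back as character, so convert the numeric column
df$amount_spent <- as.integer(df$amount_spent)
str(df)
```

The double quotes stay in the info column as literal characters; strip them afterwards with gsub() if they are not wanted.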