Handling commas in CSV fields with sqldf

Time: 2018-06-17 02:27:56

Tags: r csv comma sqldf

I am following up on my earlier question, sqldf returns zero observations, and have provided a reproducible example.

I found that the problem probably comes from a comma inside one of the cells ("1,500+"). I think I have to use a filter, as suggested in sqldf, csv, and fields containing commas, but I am not sure how to define my filter. Here is the code:

Housing_filtered <- read.csv.sql("df_to_read.csv", sql = "select * from file", header = TRUE)

When I run this code, I get the following error:

Error in connection_import_file(conn@ptr, name, value, sep, eol, skip): RS_sqlite_import: df_to_read.csv line 2 expected 7 columns of data but found 8

2 answers:

Answer 0 (score: 0)

The problem comes from reading the column created by df$b. The first value in that column contains a comma, so the sqldf() function treats it as a separator. One way to handle this is to remove the comma or replace it with another symbol (such as a space). You can also use the read.csv2.sql function:

library(sqldf)

df <- data.frame("a" = c("8600000US01770" , "8600000US01937"),
                 "b"= c("1,500+" , "-"),
                 "c"= c("***" , "**"),
                 "d"= c("(x)" , "(x)"),
                 "e"= c("(x)" , "(x)"),
                 "f"= c("992" , "-"))

write.csv(df, 'df_to_read.csv',row.names = FALSE )


Housing_filtered <- read.csv2.sql("df_to_read.csv", sql = "select * from file", header=TRUE)
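The other fix the answer mentions, removing the comma (or swapping in another symbol) before the file is written, can be sketched in base R like this; the data frame mirrors the example above, trimmed to the two relevant columns:

```r
df <- data.frame(a = c("8600000US01770", "8600000US01937"),
                 b = c("1,500+", "-"),
                 stringsAsFactors = FALSE)

# Strip the embedded comma so it can no longer be mistaken for a field separator
df$b <- gsub(",", "", df$b, fixed = TRUE)
df$b
# [1] "1500+" "-"

write.csv(df, "df_to_read.csv", row.names = FALSE)
```

With the comma gone, every line of the resulting CSV has the same number of separators, so the column-count error disappears.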

Answer 1 (score: 0)

The best approach is to clean the file once, so you never have to worry about the same problem again. This should get you going:

Housing <- readLines("df_to_read.csv")                            # read the file

n <- 6             # number of separators expected = number of columns expected - 1

library(stringr)
ln_idx <- ifelse(str_count(Housing, pattern = ",") == n, 0 , 1)
which(ln_idx == 1)               # line indices with issue, includes the header row
#[1] 2

Inspect the specific problem and write the fix back at the same index. For example, line 2:

Housing[2]
#[1] "1,8600000US01770,1,500+,***,(x),(x),992"            # hmm.. extra comma

Housing[2] = "1,8600000US01770,1500+,***,(x),(x),992"     # removed the extra comma
writeLines(Housing, "df_to_read.csv")
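If more than one line is flagged, the manual edit above can be replaced by a small helper. This is a sketch under the assumption that, as in this file, the surplus field always comes from column b being split in two; `fix_line` is a hypothetical name, not part of sqldf or stringr:

```r
# Rejoin a line whose column b was split by an embedded comma
fix_line <- function(line, n_cols = 7) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(parts) == n_cols) return(line)       # already clean
  # Assumption: the extra field comes from column b ("1,500+" -> "1" + "500+"),
  # so glue fields 3 and 4 back together
  paste(c(parts[1:2], paste0(parts[3], parts[4]), parts[-(1:4)]), collapse = ",")
}

fix_line("1,8600000US01770,1,500+,***,(x),(x),992")
# [1] "1,8600000US01770,1500+,***,(x),(x),992"
```

Applied over the flagged indices, e.g. `Housing[ln_idx == 1] <- vapply(Housing[ln_idx == 1], fix_line, character(1))`, this reproduces the corrected line before `writeLines()`.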

Now it is business as usual, and all is well:

Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing') 

# Housing_filtered 
#               a      b   c   d   e   f
# 1 8600000US01770  1500+ *** (x) (x) 992
# 2 8600000US01937      -  ** (x) (x)   -