由于存在\符号

时间:2018-12-13 02:13:17

标签: r csv dataframe read.csv sf

对于以下描述的问题是否模棱两可,在没有可复制的示例的意义上,我深表歉意,但问题与.csv文件的格式有关。

我正在尝试使用read.csv函数从笔记本电脑中向R打开多个.csv文件。该函数中使用的当前参数为:

read.csv(filename, row.names = NULL, stringsAsFactors = F) 

row.names设置为NULL,因为之前我遇到了另一个错误(重复的row.names)。

导入其中一些时,我收到以下警告:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string

我检查了带有警告的输入所导致的数据帧,并注意到许多行(在我所描述的情况下,大约占道路总数的10%)已被NA取代。在以下使用dput()生成的ascii文本中,我显示了具有NA值的行,以及该行顶部和下方的行(链接是指向数据帧图像的外观)。

data <- structure(list(borough_17 = c("MN", "0", "MN"), block_17 = c(1602L,  NA, 1602L), lot_17 = c(57L, NA, 1L), cd_17 = c(111L, NA, 111L ), ct2010_17 = c(160.02, NA, 160.02), cb2010_17 = c(1001L, NA,  1001L), schooldist_17 = c(2L, NA, 2L), council_17 = c(4L, NA,  4L), zipcode_17
= c(10029L, NA, 10128L), firecomp_17 = c("E053",  "", "E053"), policeprct_17 = c(23, NA, 23), healthcent_17 = c(12L,  NA, 12L), healtharea_17 = c(2820L, NA, 2820L), sanitboro_17 = c(1L,  NA, 1L), sanitdistr_17 = c(11L, NA, 11L), sanitsub_17 = c("1A",  "", "1A"), address_17 = c("1392 MADISON AVENUE", "", "1150 5 AVENUE" ), zonedist1_17 = c("R7-2", "", "R10"), zonedist2_17 = c(NA,  "", NA), zonedist3_17 = c(NA_integer_, NA_integer_, NA_integer_ ), zonedist4_17
= c(NA, "", NA), overlay1_17 = c("C1-5", "",  NA), overlay2_17 = c(NA, "", NA), spdist1_17 = c(NA, "", "PI" ), spdist2_17 = c(NA, "", NA), spdist3_17 = c(NA_integer_, NA_integer_,  NA_integer_), ltdheight_17 = c(NA, NA, NA), splitzone_17 = c("N",  "", "N"), bldgclass_17 = c("C7", "", "D4"), landuse_17 = c(4L,  NA, 3L), easements_17 = c(0, NA, 0), ownertype_17 = c("P", "",  NA), ownername_17 = c("\\1392 ASSOCIATES", "", "1150 FIFTH AVE OWNERS" ), lotarea_17 = c(" LLC,4892,20488,1300,19188,0,1300,0,0,0,0,2,1,6,27,30,50,103.67,50,86,NA,3,Y,5,2,234000,1781100,0,0,1910,1988,0,NA,NA,4.19,3.44,0,6.5,1,1016020057,0,16002,996952,226284,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,303.48084987,5045.80705517,0,0\nMN,1602,40,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1Q,68 EAST 97 STREET,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C1,02,0,NA,MSMC RESIDENTIAL REAL,5046,21288,2767,18521,0,0,0,0,0,2767,2,1,6,30,30,50,100.92,50,88,NA,3,N,5,2,98100,722250,19620,144450,1920,1988,0,NA,NA,4.22,3.44,0,6.5,1,1016020040,0,16002,997361,226090,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,311.056597351,5263.55635194,0,0\nMN,1602,12,111,160.02,1001,2,4,10128,E053,23,12,2820,1,11,1A,15 EAST 96 STREET,R10,NA,NA,NA,NA,NA,PI,NA,NA,NA,N,A7,01,0,P,ESI PROPERTIES L.L.C.,3784,10899,0,10899,0,0,0,0,0,0,2,1,6,1,1,37.5,100.92,37,100,NA,3,N,5,2,99694,312592,0,0,1915,0,0,Expanded Carnegie Hill Historic District,LUCY D. DAHLGREN HOUSE,2.88,10,0,10,1,1016020012,0,16002,996835,226261,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,284.454033558,3945.34293573,0,0\nMN,1602,50,111,160.02,1000,2,4,10029,E053,23,12,2820,1,11,1A,1391 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,D7,04,0,P,MSMC RESIDENTIAL REAL,10000,44184,8673,35511,0,6000,0,0,0,2673,2,1,6,43,47,100,100,100,91,NA,2,N,3,2,293400,3357000,50792,695697,1920,0,0,NA,NA,4.42,3.44,0,6.5,1,1016020050,0,16002,997119,226224,6b,NA,108S043,10602,NA,0,NA,1,17V1,0,420.450320951,11048.1944701,0,0\nMN,1602,58,111,160.02,1001,2,4,10029,E053,23,12,2820,1,11,1A,1396 MADISON AVENUE,R7-2,NA,NA,NA,C1-5,NA,NA,NA,NA,NA,N,C7,04,0,NA,\\1392 ASSOCIATES, LLC",  "", "15138"), bldgarea_17 = c("4415", "", "163969"), comarea_17 = c(21072,  NA, 1000), resarea_17 = c(1500, NA, 162969), officearea_17 = c(19572,  NA, 1000), retailarea_17 = c(0L, NA, 0L), garagearea_17 = c(1500L,  NA, 0L), strgearea_17 = c(0L, NA, 0L), factryarea_17 = c(0L,  NA, 0L), otherarea_17 = c(0L, NA, 0L), areasource_17 = c(0L,  NA, 2L), numbldgs_17 = c("2", "", "1"), numfloors_17 = c(1, NA,  15), unitsres_17 = c("6", "", "74"), unitstotal_17 = c(28L, NA,  77L), lotfront_17 = c(32, NA, 100.92), lotdepth_17 = c(50.92,  NA, 150), bldgfront_17 = c(81.42, NA, 100), bldgdepth_17 = c(50,  NA, 142), ext_17 = c("73", "", NA), proxcode_17
= c(NA, "", "1" ), irrlotcode_17 = c("2", "", "N"), lottype_17 = c("Y", "", "3" ), bsmtcode_17 = c("3", "", "2"), assessland_17 = c(2, NA, 769500 ), assesstot_17 = c(278100L, NA, 18590400L), exemptland_17
= c(2135700,  NA, 9440), exempttot_17 = c(0L, NA, 9440L), yearbuilt_17 = c(0L,  NA, 1924L), yearalter1_17 = c(1910L, NA, 1988L), yearalter2_17 = c(0L,  NA, 0L), histdist_17 = c("0", "", "Expanded Carnegie Hill Historic District" ), landmark_17 = c(NA, "", NA), builtfar_17 = c(NA, "", "10.83" ), residfar_17 = c(4.77, NA, 10), commfar_17 = c(3.44, NA, 0), 
    facilfar_17 = c(0, NA, 10), borocode_17 = c(6.5, NA, 1), 
    bbl_17 = c(1, NA, 1016020001), condono_17 = c(1016020058L, 
    NA, 0L), tract2010_17 = c(0L, NA, 16002L), xcoord_17 = c(16002L, 
    NA, 996653L), ycoord_17 = c(996984L, NA, 226361L), zonemap_17 = c("226326", 
    "", "6b"), zmcode_17 = c("6b", "", NA), sanborn_17 = c(NA, 
    "", "108S043"), taxmap_17 = c("108S043", "", "10602"), edesignum_17 = c("10602", 
    "", NA), appbbl_17 = c(NA, "", "0"), appdate_17 = c("0", 
    "", NA), plutomapid_17 = c(NA, "", "1"), version_17 = c("1", 
    "", "17V1"), mappluto_f_17 = c("17V1", "", "0"), shape_leng_17 = c("0", 
    "", "514.691112825"), shape_area_17 = c(285.694168916, NA, 
    15970.2343838), sfha_07_17 = c("4794.40305456", "", "0"), 
    sfha_15_17 = c(0, NA, 0)), .Names = c("borough_17", "block_17",  "lot_17", "cd_17", "ct2010_17", "cb2010_17", "schooldist_17",  "council_17", "zipcode_17", "firecomp_17", "policeprct_17", "healthcent_17",  "healtharea_17", "sanitboro_17", "sanitdistr_17", "sanitsub_17",  "address_17", "zonedist1_17", "zonedist2_17", "zonedist3_17",  "zonedist4_17", "overlay1_17", "overlay2_17", "spdist1_17", "spdist2_17",  "spdist3_17", "ltdheight_17", "splitzone_17", "bldgclass_17",  "landuse_17", "easements_17", "ownertype_17", "ownername_17",  "lotarea_17", "bldgarea_17", "comarea_17", "resarea_17", "officearea_17",  "retailarea_17", "garagearea_17", "strgearea_17", "factryarea_17",  "otherarea_17", "areasource_17", "numbldgs_17", "numfloors_17",  "unitsres_17", "unitstotal_17", "lotfront_17", "lotdepth_17",  "bldgfront_17", "bldgdepth_17", "ext_17", "proxcode_17", "irrlotcode_17",  "lottype_17", "bsmtcode_17", "assessland_17", "assesstot_17",  "exemptland_17", "exempttot_17", "yearbuilt_17", "yearalter1_17",  "yearalter2_17", "histdist_17", "landmark_17", "builtfar_17",  "residfar_17", "commfar_17", "facilfar_17", "borocode_17", "bbl_17",  "condono_17", "tract2010_17", "xcoord_17", "ycoord_17", "zonemap_17",  "zmcode_17", "sanborn_17", "taxmap_17", "edesignum_17", "appbbl_17",  "appdate_17", "plutomapid_17", "version_17", "mappluto_f_17",  "shape_leng_17", "shape_area_17", "sfha_07_17", "sfha_15_17"), row.names = 13:15, class = "data.frame")

Row with NA values

我研究了我的数据框,发现“ lotarea_17”字段似乎有问题。在此字段中,第一条记录是一个包含许多值的字符串。逗号尚未分隔,许多列保留在一个列中(下图显示了数据框的样子):

A look at the row on top of the NA-filled row

我已经签入了.csv文件,并且我意识到,当在前一行的特定列中的一个\符号代表被NA填充的那个符号时,就会发生问题。可能是该符号取消了其中一个符号,导致整个行都被视为一个字符串。下面,我展示了.csv文件的两张图片,其中一张描述了问题,另一张效果很好,记录相同,但写法不同。出现问题的行的逗号分隔值似乎表示:

...,"\"1392 ASSOCIATES, LLC",4892,20488,1300,19188,...

其中一个显示出来:

"...,1392 ASSOCIATES, LLC",4892,20488,1300,19188,...

csv with the problem A csv file with the same row, that worked when importing it

因此,我的问题是,是否可以在R或Excel中(最好是在R中的read.csv函数)进行快速修复,以自动解决此问题,其中片段“ \”出现了错误与我的数据结构有关。手动修复不是一种选择,因为我们仅用一个df就讨论了200多个案例(而我在此问题上还有其他案例)。

添加-.csv文件是通过写入几个sf对象的数据帧创建的。命令行是`write.table(as.data.frame(sf_object),filename,sep =“,”,row.names =F。这些SF对象是通过空间连接MapPLUTO的shapefile(NYC的土地利用地图数据库)而创建的)和另一个shapefile(但我知道该连接没有问题,因为从那里继承的列是完美的)。

1 个答案:

答案 0 :(得分:1)

您正在尝试读取的csv文件似乎不符合官方的csv标准(RFC-4180),因此,我认为如果通过蛮力解决问题是最简单的。

我们可以使用readLines读取原始数据,gsub编辑不兼容的部分,然后使用read.csv(text=...)转换为data.frame。我不是100%肯定会在第一次尝试中就可以使用,但是这样可以更轻松地进行调试。一个小例子的工作原理:

raw <- readLines(filename)
raw <- gsub('\\\\"', '""', raw)
result <- read.csv(text=raw)