Reading a csv in R: wrong format?

Asked: 2016-01-03 11:59:26

Tags: r csv

I have the following csv file:

"ID,""oldid"",""country"",""side_a"",""densdiff"
"10,32,""Afghanistan"",""Afghanistan"",""Various organizations"

In the exercises we were always given csv files that were "cleanly" formatted, e.g.

"ID","oldid","country" ...
"10","32","Afghanistan" ...

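I figured out that the separator is "," but sometimes it sits inside a string (as in "ID,") and sometimes the quoting around it changes (as in ""intden"",""densdiff""" at the end of the header), so I don't know how to handle the last two quotes.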

I haven't found a good site that explains how to read "mixed-csv-formatted" input into R.

Edit: here are the full header and the first row:

"ID,""oldid"",""country"",""side_a"",""side_b"",""cow"",""incompatibility"",""terr"",""begin"",""end"",""type"",""identity"",""radius"",""confarea"",""landarea"",""confland"",""rel_scope"",""distance"",""maxdist"",""mindist"",""disper"",""pop2000"",""resource"",""mountain"",""forest"",""border"",""mindisx"",""lnmndist"",""confarex"",""ln_abs_scope"",""ln_land_area"",""lnpop"",""lnconpro"",""duration"",""distx"",""location"",""mountx"",""frstx"",""lnmountx"",""lnfrstx"",""diamond"",""diadist"",""gold"",""golddist"",""oil"",""oildist"",""roadpave"",""roadtot"",""pavetot"",""paveland"",""roadland"",""disxsqr"",""mndisxsq"",""stabilit"",""rulelaw"",""nocorrup"",""lnd100km"",""pop100km"",""lnd100cr"",""pop100cr"",""landlock"",""ciffob95"",""coastden"",""intden"",""densdiff"""

Next line:

"10,32,""Afghanistan"",""Afghanistan"",""Various organizations"",700,2,"""",1978,2000,3,1,400,500,652,77,77,122,522,0,0.509999990463257,27,0,66,3,1,1,0,500,6.21460819244385,6.4800443649292,3.29583692550659,0.959037899971008,23,122,4.80402088165283,66,3,4.18965482711792,1.0986123085022,0,NA,0,NA,0,NA,2.79999995231628,21,13.3333330154419,0.429447859525681,3.22085881233215,14884,1,NA,NA,NA,0,0,0,0,1,NA,0,36,-36"

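Edit 2: After a lot of troubleshooting I simply downloaded the csv file again, and now it is clean. I will post a comment after asking my lecturer. Thanks for all the help :)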

4 answers:

Answer 0: (score: 1)

Could you try this? You need the readr package for the read_lines function.

> library(readr)              # read_lines() comes from the readr package

> x <- read_lines("data.csv") # Read the csv file with the dirty quotes

> x                           # Display contents
    [1] "\"ID,\"\"oldid\"\",\"\"country\"\",\"\"side_a\"\",\"\"densdiff\""           
    [2] "\"10,32,\"\"Afghanistan\"\",\"\"Afghanistan\"\",\"\"Various organizations\""

> x2 <- textConnection(gsub('"', "", x)) # Remove every " and wrap the result in a connection object

> x3 <- read.csv(x2, header=TRUE) # Read the conn object as you would a regular file

> x3
      ID oldid     country      side_a              densdiff
    1 10    32 Afghanistan Afghanistan Various organizations

Answer 1: (score: 1)

This csv was written so that each whole line is a single field enclosed in quotes. The inner quotes are therefore escaped by doubling them.

So it is actually a csv file generated from an already well-formed csv file (or data), with each entire row now turned into a single field.

This should probably be fixed at the source in the first place.

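To fix it after the fact, read each line and parse it as a single csv field. The content of that parsed field (which should now have all the extra quotes removed)

"ID,""oldid"",""country"",""side_a"",""densdiff"  .."
"10,32,""Afghanistan"",""Afghanistan"",""Various organizations"  .."

should then be processed again and parsed as a complete csv row.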
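
The two-pass idea described here could be sketched in base R roughly as follows. This is only a sketch under assumptions: the two sample lines are shortened, hypothetical stand-ins shaped like the full header and row from the question (a real file would be read with readLines()), and it relies on scan() un-doubling quotes inside a quoted field when sep is set explicitly.

```r
# Hypothetical sample: two shortened lines shaped like the full header/row
# in the question (each whole line is one quoted field, inner quotes doubled).
raw_lines <- c(
  '"ID,""oldid"",""country"",""side_a"",""side_b"""',
  '"10,32,""Afghanistan"",""Afghanistan"",""Various organizations"""'
)

# Pass 1: treat each line as a single quoted csv field.
# With a non-default sep, scan() reads csv-style quoting, so "" becomes ".
inner <- scan(text = raw_lines, what = character(), sep = ",",
              quote = "\"", quiet = TRUE)

# Pass 2: the unescaped content is now ordinary csv, so parse it normally.
df <- read.csv(text = inner, header = TRUE, stringsAsFactors = FALSE)
print(df)
```

Unlike stripping every quote, this keeps the inner quoting intact for the second pass, so a string field that happened to contain a comma would still be read as one field.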

Answer 2: (score: 1)

As David Arenburg said in the comments, you should try something like this:

> read.csv(text = gsub("\"", "", readLines("file.csv")))
  ID oldid     country      side_a              densdiff
1 10    32 Afghanistan Afghanistan Various organizations
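
Stripping every quote only works while no string field contains a comma. A minimal sanity check, assuming the same kind of input, might count the fields per line before parsing. The sample lines below are hypothetical shortened stand-ins for the real file, and the check itself is an addition, not part of the suggestion above.

```r
# Hypothetical sample lines shaped like the file in the question.
raw_lines <- c(
  '"ID,""oldid"",""country"",""side_a"",""side_b"""',
  '"10,32,""Afghanistan"",""Afghanistan"",""Various organizations"""'
)

# Remove every double quote, as in the one-liner above.
clean <- gsub("\"", "", raw_lines, fixed = TRUE)

# Count comma-separated fields per line; a stray comma inside a string
# would show up as a line with more fields than the header.
con <- textConnection(clean)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)
stopifnot(length(unique(n_fields)) == 1)

df <- read.csv(text = clean, stringsAsFactors = FALSE)
```

If the check fails, the quote-stripping shortcut is not safe for that file and the lines need to be parsed with their quoting preserved instead.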

Answer 3: (score: 0)

A correct CSV should look like this:

12,13,"abc","def"

The following should clean it up, provided the whole file follows the format of the sample and none of the strings contain a comma:

cat my.csv | sed 's/,"/,/' | sed 's/","/,/g' | sed 's/^"//' > mynew.csv