How do I parse a complex CSV file in R?

Time: 2019-10-17 11:32:33

Tags: r csv

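I have been struggling to parse a complex *.csv file, shown below. It has 6 lines, but the column headers, the IDs and the field values are each listed on a single line, separated by commas. Most headers contain spaces and a unit such as "(in)", and they must be used exactly as they are.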

Sample file:

    Version,1.1,1198    
    Dimension Unit,in,1000.000000
    Angle Unit,°,17.095780
    Measurement Names,Body height (in),Head height (in),Neck height (mm),Distance neck to buttock (in),Distance neck-knee (in)
    Measurement IDs,0010,0020,0030,0040,0050
    C:\Data\new.csv,796,,398,212

What I want:

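  1. Skip the first 3 lines
  2. Create 3 columns, using these as headers:
    • "Measurement Names" (taken from line 4),
    • "Measurement IDs" (taken from line 5), and
    • "C:\Data\new.csv" (the last line)
  3. Split all fields at the commas and move them into the right columns
  4. Handle some empty field values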

Parsed correctly, it should be laid out like this:

    Measurement Names             Measurement IDs  C:\Data\new.csv 
    Body height (in)              0010             796  
    Head height (in)              0020
    Neck height (in)              0030             398
    Distance neck to buttock (in) 0040             212
    Distance neck-knee (in)       0050                  

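I have tried read.csv and read.table, but the only function that reads the file successfully is readLines. Even so, it does not skip the first three lines, although that option is given. The encoding does not seem to make a difference either. Given: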

    data <- "C:\\Temp\\test.csv"
    filename <- file(data,open="r")

    Zeile <-readLines(filename,
                      skip=3,
                      #warn=FALSE,
                      encoding = 'utf-16-be'
                      )
    Zeile <- strsplit(Zeile, ",") # here I try to split 
    for (i in 1:length(Zeile)){
      print(Zeile[i])
    }
    close(filename)

The result is:

    [[1]]
    [1] "Version" "1.1"     "1198\t"  

    [[1]]
    [1] "Length Unit" "mm"          "1000.000000"

    [[1]]
    [1] "Angle Unit" "°"         "57.295780" 

    [[1]]
    [1] "Measurement Names"             "Body height (mm)"              "Head height (mm)"             
    [4] "Neck height (mm)"              "Distance neck to buttock (mm)" "Distance neck-knee (mm)"      

    [[1]]
    [1] "Measurement IDs" "0010"            "0020"            "0030"            "0040"           
    [6] "0050"           

    [[1]]
    [1] "C:\\Data\\new.csv" "796"               ""                  "398"               "212"              

    [[1]]
    character(0)

The values are shown in quotation marks, and the field values are not aligned in the correct columns.

How can I get the expected result into a data frame for further processing?
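A note on why `skip = 3` had no visible effect: `readLines()` has no `skip` parameter (its arguments are `con`, `n`, `ok`, `warn`, `encoding` and `skipNul`), so R's partial argument matching most likely bound `skip` to `skipNul` without raising an error. The quotation marks in the printed output are only how `print()` displays character strings; they are not part of the data. Below is a minimal sketch of dropping the leading lines after reading. The sample content is supplied inline through a text connection, which is an assumption made so the snippet runs without the file on disk, and the variable names are illustrative.

```r
# Sample content from the question, supplied inline (assumption: the real
# file would be read with readLines("C:/Temp/test.csv") instead).
txt <- c(
  "Version,1.1,1198",
  "Dimension Unit,in,1000.000000",
  "Angle Unit,\u00b0,17.095780",
  "Measurement Names,Body height (in),Head height (in),Neck height (mm),Distance neck to buttock (in),Distance neck-knee (in)",
  "Measurement IDs,0010,0020,0030,0040,0050",
  "C:\\Data\\new.csv,796,,398,212",
  ""  # trailing empty line, as in the output shown above
)

con <- textConnection(txt)
all_lines <- readLines(con)  # reads every line; there is no skip argument
close(con)

kept <- all_lines[-(1:3)]    # drop the first three lines after reading
kept <- kept[nzchar(kept)]   # drop empty lines such as the trailing one

# Split each remaining line at the commas
fields <- strsplit(kept, ",", fixed = TRUE)
print(lengths(fields))       # number of fields per remaining line
```

The last line has fewer fields than the other two, which is why the values cannot simply be bound together as columns without padding first.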

2 answers:

Answer 0: (score: 2)

You can try the following:

    library(tidyverse)
    read.csv("path/your_file.csv", sep = ",", skip = 3, colClasses = "character") %>% 
       gather(Measurement_Names, v, -Measurement.Names) %>% 
       spread(Measurement.Names, v)

                 Measurement_Names  C:\\Data\\new.csv Measurement IDs
    1              Body.height..mm.               796            0010
    2       Distance.neck.knee..mm.                              0050
    3 Distance.neck.to.buttock..mm.               212            0040
    4              Head.height..mm.                              0020
    5              Neck.height..mm.               398            0030
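If the header text and the original row order need to be preserved exactly, a base R alternative is sketched below. It is not the method of the answer above but a variation that treats each of the last three lines as one column. The sample content is supplied inline, and the names `txt`, `fields`, `padded`, `cols` and `result` are illustrative assumptions.

```r
# Sample content from the question, supplied inline (assumption: on disk this
# would be txt <- readLines("path/your_file.csv")).
txt <- c(
  "Version,1.1,1198",
  "Dimension Unit,in,1000.000000",
  "Angle Unit,\u00b0,17.095780",
  "Measurement Names,Body height (in),Head height (in),Neck height (mm),Distance neck to buttock (in),Distance neck-knee (in)",
  "Measurement IDs,0010,0020,0030,0040,0050",
  "C:\\Data\\new.csv,796,,398,212"
)

# 1. Skip the first three lines, then split the rest at the commas
fields <- strsplit(txt[-(1:3)], ",", fixed = TRUE)

# 2. Pad shorter lines with empty strings so all have the same length
n <- max(lengths(fields))
padded <- lapply(fields, function(x) c(x, rep("", n - length(x))))

# 3. The first field of each line is the column header, the rest are values
cols <- lapply(padded, function(x) x[-1])
names(cols) <- vapply(padded, function(x) x[1], character(1))

# check.names = FALSE keeps spaces, parentheses and backslashes in the headers
result <- data.frame(cols, check.names = FALSE, stringsAsFactors = FALSE)

# 4. Mark empty fields as missing
result[result == ""] <- NA

print(result)
```

All columns stay character, so IDs such as "0010" keep their leading zeros; the value column could be converted afterwards with `as.numeric()` if numbers are needed.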

Answer 1: (score: 1)

It is not R, but I think it may be useful. I am using Miller (https://github.com/johnkerl/miller) and csvtk (https://bioinf.shenwei.me/csvtk/).

Running

    tail -n +4 input_01.csv | mlr --nidx --fs "," cat -n  then unsparsify | csvtk transpose | tail -n +2

gives you

    Measurement Names,Measurement IDs,C:\Data\new.csv
    Body height (mm),0010,796
    Head height (mm),0020,
    Neck height (mm),0030,398
    Distance neck to buttock (mm),0040,212
    Distance neck-knee (mm),0050,

Or, for pretty-printed output, run

    tail -n +4 input_01.csv | mlr --nidx --fs "," cat -n  then unsparsify | csvtk transpose | tail -n +2 | mlr --c2p cat

to get

    Measurement Names             Measurement IDs C:\Data\new.csv
    Body height (mm)              0010            796
    Head height (mm)              0020            -
    Neck height (mm)              0030            398
    Distance neck to buttock (mm) 0040            212
    Distance neck-knee (mm)       0050            -
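For comparison, roughly the same transpose step can be sketched in base R. This is an assumed equivalent, not a translation of the commands above: it reads the last three lines without a header, transposes them, and promotes the first transposed row to column names. The inline sample and the names `txt`, `raw`, `m` and `result` are illustrative.

```r
# Sample content from the question, supplied inline (assumption: on disk the
# text = ... argument would be replaced by the file path).
txt <- c(
  "Version,1.1,1198",
  "Dimension Unit,in,1000.000000",
  "Angle Unit,\u00b0,17.095780",
  "Measurement Names,Body height (in),Head height (in),Neck height (mm),Distance neck to buttock (in),Distance neck-knee (in)",
  "Measurement IDs,0010,0020,0030,0040,0050",
  "C:\\Data\\new.csv,796,,398,212"
)

# Read the three remaining lines as plain rows; fill = TRUE pads the short one
raw <- read.csv(text = paste(txt, collapse = "\n"), header = FALSE, skip = 3,
                colClasses = "character", fill = TRUE)

# Transpose: each input line becomes a column of a character matrix
m <- t(as.matrix(raw))

# The first transposed row holds the headers, the remaining rows the values
result <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(result) <- unname(m[1, ])
rownames(result) <- NULL

print(result)
```

Reading with `colClasses = "character"` keeps the leading zeros of the IDs, at the cost of having to convert the value column explicitly later.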