Question

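I have been struggling to parse a complex *.csv file like the one shown below. It has 6 lines, but the column headers, the IDs and the field values are each listed on a single line, separated by commas. Most of the headers contain spaces and the unit "(mm)", and must be used exactly as they are.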

Sample file:

    Version,1.1,1198    
    Length Unit,mm,1000.000000
    Angle Unit,°,57.295780
    Measurement Names,Body height (mm),Head height (mm),Neck height (mm),Distance neck to buttock (mm),Distance neck-knee (mm)
    Measurement IDs,0010,0020,0030,0040,0050
    C:\Data\new.csv,796,,398,212

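What I want is:

Skip the first 3 lines
Create 3 columns with the headers:
- "Measurement Names" (taken from line 4),
- "Measurement IDs" (taken from line 5) and
- "C:\Data\new.csv" (the last line)
Split all fields at the commas and move them into the right column
Handle the empty field values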

When parsed correctly, it should be arranged like this:

    Measurement Names             Measurement IDs  C:\Data\new.csv
    Body height (mm)              0010             796
    Head height (mm)              0020
    Neck height (mm)              0030             398
    Distance neck to buttock (mm) 0040             212
    Distance neck-knee (mm)       0050

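I have tried read.csv and read.table, but the only function that reads the file successfully is readLines. However, even with the skip option given, it does not skip the first three lines. It also makes no difference which encoding I specify. Given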

data <- "C:\\Temp\\test.csv"
filename <- file(data,open="r")

Zeile <-readLines(filename,
                  skip=3,
                  #warn=FALSE,
                  encoding = 'utf-16-be'
                  )
Zeile <- strsplit(Zeile, ",") # here I try to split 
for (i in 1:length(Zeile)){
  print(Zeile[i])
}
close(filename)

the result is as follows:

[[1]]
[1] "Version" "1.1"     "1198\t"  

[[1]]
[1] "Length Unit" "mm"          "1000.000000"

[[1]]
[1] "Angle Unit" "Â°"         "57.295780" 

[[1]]
[1] "Measurement Names"             "Body height (mm)"              "Head height (mm)"             
[4] "Neck height (mm)"              "Distance neck to buttock (mm)" "Distance neck-knee (mm)"      

[[1]]
[1] "Measurement IDs" "0010"            "0020"            "0030"            "0040"           
[6] "0050"           

[[1]]
[1] "C:\\Data\\new.csv" "796"               ""                  "398"               "212"              

[[1]]
character(0)

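There are quotation marks, and the field values are not lined up in the correct columns.

How can I get the expected result into a data frame for further processing?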

Answer 1

You can try the following:

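library(tidyverse)
# Skip the 3 metadata lines; read everything as character so the IDs keep their leading zeros
read.csv("path/your_file.csv", sep = ",", skip = 3, colClasses = "character") %>% 
   # Long format: one row per (row label, measurement name) pair
   gather(Measurement_Names, v, -Measurement.Names) %>% 
   # Wide again, with the former row labels as column names
   spread(Measurement.Names, v)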
             Measurement_Names  C:\\Data\\new.csv Measurement IDs
1              Body.height..mm.               796            0010
2       Distance.neck.knee..mm.                              0050
3 Distance.neck.to.buttock..mm.               212            0040
4              Head.height..mm.                              0020
5              Neck.height..mm.               398            0030
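
For comparison, here is a base R sketch that continues from the `readLines` attempt in the question. `readLines` has no `skip` argument, so the first three lines are dropped by indexing instead. The vector `lines` is an inline stand-in for the real file, and `parts`, `n`, `values` and `result` are illustrative names, not anything from the original post.

```r
# Stand-in for readLines("C:/Temp/test.csv"); content follows the sample in the question.
lines <- c(
  "Version,1.1,1198",
  "Length Unit,mm,1000.000000",
  "Angle Unit,°,57.295780",
  "Measurement Names,Body height (mm),Head height (mm),Neck height (mm),Distance neck to buttock (mm),Distance neck-knee (mm)",
  "Measurement IDs,0010,0020,0030,0040,0050",
  "C:\\Data\\new.csv,796,,398,212",
  ""
)

# Drop the 3 metadata lines, then any blank lines.
lines <- lines[-(1:3)]
lines <- lines[nzchar(lines)]

# Split each remaining line at the commas.
parts <- strsplit(lines, ",", fixed = TRUE)

# Turn empty fields into NA and pad short records to a common length.
n <- max(lengths(parts))
parts <- lapply(parts, function(p) {
  p[!nzchar(p)] <- NA
  length(p) <- n
  p
})

# The first field of each record is the column header, the rest are its values.
values <- lapply(parts, function(p) p[-1])
names(values) <- vapply(parts, function(p) p[1], character(1))

# check.names = FALSE keeps headers with spaces and parentheses unchanged.
result <- data.frame(values, check.names = FALSE, stringsAsFactors = FALSE)
print(result)
```

Everything stays character, so the IDs keep their leading zeros, and empty fields come out as `NA`. The quotation marks in the question's output are only how `print` displays character vectors; they are not part of the data. The `Â°` there hints that the file may be UTF-8 read under a different native encoding, in which case `readLines(..., encoding = "UTF-8")` might help, but that is a guess.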

Answer 2

It's not R, but I think it could be useful. I'm using Miller (https://github.com/johnkerl/miller) and csvtk (https://bioinf.shenwei.me/csvtk/).

Running

tail -n +4 input_01.csv | mlr --nidx --fs "," cat -n  then unsparsify | csvtk transpose | tail -n +2

you will get

Measurement Names,Measurement IDs,C:\Data\new.csv
Body height (mm),0010,796
Head height (mm),0020,
Neck height (mm),0030,398
Distance neck to buttock (mm),0040,212
Distance neck-knee (mm),0050,

Or, for pretty-printed output, run

tail -n +4 input_01.csv | mlr --nidx --fs "," cat -n  then unsparsify | csvtk transpose | tail -n +2 | mlr --c2p cat

to get

Measurement Names             Measurement IDs C:\Data\new.csv
Body height (mm)              0010            796
Head height (mm)              0020            -
Neck height (mm)              0030            398
Distance neck to buttock (mm) 0040            212
Distance neck-knee (mm)       0050            -
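
The same skip-then-transpose idea can also be sketched in base R. Assumptions: `csv_text` is an inline copy of lines 4 to 6 of the sample file, and `raw`, `m` and `out` are illustrative names. `fill = TRUE` pads the short last record, which is roughly the role `unsparsify` plays in the Miller pipeline above.

```r
# Inline copy of lines 4-6 of the sample file. For a real file one would
# instead call read.csv("input_01.csv", header = FALSE, skip = 3, ...).
csv_text <- paste(
  "Measurement Names,Body height (mm),Head height (mm),Neck height (mm),Distance neck to buttock (mm),Distance neck-knee (mm)",
  "Measurement IDs,0010,0020,0030,0040,0050",
  "C:\\Data\\new.csv,796,,398,212",
  sep = "\n"
)

# Read the records as plain rows: no header, all character, short rows padded.
raw <- read.csv(text = csv_text, header = FALSE,
                colClasses = "character", fill = TRUE)

# Transpose so that each record becomes a column.
m <- t(raw)

# The first transposed row holds the headers; the remaining rows are the data.
out <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(out) <- unname(m[1, ])
rownames(out) <- NULL
print(out)
```

Assigning the names after the data frame is built avoids R's automatic name mangling, so the headers keep their spaces and units.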

How do I parse a complex csv file in R?
