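I have been struggling to parse a complex *.csv file like the one shown below. It has 6 lines, and each line holds the column headers, the IDs, or the field values as a single comma-separated row. Most of the headers contain spaces and the unit "(mm)", and they must be used as-is.
Sample file: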
Version,1.1,1198
Length Unit,mm,1000.000000
Angle Unit,°,57.295780
Measurement Names,Body height (mm),Head height (mm),Neck height (mm),Distance neck to buttock (mm),Distance neck-knee (mm)
Measurement IDs,0010,0020,0030,0040,0050
C:\Data\new.csv,796,,398,212
What I want, once the file is parsed correctly, is this layout:
Measurement Names              Measurement IDs  C:\Data\new.csv
Body height (mm)               0010             796
Head height (mm)               0020
Neck height (mm)               0030             398
Distance neck to buttock (mm)  0040             212
Distance neck-knee (mm)        0050
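I have tried read.csv and read.table, but the only function that reads the file successfully is readLines. Even so, it does not skip the first three lines when given skip = 3, and whichever encoding I pass makes no difference. Here is my code: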
data <- "C:\\Temp\\test.csv"
filename <- file(data, open = "r")
Zeile <- readLines(filename,
                   skip = 3,
                   # warn = FALSE,
                   encoding = 'utf-16-be'
)
Zeile <- strsplit(Zeile, ",") # here I try to split
for (i in 1:length(Zeile)) {
  print(Zeile[i])
}
close(filename)
The result is:
[[1]]
[1] "Version" "1.1" "1198\t"
[[1]]
[1] "Length Unit" "mm" "1000.000000"
[[1]]
[1] "Angle Unit" "°" "57.295780"
[[1]]
[1] "Measurement Names" "Body height (mm)" "Head height (mm)"
[4] "Neck height (mm)" "Distance neck to buttock (mm)" "Distance neck-knee (mm)"
[[1]]
[1] "Measurement IDs" "0010" "0020" "0030" "0040"
[6] "0050"
[[1]]
[1] "C:\\Data\\new.csv" "796" "" "398" "212"
[[1]]
character(0)
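The values are quoted, and the field values do not line up in the correct columns.
How can I get the expected result into a data frame for further processing?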
Answer 0 (score: 2)
You can try the following:
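library(tidyverse)
# skip = 3 drops the three metadata lines; colClasses = "character" keeps IDs such as "0010" intact
read.csv("path/your_file.csv", sep = ",", skip = 3, colClasses = "character") %>%
  # reshape to long form: one row per (original row, measurement) pair
  gather(Measurement_Names, v, -Measurement.Names) %>%
  # spread the original row labels back out as columns
  spread(Measurement.Names, v)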
  Measurement_Names             C:\\Data\\new.csv Measurement IDs
1 Body.height..mm.              796               0010
2 Distance.neck.knee..mm.                         0050
3 Distance.neck.to.buttock..mm. 212               0040
4 Head.height..mm.                                0020
5 Neck.height..mm.              398               0030
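Note that read.csv rewrites the header names by default (Body height (mm) becomes Body.height..mm.), while the question asks for the names as-is. Here is a minimal base R sketch, assuming the layout of the sample file: read the last three lines without a header, transpose, and use the first field of each original line as the column name. The sample text is inlined so the snippet runs on its own; csv_text, raw, tm and result are illustrative names.

```r
# Sample content inlined (assumption: same layout as the file in the question).
csv_text <- paste(c(
  "Version,1.1,1198",
  "Length Unit,mm,1000.000000",
  "Angle Unit,°,57.295780",
  "Measurement Names,Body height (mm),Head height (mm),Neck height (mm),Distance neck to buttock (mm),Distance neck-knee (mm)",
  "Measurement IDs,0010,0020,0030,0040,0050",
  "C:\\Data\\new.csv,796,,398,212"
), collapse = "\n")

# header = FALSE: no line is treated as a header, so no names get rewritten.
# fill = TRUE: the short last line is padded with blank fields.
# colClasses = "character": IDs such as "0010" keep their leading zeros.
raw <- read.csv(text = csv_text, header = FALSE, skip = 3, fill = TRUE,
                colClasses = "character")

# Transpose: each original line becomes a column.
tm <- t(as.matrix(raw))

# Row 1 of the transposed matrix holds the labels; the rest are the values.
result <- as.data.frame(tm[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(result) <- unname(tm[1, ])   # assigned directly, so spaces and "(mm)" survive
rownames(result) <- NULL

print(result)
```

With a real file, read.csv("path/your_file.csv", header = FALSE, skip = 3, fill = TRUE, colClasses = "character") should behave the same way, provided R can read the file's encoding.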
Answer 1 (score: 1)
This is not R, but I think it may be useful. I am using Miller (https://github.com/johnkerl/miller) and csvtk (https://bioinf.shenwei.me/csvtk/).
Run
tail -n +4 input_01.csv | mlr --nidx --fs "," cat -n then unsparsify | csvtk transpose | tail -n +2
and you will get
Measurement Names,Measurement IDs,C:\Data\new.csv
Body height (mm),0010,796
Head height (mm),0020,
Neck height (mm),0030,398
Distance neck to buttock (mm),0040,212
Distance neck-knee (mm),0050,
Or, for pretty-printed output, run
tail -n +4 input_01.csv | mlr --nidx --fs "," cat -n then unsparsify | csvtk transpose | tail -n +2 | mlr --c2p cat
to get
Measurement Names             Measurement IDs C:\Data\new.csv
Body height (mm)              0010            796
Head height (mm)              0020            -
Neck height (mm)              0030            398
Distance neck to buttock (mm) 0040            212
Distance neck-knee (mm)       0050            -
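For comparison, roughly the same steps (drop the first three lines, pad the short rows, transpose) can be sketched in base R with readLines and strsplit. readLines has no skip argument, which is presumably why skip = 3 had no effect in the question (it may have been partially matched to skipNul), and its encoding argument only marks the strings rather than re-encoding the input. The temporary file and all variable names below are illustrative, and the simple comma split assumes no field is quoted.

```r
# Write the sample to a temporary file so the sketch is self-contained
# (assumption: the real file has the same layout as in the question).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Version,1.1,1198",
  "Length Unit,mm,1000.000000",
  "Angle Unit,°,57.295780",
  "Measurement Names,Body height (mm),Head height (mm),Neck height (mm),Distance neck to buttock (mm),Distance neck-knee (mm)",
  "Measurement IDs,0010,0020,0030,0040,0050",
  "C:\\Data\\new.csv,796,,398,212"
), path)

lines <- readLines(path)

# Drop the three metadata lines by index, then any blank lines.
body <- lines[-(1:3)]
body <- body[nzchar(body)]

# Split each remaining line on commas.
fields <- strsplit(body, ",", fixed = TRUE)

# Pad short rows with "" so that every row has the same number of fields.
width  <- max(lengths(fields))
padded <- lapply(fields, function(x) c(x, rep("", width - length(x))))

# Stack the rows into a character matrix, one row per original line.
m <- do.call(rbind, padded)

# Transpose the value part; the first field of each line becomes a column name.
result <- as.data.frame(t(m[, -1, drop = FALSE]), stringsAsFactors = FALSE)
names(result) <- m[, 1]

print(result)
unlink(path)
```

Every column is kept as character here, so IDs such as 0010 keep their leading zeros; the measurement column can be converted with as.numeric when numbers are needed.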