How do I convert/reformat a CSV file?

Time: 2019-05-23 18:20:46

Tags: r csv unix

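I have an oddly formatted CSV file, the output of a lab instrument, with 672 rows of data that I need to transform. It contains multiple samples, and the output concentrations of the compounds are stacked vertically. It looks like this: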

"Sample 1"
"Compound A", 1
"Compound B", 1
"Compound C", 1
"Sample 2"
"Compound A", 3
"Compound B", 3
"Compound C", 3
"Sample 3"
"Compound A", 2
"Compound B", 2
"Compound C", 2

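Honestly, I don't know where to start. I would normally do this kind of transformation in R, but the file is still awkward to work with once it has been read into R.

In R, when I read the CSV file with the following command:

Test <- read.csv("Test.csv", sep=",", header=FALSE)

I get the following: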

              V1      V2         
      1    Sample 1    NA   
      2    Compound A  1     
      3    Compound B  1   
      4    Compound C  1      
      5    Sample 2    NA     
      6    Compound A  3     
      7    Compound B  3       
      8    Compound C  3
      9    Sample 3    NA     
     10    Compound A  2     
     11    Compound B  2       
     12    Compound C  2      

I would like an output file with the samples as columns and the compounds as rows, with each concentration in the right place. For example:

           Sample 1  Sample 2  Sample 3
Compound A     1        3          2
Compound B     1        3          2
Compound C     1        3          2

Either an R solution or a unix solution would work, since I can write the data frame to a text file and work with it in a bash terminal.

2 Answers:

Answer 0: (score: 3)

R is also a good language for cleaning data. I would do something like this:

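# Read the raw file; the sample header rows get NA in V2
df <- read.csv('/tmp/data', header = FALSE)
v <- seq_len(nrow(df))
v[!is.na(df$V2)] <- NA          # keep row indices only for the sample header rows
v <- zoo::na.locf(v)            # carry each header's index down to its compound rows
df$sample <- df$V1[v]           # look up the sample label for every row
df <- df[!is.na(df$V2), ]       # drop the header rows themselves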

This uses zoo::na.locf for the main task, which I always find to be a good choice when the content of one row has to affect the rows that follow it.
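
To make the "last observation carried forward" idea concrete, here is a minimal base-R sketch of the same fill step without the zoo dependency. The vector below is a hand-written stand-in for the row indices in the 12-row example (header rows at 1, 5 and 9), and the name `filled` is only illustrative. It assumes the first element is never NA, which holds when the file starts with a sample header.

```r
# Row indices with NA on every row that should inherit from the header above it.
# Hand-written to mirror the 12-row example: headers sit at rows 1, 5 and 9.
v <- c(1, NA, NA, NA, 5, NA, NA, NA, 9, NA, NA, NA)

# Base-R stand-in for zoo::na.locf(): count the non-NA values seen so far,
# then use that running count to look up the most recent non-NA value.
# Assumes the first element is not NA.
filled <- v[!is.na(v)][cumsum(!is.na(v))]

print(filled)
```

Each compound row ends up holding the index of the sample header above it, which is what the `df$V1[v]` lookup relies on.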

Now you have a data.frame with a column holding the sample label:

           V1 V2   sample
2  Compound A  1 Sample 1
3  Compound B  1 Sample 1
4  Compound C  1 Sample 1
6  Compound A  3 Sample 2
7  Compound B  3 Sample 2
8  Compound C  3 Sample 2
10 Compound A  2 Sample 3
11 Compound B  2 Sample 3
12 Compound C  2 Sample 3

Any of the options for going from "tall" to "wide" format should take care of the rest, for example:

> reshape(df, idvar='V1', direction='wide', timevar='sample')
          V1 V2.Sample 1 V2.Sample 2 V2.Sample 3
2 Compound A           1           3           2
3 Compound B           1           3           2
4 Compound C           1           3           2
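
The reshaped columns carry a "V2." prefix, while the question asks for plain sample names and an output file. The following is a rough sketch of one way to finish the job, not part of the original answer. The long-format data is rebuilt inline so the block runs on its own, and the column name `Compound`, the object names `wide` and `out_path`, and the output file name are all placeholders.

```r
# Long-format data in the shape shown above, rebuilt inline for a standalone sketch
df <- data.frame(
  V1     = rep(c("Compound A", "Compound B", "Compound C"), times = 3),
  V2     = c(1, 1, 1, 3, 3, 3, 2, 2, 2),
  sample = rep(c("Sample 1", "Sample 2", "Sample 3"), each = 3)
)

wide <- reshape(df, idvar = "V1", direction = "wide", timevar = "sample")

# Strip the "V2." prefix that reshape() puts in front of each sample name
names(wide) <- sub("^V2\\.", "", names(wide))
names(wide)[1] <- "Compound"   # placeholder column name
rownames(wide) <- NULL

# Write the table out so it can be picked up from a shell (placeholder path)
out_path <- file.path(tempdir(), "wide.csv")
write.csv(wide, out_path, row.names = FALSE)

print(wide)
```

Writing with `row.names = FALSE` keeps the row numbers out of the file, so the first column is the compound name.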

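Answer 1: (score: 0)

Here is a tidyverse approach to the same cleanup in R. We can:

  1. read_lines the file to get a character vector with one element per line
  2. str_remove_all the literal quotes from each line
  3. put the lines into a column of a tibble (data frame)
  4. str_detect whether each line is a compound line holding data or just a Sample header. Use cumsum to label the compound rows with the right sample number, then filter out the headers
  5. separate the compound identifier from the concentration value
  6. spread the data into wide format.

library(tidyverse)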
file <- read_lines(
'"Sample 1"
"Compound A", 1
"Compound B", 1
"Compound C", 1
"Sample 2"
"Compound A", 3
"Compound B", 3
"Compound C", 3
"Sample 3"
"Compound A", 2
"Compound B", 2
"Compound C", 2'
)
file %>%
  str_remove_all("\"") %>%
  tibble(line = .) %>%
  mutate(sample =  str_detect(line, "Sample") %>% cumsum %>% str_c("Sample_", .)) %>%
  filter(!str_detect(line, "Sample")) %>%
  separate(line, c("compound", "concentration"), sep = ", ") %>%
  spread(sample, concentration)
#> # A tibble: 3 x 4
#>   compound   Sample_1 Sample_2 Sample_3
#>   <chr>      <chr>    <chr>    <chr>   
#> 1 Compound A 1        3        2       
#> 2 Compound B 1        3        2       
#> 3 Compound C 1        3        2

Created on 2019-05-23 by the reprex package (v0.3.0)
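
In the output above the concentration columns are character (`<chr>`). If numeric values are needed downstream, one possible variation is sketched below. It is an adaptation, not the answer author's code: it assumes tidyr 1.0.0 or later for `pivot_wider()`, uses `convert = TRUE` in `separate()` to parse the numbers, and rebuilds the intermediate table inline under the illustrative names `long` and `wide`.

```r
library(dplyr)
library(tidyr)

# The table as it looks after the filter() step, rebuilt inline for a standalone sketch
long <- tibble(
  line   = c("Compound A, 1", "Compound B, 1", "Compound C, 1",
             "Compound A, 3", "Compound B, 3", "Compound C, 3",
             "Compound A, 2", "Compound B, 2", "Compound C, 2"),
  sample = rep(c("Sample_1", "Sample_2", "Sample_3"), each = 3)
)

wide <- long %>%
  # convert = TRUE asks separate() to parse the concentration strings as numbers
  separate(line, c("compound", "concentration"), sep = ", ", convert = TRUE) %>%
  # pivot_wider() is the newer counterpart of spread() (assumes tidyr >= 1.0.0)
  pivot_wider(names_from = sample, values_from = concentration)

print(wide)
```

The layout matches the `spread()` result, with the sample columns stored as integers instead of strings.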