Question

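I have an oddly formatted csv file that is the output from an instrument, and I need 672 rows of data from it. It contains multiple samples, with the output concentrations of the compounds listed vertically. It looks like this: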

"Sample 1"
"Compound A", 1
"Compound B", 1
"Compound C", 1
"Sample 2"
"Compound A", 3
"Compound B", 3
"Compound C", 3
"Sample 3"
"Compound A", 2
"Compound B", 2
"Compound C", 2

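Honestly, I don't know where to start. I would normally do this kind of transformation in R, but the file format is still awkward once it has been read into R.

In R, when I read the csv file with the following command: Test <- read.csv("Test.csv", sep=",", header=FALSE)

I get the following: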

              V1      V2         
      1    Sample 1    NA   
      2    Compound A  1     
      3    Compound B  1   
      4    Compound C  1      
      5    Sample 2    NA     
      6    Compound A  3     
      7    Compound B  3       
      8    Compound C  3
      9    Sample 3    NA     
     10    Compound A  2     
     11    Compound B  2       
     12    Compound C  2

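I would like to end up with an output file that has the samples as columns and the compounds as rows, with each concentration in the right place. For example: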

           Sample 1  Sample 2  Sample 3
Compound A     1        3          2
Compound B     1        3          2
Compound C     1        3          2

Either an R solution or a unix solution would work, since I can write the data frame to a text file and work with it in a bash terminal.

Answer 1

R is a good language for cleaning data too. I would do something like this:

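df <- read.csv('/tmp/data', header=FALSE)
# Start from the row indices, then blank out the data rows (those with a value in V2)
v <- seq_len(nrow(df))
v[!is.na(df$V2)] <- NA
# Carry each "Sample" header's row index forward over the data rows below it
v <- zoo::na.locf(v)
# Look up the sample name for every row, then drop the header rows
df$sample <- df$V1[v]
df <- df[!is.na(df$V2),]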

This uses zoo::na.locf for the main task, which I always find to be a good choice when the content of one row has to affect the rows that follow it.
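To see what na.locf does in isolation, here is a minimal sketch on a toy vector. The vector v below is made up to mirror the index vector built above, and the sketch assumes the zoo package is installed:

```r
library(zoo)

# Header rows keep their own row index; data rows have been set to NA.
v <- c(1, NA, NA, NA, 5, NA, NA, NA)

# na.locf ("last observation carried forward") replaces each NA
# with the most recent non-NA value that precedes it.
filled <- na.locf(v)
print(filled)
#> [1] 1 1 1 1 5 5 5 5
```

Every data row now points at the row index of the "Sample" header it belongs to, which is what makes the df$V1[v] lookup work.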

Now you have a data.frame with a column containing the sample name:

           V1 V2   sample
2  Compound A  1 Sample 1
3  Compound B  1 Sample 1
4  Compound C  1 Sample 1
6  Compound A  3 Sample 2
7  Compound B  3 Sample 2
8  Compound C  3 Sample 2
10 Compound A  2 Sample 3
11 Compound B  2 Sample 3
12 Compound C  2 Sample 3

One of the options for going from "long" to "wide" format should help you with the rest:

> reshape(df, idvar='V1', direction='wide', timevar='sample')
          V1 V2.Sample 1 V2.Sample 2 V2.Sample 3
2 Compound A           1           3           2
3 Compound B           1           3           2
4 Compound C           1           3           2
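Since the question asks for an output file, the following sketch tidies the column names that reshape produces and writes the result to disk. The data frame is rebuilt by hand from the values in the question, and the names wide and out_path (a file in the session's temporary directory) are hypothetical choices for illustration:

```r
# Rebuild the long data frame shown above (values copied from the question).
df <- data.frame(
  V1 = rep(c("Compound A", "Compound B", "Compound C"), times = 3),
  V2 = c(1, 1, 1, 3, 3, 3, 2, 2, 2),
  sample = rep(c("Sample 1", "Sample 2", "Sample 3"), each = 3),
  stringsAsFactors = FALSE
)

wide <- reshape(df, idvar = "V1", direction = "wide", timevar = "sample")

# Drop the "V2." prefix that reshape() adds and rename the id column.
names(wide) <- sub("^V2\\.", "", names(wide))
names(wide)[1] <- "Compound"

# Hypothetical output location; replace with the path you actually want.
out_path <- file.path(tempdir(), "wide.csv")
write.csv(wide, out_path, row.names = FALSE)
```

The resulting file has one row per compound and one column per sample, matching the layout requested in the question.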

Answer 2

Here is a tidyverse approach to doing the same cleaning in R. We can:

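read_lines the file to get a character vector with one element per line
str_remove_all the literal quotes from each line
put the lines into a column of a tibble (data frame)
str_detect whether each line is a compound line containing data or just a Sample header. Use cumsum to label the "Compound" rows with the correct sample number, then filter out the headers
separate the compound identifier from the concentration value
spread the data out into wide format.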
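The cumsum step is probably the least obvious one in that list. As a minimal base R sketch of the idea, using a made-up logical vector in place of the str_detect result (the names is_header, sample_id and sample_label are illustrative only):

```r
# Toy input: TRUE marks a "Sample" header line, FALSE a compound line.
is_header <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

# cumsum() counts the headers seen so far, so every line
# receives the number of the sample it belongs to.
sample_id <- cumsum(is_header)
sample_label <- paste0("Sample_", sample_id)
print(sample_label)
```

Note that this numbers the samples by their order in the file rather than reading the number out of the header text. The complete pipeline for the question's data is: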

library(tidyverse)
file <- read_lines(
'"Sample 1"
"Compound A", 1
"Compound B", 1
"Compound C", 1
"Sample 2"
"Compound A", 3
"Compound B", 3
"Compound C", 3
"Sample 3"
"Compound A", 2
"Compound B", 2
"Compound C", 2'
)
file %>%
  str_remove_all("\"") %>%
  tibble(line = .) %>%
  mutate(sample =  str_detect(line, "Sample") %>% cumsum %>% str_c("Sample_", .)) %>%
  filter(!str_detect(line, "Sample")) %>%
  separate(line, c("compound", "concentration"), sep = ", ") %>%
  spread(sample, concentration)
#> # A tibble: 3 x 4
#>   compound   Sample_1 Sample_2 Sample_3
#>   <chr>      <chr>    <chr>    <chr>   
#> 1 Compound A 1        3        2       
#> 2 Compound B 1        3        2       
#> 3 Compound C 1        3        2

Created on 2019-05-23 by the reprex package (v0.3.0)
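Two details of the output above may matter in practice: the concentration columns come out as character (<chr>), and newer tidyr releases mark spread as superseded in favor of pivot_wider. A hedged variant of the same pipeline, assuming tidyr 1.0.0 or later is installed, and using the hypothetical names input_lines and wide:

```r
library(tidyverse)

# Same data as in the question, one element per line of the file.
input_lines <- c(
  '"Sample 1"', '"Compound A", 1', '"Compound B", 1', '"Compound C", 1',
  '"Sample 2"', '"Compound A", 3', '"Compound B", 3', '"Compound C", 3',
  '"Sample 3"', '"Compound A", 2', '"Compound B", 2', '"Compound C", 2'
)

wide <- tibble(line = str_remove_all(input_lines, "\"")) %>%
  # Number each line by how many "Sample" headers have been seen so far.
  mutate(sample = str_c("Sample_", cumsum(str_detect(line, "Sample")))) %>%
  filter(!str_detect(line, "Sample")) %>%
  # convert = TRUE turns the concentration strings into numbers.
  separate(line, c("compound", "concentration"), sep = ", ", convert = TRUE) %>%
  pivot_wider(names_from = sample, values_from = concentration)

print(wide)
```

With convert = TRUE the sample columns are numeric, so they can be used directly in calculations or written out with write_csv.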

How to transform/reformat a CSV file?
