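I have an oddly formatted CSV file, the output of a lab instrument, from which I need 672 rows of data. It contains multiple samples, with the concentration of each compound listed vertically under its sample. It looks like this: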
"Sample 1"
"Compound A", 1
"Compound B", 1
"Compound C", 1
"Sample 2"
"Compound A", 3
"Compound B", 3
"Compound C", 3
"Sample 3"
"Compound A", 2
"Compound B", 2
"Compound C", 2
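Honestly, I have no idea where to start. I would normally do this kind of transformation in R, but the file is still awkward to work with once it has been read in.
In R, when I read the CSV file with the following command: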
Test <- read.csv("Test.csv", sep=",", header=FALSE)
I get the following:
           V1 V2
1    Sample 1 NA
2  Compound A  1
3  Compound B  1
4  Compound C  1
5    Sample 2 NA
6  Compound A  3
7  Compound B  3
8  Compound C  3
9    Sample 3 NA
10 Compound A  2
11 Compound B  2
12 Compound C  2
I would like an output file with the samples as columns and the compounds as rows, with each concentration in the right cell. For example:
            Sample 1  Sample 2  Sample 3
Compound 1         1         3         2
Compound 2         1         3         2
Compound 3         1         3         2
Either an R solution or a unix solution would work, since I can write the data frame to a text file and process it in a bash terminal.
Answer 0: (score: 3)
R is a good language for cleaning data, too. I would do something like this:
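# Read the raw file; sample header rows end up with NA in V2
df <- read.csv('/tmp/data', header=FALSE)
# Row indices, blanked out on compound rows so only header rows keep theirs
v <- seq_len(nrow(df))
v[!is.na(df$V2)] <- NA
# Carry each header's row index forward over the compound rows below it
v <- zoo::na.locf(v)
# Look up the sample name for every row, then drop the header rows
df$sample <- df$V1[v]
df <- df[!is.na(df$V2),]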
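This uses zoo::na.locf for the main task. I always find it a good choice when the contents of one row have to influence the rows that follow.
Now you have a data.frame with a column containing the sample name: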
           V1 V2   sample
2  Compound A  1 Sample 1
3  Compound B  1 Sample 1
4  Compound C  1 Sample 1
6  Compound A  3 Sample 2
7  Compound B  3 Sample 2
8  Compound C  3 Sample 2
10 Compound A  2 Sample 3
11 Compound B  2 Sample 3
12 Compound C  2 Sample 3
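If installing zoo is not an option, the same forward-fill can be sketched in base R with cumsum(). This is only an illustration, not part of the original answer: the inline txt string is a hypothetical stand-in for the real file, and the sketch assumes the first row of the file is always a sample header.

```r
# Hypothetical inline copy of the example data, so the sketch runs without a file
txt <- '"Sample 1"
"Compound A", 1
"Compound B", 1
"Compound C", 1
"Sample 2"
"Compound A", 3
"Compound B", 3
"Compound C", 3
"Sample 3"
"Compound A", 2
"Compound B", 2
"Compound C", 2'

df <- read.csv(text = txt, header = FALSE, strip.white = TRUE)

# Header rows are the ones with no concentration in V2
is_header <- is.na(df$V2)

# cumsum() over the header flags numbers each block of rows: 1,1,1,1,2,2,2,2,...
block <- cumsum(is_header)

# Look up the header text for each block, then drop the header rows
df$sample <- df$V1[is_header][block]
df <- df[!is_header, ]
```

The running count plays the role of na.locf here: every compound row inherits the index of the most recent header above it.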
One of the options for converting from "long" to "wide" format should take care of the rest:
> reshape(df, idvar='V1', direction='wide', timevar='sample')
          V1 V2.Sample 1 V2.Sample 2 V2.Sample 3
2 Compound A           1           3           2
3 Compound B           1           3           2
4 Compound C           1           3           2
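Since the question asks for an output file, here is a minimal sketch of tidying the reshape() result and writing it out. The long data frame is rebuilt inline so the example is self-contained, and out_file is a hypothetical path (a temporary file here); neither comes from the original answer.

```r
# Long-format data equivalent to the data.frame built above (rebuilt inline)
long <- data.frame(
  V1 = rep(c("Compound A", "Compound B", "Compound C"), times = 3),
  V2 = c(1, 1, 1, 3, 3, 3, 2, 2, 2),
  sample = rep(c("Sample 1", "Sample 2", "Sample 3"), each = 3),
  stringsAsFactors = FALSE
)

wide <- reshape(long, idvar = "V1", direction = "wide", timevar = "sample")

# Strip the "V2." prefix that reshape() adds and give the id column a clearer name
names(wide) <- sub("^V2\\.", "", names(wide))
names(wide)[1] <- "Compound"
rownames(wide) <- NULL

# out_file is a hypothetical output path; replace it with a real one
out_file <- tempfile(fileext = ".csv")
write.csv(wide, out_file, row.names = FALSE)
```

When reading the file back into R, passing check.names = FALSE to read.csv keeps the spaces in the "Sample 1" style column names.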
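Answer 1: (score: 0)
Here is a tidyverse approach to the same cleaning operation in R. We can:
1. read_lines the file to get a character vector with one element per line
2. str_remove_all the literal quotation marks from each line
3. put the lines into a column of a tibble (data frame)
4. str_detect whether each line is a compound row containing data or just a Sample header, use cumsum to label the compound rows with the correct sample number, then filter out the headers
5. separate the compound identifier from the concentration value
6. spread the data out into wide format.
library(tidyverse)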
file <- read_lines(
'"Sample 1"
"Compound A", 1
"Compound B", 1
"Compound C", 1
"Sample 2"
"Compound A", 3
"Compound B", 3
"Compound C", 3
"Sample 3"
"Compound A", 2
"Compound B", 2
"Compound C", 2'
)
file %>%
str_remove_all("\"") %>%
tibble(line = .) %>%
mutate(sample = str_detect(line, "Sample") %>% cumsum %>% str_c("Sample_", .)) %>%
filter(!str_detect(line, "Sample")) %>%
separate(line, c("compound", "concentration"), sep = ", ") %>%
spread(sample, concentration)
#> # A tibble: 3 x 4
#> compound Sample_1 Sample_2 Sample_3
#> <chr> <chr> <chr> <chr>
#> 1 Compound A 1 3 2
#> 2 Compound B 1 3 2
#> 3 Compound C 1 3 2
Created on 2019-05-23 by the reprex package (v0.3.0)
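In newer tidyr releases spread() has been superseded by pivot_wider(). The sketch below is a variant of the same pipeline under that assumption (tidyr 1.0.0 or later); it also passes convert = TRUE to separate() so the concentrations come out as numbers rather than character strings. The lines vector is a hypothetical stand-in for the result of read_lines on the real file.

```r
library(tidyverse)  # assumes tidyr >= 1.0.0, which provides pivot_wider()

# Stand-in for read_lines("Test.csv"): one element per line of the file
lines <- c(
  '"Sample 1"', '"Compound A", 1', '"Compound B", 1', '"Compound C", 1',
  '"Sample 2"', '"Compound A", 3', '"Compound B", 3', '"Compound C", 3',
  '"Sample 3"', '"Compound A", 2', '"Compound B", 2', '"Compound C", 2'
)

result <- tibble(line = str_remove_all(lines, "\"")) %>%
  # Count the headers seen so far to label every row with its sample
  mutate(sample = str_c("Sample_", cumsum(str_detect(line, "^Sample")))) %>%
  # Drop the header rows themselves
  filter(!str_detect(line, "^Sample")) %>%
  # Split "Compound A, 1" into two columns; convert = TRUE parses the number
  separate(line, c("compound", "concentration"), sep = ", ", convert = TRUE) %>%
  # One column per sample
  pivot_wider(names_from = sample, values_from = concentration)
```

Anchoring the pattern with "^Sample" is a small precaution in case a compound name ever contains the word "Sample".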