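I have a large file (~100k rows and 100 columns), and I want to extract information from four columns based on the value of another column. There is a column named Caller that tells you which of the columns ending in .sample will contain something other than noSample.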
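I have tried using if and else if statements, but sometimes two conditions are met at once, and writing out every possible combination takes a lot of effort. I'm fairly sure there is a better way.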
My real data.frame looks like this (edited):
Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
B= c(10,12,13,14,15,16,17),
Caller = c("A", "B", "C", "D", "A,C", "A,B,C", "B,D"),
A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
dummy1 = 1:7,
B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
dummy2 = 1:7,
C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
dummy3 = 1:7,
D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"), stringsAsFactors=FALSE)
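I want to extract a vector of samples for each row. This can be stored in a list or another R object. I will then match these samples against a data.frame that associates each sample with a process.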
My desired output would be
>row1
3xd|432
>row2
456|789|asd
>row3
zxc|vbn|mn
>row4
poi|uyh|gfrt|562
>row5
[1]1234|567|87sd [2]gfd3|123|456|789
>row6
[1]234|456|897a [2]674e|7892|123|432 [3]674e|7892|123
>row7
[1]bgcf|12er|567|zxs3|12ple [2]567|zxs3|12ple
My desired output would not include the pipe | between samples, but I can use strsplit to remove it.
Since the data.frame is large, speed is critical.
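The strsplit step mentioned above could look like the following minimal sketch. The variable names x, parts, and samples are illustrative and not from the original post. fixed = TRUE is used so that | is treated as a literal separator rather than the regex alternation operator.

```r
# Illustrative sample strings copied from row 5 of the example data
x <- c("1234|567|87sd", "gfd3|123|456|789")

# fixed = TRUE makes "|" a literal separator instead of a regex "or"
parts <- strsplit(x, "|", fixed = TRUE)

# Flatten the per-string pieces into a single character vector
samples <- unlist(parts)
samples
```

Without fixed = TRUE (or an escaped pattern such as "\\|"), the pattern "|" matches the empty string and the input is split into single characters.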
Answer 0 (score: 2)
Here is one possible solution:
Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
B= c(10,12,13,14,15,16,17),
Caller = c("A", "B", "C", "D", "A,C", "A,B,C", "B,D"),
A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"),
stringsAsFactors=FALSE)
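# Take the first letter of each column name (the caller ID for the *.sample columns)
prefix <- substr(names(Df), 1, 1)
# Set the non-sample columns (A, B, Caller) to NA so they can never match
prefix[-c(4:ncol(Df))] <- NA
# Build a regular expression per row by replacing the comma with the "or" operator |
reg <- gsub(",", "|", Df$Caller, fixed = TRUE)
# Find the matching column indices for each row
# (lapply always returns a list, even when every row has the same number of callers)
columns <- lapply(reg, function(x) grep(x, prefix))
# Extract the desired columns for each row into a list
lapply(seq_along(columns), function(i) Df[i, columns[[i]]])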
I added stringsAsFactors=FALSE to the data frame definition to avoid the overhead associated with factor levels.
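Because the data in the question has other columns interleaved between the .sample columns (dummy1, dummy2, dummy3), a variant that looks columns up by name rather than by position may be more robust. The following is only a sketch, under the assumption that every caller X listed in Caller has a matching column named X.sample. The names sample_cols, m, callers, and samples are illustrative. Converting to a character matrix first avoids repeated data.frame row indexing, which is usually slow on large inputs, although this has not been benchmarked here.

```r
# Example data from the question, including the interleaved dummy columns
Df <- data.frame(
  A = rep("chr1", 7),
  B = c(10, 12, 13, 14, 15, 16, 17),
  Caller = c("A", "B", "C", "D", "A,C", "A,B,C", "B,D"),
  A.sample = c("3xd|432", "noSample", "noSample", "noSample",
               "1234|567|87sd", "234|456|897a", "noSample"),
  dummy1 = 1:7,
  B.sample = c("noSample", "456|789|asd", "noSample", "noSample", "noSample",
               "674e|7892|123|432", "bgcf|12er|567|zxs3|12ple"),
  dummy2 = 1:7,
  C.sample = c("noSample", "noSample", "zxc|vbn|mn", "noSample",
               "gfd3|123|456|789", "674e|7892|123", "noSample"),
  dummy3 = 1:7,
  D.sample = c("noSample", "noSample", "noSample", "poi|uyh|gfrt|562",
               "noSample", "noSample", "567|zxs3|12ple"),
  stringsAsFactors = FALSE
)

# Keep only the *.sample columns, as a character matrix for cheap indexing
sample_cols <- grep("\\.sample$", names(Df), value = TRUE)
m <- as.matrix(Df[sample_cols])

# Split "A,C" into c("A", "C") for every row
callers <- strsplit(Df$Caller, ",", fixed = TRUE)

# Assumption: caller "X" corresponds to the column "X.sample"
samples <- lapply(seq_len(nrow(m)), function(i) {
  unname(m[i, paste0(callers[[i]], ".sample")])
})

samples[[5]]
```

Each element of samples is a plain character vector, so it can be passed straight to strsplit if the individual pipe-separated pieces are needed.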
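Answer 1 (score: 2)
This is just one of many possible ways to achieve the desired result. Note that I use the same data frame as @Dave2e, i.e., I have added stringsAsFactors=F to the call to data.frame.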
library(tidyverse)
out <- Df %>% rowid_to_column() %>% # add explicit row IDs
  gather(key, value, -rowid, -A, -B, -Caller) %>% # reshape to long format: one row per (row, sample column)
  filter(value != "noSample") # keep only the cells that hold real samples
The resulting data frame looks like this:
out
rowid A B Caller key value
1 1 chr1 10 A A.sample 3xd|432
2 5 chr1 15 A,C A.sample 1234|567|87sd
3 6 chr1 16 A,B,C A.sample 234|456|897a
4 2 chr1 12 B B.sample 456|789|asd
5 6 chr1 16 A,B,C B.sample 674e|7892|123|432
6 7 chr1 17 B,D B.sample bgcf|12er|567|zxs3|12ple
7 3 chr1 13 C C.sample zxc|vbn|mn
8 5 chr1 15 A,C C.sample gfd3|123|456|789
9 6 chr1 16 A,B,C C.sample 674e|7892|123
10 4 chr1 14 D D.sample poi|uyh|gfrt|562
11 7 chr1 17 B,D D.sample 567|zxs3|12ple
Now we can simply subset to get the desired result:
out[out$rowid == 1,"value"]
[1] "3xd|432"
out[out$rowid == 5,"value"]
[1] "1234|567|87sd" "gfd3|123|456|789"
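To get every row at once rather than subsetting one rowid at a time, the long-format result could be split by rowid. This is a base R sketch on a small hand-built stand-in for out (only the rowid and value columns, with a few rows copied from the output above). The names out_small, by_row, and pieces are illustrative.

```r
# Hand-built stand-in for a few rows of the long-format result
out_small <- data.frame(
  rowid = c(1, 5, 6, 2, 5),
  value = c("3xd|432", "1234|567|87sd", "234|456|897a",
            "456|789|asd", "gfd3|123|456|789"),
  stringsAsFactors = FALSE
)

# One character vector of samples per original row, named by rowid
by_row <- split(out_small$value, out_small$rowid)

# Optionally break each sample string on the literal pipe
pieces <- lapply(by_row, function(v) unlist(strsplit(v, "|", fixed = TRUE)))

by_row[["5"]]
```

The list is keyed by rowid as a string, so by_row[["5"]] plays the same role as out[out$rowid == 5, "value"].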