I have a file in the following format; a few lines are shown below.
<N2> AS 12/13:2:-1000.00,-25.73 13/13:2:-272.09,-12.81
<N2> AS 6/6:2:-1000.00,-19.88 8/8:2:-211.51,-5.98
<N0> AS 4/4:0:2:-218.21,-11.95 4/4:2:-208.55,-11.01
<N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
<N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
<N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45
I need to compare the first element of $3 with the first element of $4 (elements are separated by ":") and recode both using only the values 0 and 1. The logic for the four possible comparison cases in the sample data is as follows:
when only one value differs between the two elements, change them to 0/0 and 0/1
when both values differ between the two elements, change them to 0/0 and 1/1
when both values are the same and non-zero in the two elements, change them to 1/1 and 1/1
when both elements are already coded as 0 and 1, do not change them.
Applying this logic to the sample data, comparing the first element of $3 with the first element of $4:
12/13 and 13/13 have one value (separated by "/") in common, so change them to 0/0 and 0/1
6/6 and 8/8: both values separated by "/" differ between $3 and $4, so change them to 0/0 and 1/1
4/4 and 4/4: both values separated by "/" are the same in $3 and $4 and are non-zero, so change them to 1/1 and 1/1
If the values are already coded as 0 and 1, do not change them.
So the output for the example above would be:
<N2> AS 0/0:2:-1000.00,-25.73 0/1:2:-272.09,-12.81
<N2> AS 0/0:2:-1000.00,-19.88 1/1/0:2:-211.51,-5.98
<N0> AS 1/1:0:2:-218.21,-11.95 1/1:2:-208.55,-11.01
<N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
<N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
<N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45
Is there a possible solution in awk or R?
Answer 0 (score: 1)
You can do the following in R.
Data:
library(data.table)  # fread(), setDF()
library(magrittr)    # %>%
df1 <-
fread("<N2> AS 12/13:2:-1000.00,-25.73 13/13:2:-272.09,-12.81
<N2> AS 6/6:2:-1000.00,-19.88 8/8/0:2:-211.51,-5.98
<N0> AS 4/4:0:2:-218.21,-11.95 4/4:2:-208.55,-11.01
<N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
<N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
<N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45",sep=" ",header=FALSE) %>% setDF
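Code: load the library and define a function that does the work for you:
library(magrittr)  # %>% and %<>%
fun1 <- function(df_in) {
  # Extract the leading "a/b" from each cell and split it into two numbers
  vals <- lapply(df_in, function(x) {
    sub("(\\d+/\\d+).*", "\\1", x, perl = TRUE) %>% strsplit("/") %>% lapply(as.numeric)
  })
  # Recode each pair: x comes from the first column, y from the second
  newvals <- mapply(function(x, y) {
    if (all(c(x, y) %in% 0:1)) {
      # Already coded as 0 and 1: keep the values unchanged
      list(paste0(x, collapse = "/"), paste0(y, collapse = "/"))
    } else {
      u <- x != y  # TRUE at each position where the two values differ
      if (identical(u, c(TRUE, FALSE)) || identical(u, c(FALSE, TRUE))) {
        list("0/0", "0/1")      # exactly one value differs
      } else if (identical(u, c(TRUE, TRUE))) {
        list("0/0", "1/1")      # both values differ
      } else if (identical(u, c(FALSE, FALSE))) {
        list("1/1", "1/1")      # both values are the same and non-zero
      } else {
        list("Error", "Error")  # anything that could not be parsed
      }
    }
  }, x = vals[[1]], y = vals[[2]])
  # Put the recoded values back in front of the rest of each cell
  list(
    paste0(unlist(newvals[1, ]), sub("\\d+/\\d+", "", df_in[[1]])),
    paste0(unlist(newvals[2, ]), sub("\\d+/\\d+", "", df_in[[2]]))
  )
}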
Call the function on the column numbers that need to be changed:
df1[,3:4] %<>% fun1
Result:
#> df1
# V1 V2 V3 V4
#1 <N2> AS 0/0:2:-1000.00,-25.73 0/1:2:-272.09,-12.81
#2 <N2> AS 0/0:2:-1000.00,-19.88 1/1/0:2:-211.51,-5.98
#3 <N0> AS 1/1:0:2:-218.21,-11.95 1/1:2:-208.55,-11.01
#4 <N0> AS 0/0:2:-1000.00,-16.68 0/0:2:-294.18,-10.45
#5 <N0> AS 0/1:2:-1000.00,-16.68 0/1:2:-294.18,-10.45
#6 <N0> AS 1/1:2:-1000.00,-16.68 1/1:2:-294.18,-10.45
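To check the four cases from the question one at a time, the recoding rule for a single pair can be pulled out into a small base R helper. This is only a minimal sketch: `recode_pair` is a hypothetical name that does not appear in the answer, and it assumes each input is exactly two integers separated by "/" and that the values are compared position by position.

```r
# Hypothetical helper: recode one pair of "a/b" strings using only 0 and 1.
# Assumes two "/"-separated integers per string, compared position by position.
recode_pair <- function(a, b) {
  x <- as.numeric(strsplit(a, "/", fixed = TRUE)[[1]])
  y <- as.numeric(strsplit(b, "/", fixed = TRUE)[[1]])
  # Already coded as 0 and 1: leave both strings unchanged
  if (all(c(x, y) %in% 0:1)) return(c(a, b))
  # Number of positions at which the two elements differ (0, 1 or 2)
  n_diff <- sum(x != y)
  switch(as.character(n_diff),
         "0" = c("1/1", "1/1"),   # same, non-zero values
         "1" = c("0/0", "0/1"),   # one value differs
         "2" = c("0/0", "1/1"))   # both values differ
}

recode_pair("12/13", "13/13")  # "0/0" "0/1"
recode_pair("6/6", "8/8")      # "0/0" "1/1"
recode_pair("4/4", "4/4")      # "1/1" "1/1"
recode_pair("0/1", "0/1")      # "0/1" "0/1"
```

Because the comparison is positional, a pair such as 12/13 and 13/12 would count as two differences; whether that is the intended behaviour is not stated in the question.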