Compare values in two columns and recode using R or awk

Time: 2018-08-22 08:02:35

Tags: r awk

I have a file in the following format; a few lines are shown below.

<N2>    AS  12/13:2:-1000.00,-25.73     13/13:2:-272.09,-12.81
<N2>    AS  6/6:2:-1000.00,-19.88   8/8/0:2:-211.51,-5.98
<N0>    AS  4/4:0:2:-218.21,-11.95  4/4:2:-208.55,-11.01
<N0>    AS  0/0:2:-1000.00,-16.68   0/0:2:-294.18,-10.45
<N0>    AS  0/1:2:-1000.00,-16.68   0/1:2:-294.18,-10.45
<N0>    AS  1/1:2:-1000.00,-16.68   1/1:2:-294.18,-10.45

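I need to compare the first element of $3 with the first element of $4 (elements are separated by ":") and recode both using only the values 0 and 1. The logic for the four possible comparison cases in the sample data is as follows: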

When only one value differs between the two elements, change them to 0/0 and 0/1.
When both values differ between the two elements, change them to 0/0 and 1/1.
When both values are the same and non-zero between the two elements, change them to 1/1 and 1/1.
When both values are already coded as 0 and 1, do not change them.

Applying this logic to the sample data, the first element of $3 is compared with the first element of $4:

12/13 and 13/13 have one value (of the two separated by "/") in common, so change them to 0/0 and 0/1
6/6 and 8/8: both values separated by "/" differ between $3 and $4, so change them to 0/0 and 1/1
4/4 and 4/4: both values separated by "/" are the same between $3 and $4 and non-zero, so change them to 1/1 and 1/1

If the values are already coded as 0 and 1, do not change them.

So the output for the example above would be:

<N2>    AS  0/0:2:-1000.00,-25.73   0/1:2:-272.09,-12.81
<N2>    AS  0/0:2:-1000.00,-19.88   1/1/0:2:-211.51,-5.98
<N0>    AS  1/1:0:2:-218.21,-11.95  1/1:2:-208.55,-11.01
<N0>    AS  0/0:2:-1000.00,-16.68   0/0:2:-294.18,-10.45
<N0>    AS  0/1:2:-1000.00,-16.68   0/1:2:-294.18,-10.45
<N0>    AS  1/1:2:-1000.00,-16.68   1/1:2:-294.18,-10.45

Is there a possible solution in awk or R?

1 Answer:

Answer 0 (score: 1)

You can do the following in R.

Data:

# data.table = FALSE makes fread() return a plain data.frame,
# so neither %>% nor setDF is needed before the libraries are loaded
df1 <-
data.table::fread("<N2>    AS  12/13:2:-1000.00,-25.73     13/13:2:-272.09,-12.81
<N2>    AS  6/6:2:-1000.00,-19.88   8/8/0:2:-211.51,-5.98
<N0>    AS  4/4:0:2:-218.21,-11.95  4/4:2:-208.55,-11.01
<N0>    AS  0/0:2:-1000.00,-16.68   0/0:2:-294.18,-10.45
<N0>    AS  0/1:2:-1000.00,-16.68   0/1:2:-294.18,-10.45
<N0>    AS  1/1:2:-1000.00,-16.68   1/1:2:-294.18,-10.45", sep = " ", header = FALSE, data.table = FALSE)

Code: load the libraries and create a function that does the job for you:

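library(magrittr)  # %>% and %<>%
library(dplyr)     # case_when()
fun1 <- function(df_in) {
    # For each column, pull out the leading "a/b" genotype and split it into two numbers
    vals <- lapply(df_in, function(x) {
        sub("(\\d+/\\d+).*", "\\1", x, perl = TRUE) %>% strsplit("/") %>% lapply(as.numeric)
    })
    newvals <-
        mapply(function(x, y) {
            # Already coded as 0/1 in both columns: keep the genotypes unchanged
            if (all(c(x, y) %in% 0:1)) list(paste0(x, collapse = "/"), paste0(y, collapse = "/")) else {
                # TRUE at each position where the two columns differ
                u <- abs(x - y) >= 1
                return(
                    case_when(
                        identical(u, c(TRUE, FALSE))  ~ list("0/0", "0/1"),  # only the first value differs
                        identical(u, c(FALSE, TRUE))  ~ list("0/0", "0/1"),  # only the second value differs
                        identical(u, c(TRUE, TRUE))   ~ list("0/0", "1/1"),  # both values differ
                        identical(u, c(FALSE, FALSE)) ~ list("1/1", "1/1"),  # same, non-zero values
                        TRUE                          ~ list("Error", "Error")
                    )
                )
            } }, x = vals[[1]], y = vals[[2]])
    # Put the recoded genotype back in front of the remainder of each field
    return(
        list(
            paste0(unlist(newvals[1, ]), sub("\\d+/\\d+", "", df_in[[1]])),
            paste0(unlist(newvals[2, ]), sub("\\d+/\\d+", "", df_in[[2]]))
        )
    )
}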

Call the function on the column numbers that need to be changed:

df1[,3:4] %<>% fun1

Result:

#> df1
#    V1 V2                     V3                    V4
#1 <N2> AS  0/0:2:-1000.00,-25.73  0/1:2:-272.09,-12.81
#2 <N2> AS  0/0:2:-1000.00,-19.88 1/1/0:2:-211.51,-5.98
#3 <N0> AS 1/1:0:2:-218.21,-11.95  1/1:2:-208.55,-11.01
#4 <N0> AS  0/0:2:-1000.00,-16.68  0/0:2:-294.18,-10.45
#5 <N0> AS  0/1:2:-1000.00,-16.68  0/1:2:-294.18,-10.45
#6 <N0> AS  1/1:2:-1000.00,-16.68  1/1:2:-294.18,-10.45
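
Since the question also asks about awk, the same four rules could be sketched in awk roughly as follows. This is only a minimal sketch under some assumptions that are not in the original post: the wrapper name recode_genotypes is hypothetical, the output is written tab-separated, lines whose $3 or $4 do not start with digits/digits are passed through unchanged, and any pair not covered by the four rules is simply classified by how many positions differ.

```shell
# Hypothetical helper: recode the leading "a/b" genotype of columns 3 and 4.
# Reads the files given as arguments, or standard input if there are none.
recode_genotypes() {
    awk '
    # True when a value is already coded as 0 or 1
    function is01(v) { return (v + 0 == 0 || v + 0 == 1) }

    BEGIN { OFS = "\t" }   # assumption: tab-separated output

    {
        $1 = $1   # rebuild the record so every output line uses OFS

        # Split the leading genotype from the rest of each field
        if (!match($3, /^[0-9]+\/[0-9]+/)) { print; next }
        g3 = substr($3, 1, RLENGTH); rest3 = substr($3, RLENGTH + 1)
        if (!match($4, /^[0-9]+\/[0-9]+/)) { print; next }
        g4 = substr($4, 1, RLENGTH); rest4 = substr($4, RLENGTH + 1)

        split(g3, a, "/"); split(g4, b, "/")

        # Rule 4: values already coded as 0 and 1 are left unchanged
        if (is01(a[1]) && is01(a[2]) && is01(b[1]) && is01(b[2])) { print; next }

        # Count the positions at which the two genotypes differ
        d = (a[1] + 0 != b[1] + 0) + (a[2] + 0 != b[2] + 0)

        if (d == 2)      { n3 = "0/0"; n4 = "1/1" }   # both values differ
        else if (d == 1) { n3 = "0/0"; n4 = "0/1" }   # only one value differs
        else             { n3 = "1/1"; n4 = "1/1" }   # same, non-zero values

        $3 = n3 rest3; $4 = n4 rest4
        print
    }' "$@"
}

# Example run on two of the sample rows
recode_genotypes <<'EOF'
<N2>    AS  12/13:2:-1000.00,-25.73     13/13:2:-272.09,-12.81
<N0>    AS  0/1:2:-1000.00,-16.68   0/1:2:-294.18,-10.45
EOF
```

On a real file this would be called as recode_genotypes input.txt, where input.txt is a placeholder name. Changing OFS in the BEGIN block is enough if a different output separator is preferred.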