从affymetrix SNP阵列获取LRR和BAF值

时间:2015-10-02 13:28:18

标签: arrays r windows bioinformatics

我试图从affymetrix SNP芯片中提取LRRBAF值,但未成功使用基于Linux的工具。我尝试在名为Axiom™ CNV Summary Tools Software的Windows设计软件中使用一个小子集,它运行得很好。问题是我有一个庞大的数据集,并且无法在足够强大的Windows机器上运行。

让我们揭开我的步骤直到这一点。首先,我获得了五个制表符分隔文件,这些文件需要linux和/或windows管道(1-3使用APT affymetrix软件获得)。

1 - Axiom calls.txt或基因型文件:

   calls <- 'probeset_id    sample_1    sample_2    sample_3
    AX-100010998    2   2   2
    AX-100010999    1   0   1
    AX-100011005    0   1   2
    AX-100011007    2   2   1
    AX-100011008    1   1   2
    AX-100011010    2   2   2
    AX-100011011    0   1   0
    AX-100011012    0   1   0
    AX-100011016    0   0   1
    AX-100011017    0   0   2'
calls <- read.table(text=calls, header=T)

2 - confidences.txt文件:

conf<- 'probeset_id sample_1    sample_2    sample_3
AX-100010998    0.00001 0.0002  0.00006
AX-100010999    0.00001 0.00001 0.00001
AX-100011005    0.00007 0.00017 0.00052
AX-100011007    0.00001 0.00001 0.00001
AX-100011008    0.001   0.00152 0.00001
AX-100011010    0.00001 0.00001 0.00002
AX-100011011    0.00004 0.00307 0.00002
AX-100011012    0.00001 0.00001 0.00001
AX-100011016    0.00003 0.00001 0.00001
AX-100011017    0.00003 0.01938 0.00032' 
conf <- read.table(text=conf, header=T)

3 - summary.txt文件:

summ <- 'probeset_id    sample_1    sample_2    sample_3
AX-100010998-A  740.33229   655.41465   811.98053
AX-100010998-B  1139.25679  1659.55079  917.7128
AX-100010999-A  1285.67306  1739.03296  1083.48455
AX-100010999-B  1403.51265  341.85893   1237.48577
AX-100011005-A  1650.03408  1274.57594  485.5324
AX-100011005-B  430.3122    2674.70182  4070.90727
AX-100011007-A  411.28952   449.76345   2060.7136
AX-100011007-B  4506.77692  4107.12982  2065.58516
AX-100011008-A  427.78263   439.63541   333.86312
AX-100011008-B  1033.41335  1075.31617  1623.69271
AX-100011010-A  390.12996   350.54456   356.63156
AX-100011010-B  1183.29912  1256.01391  1650.82396
AX-100011011-A  3593.93578  2902.34079  2776.2503
AX-100011011-B  867.33447   2252.54552  961.31596
AX-100011012-A  2250.44699  1192.46116  1927.70581
AX-100011012-B  740.31957   1721.70283  662.1414
AX-100011016-A  1287.9221   1367.95468  1037.98191
AX-100011016-B  554.8795    666.93132   1487.2143
AX-100011017-A  2002.40468  1787.42982  490.28802
AX-100011017-B  849.92775   1025.44417  1429.96567'
summ <- read.table(text=summ, header=T)

4 - gender.txt

gender <- 'cel_files    gender
sample_1    female
sample_2    female
sample_3    female'
gender <- read.table(text=gender, header=F)

最后,windows中的地图文件map.db(不可读)或linux中的map.txt如下:

map <- 'Name    Chr Position
AX-100010998    Z   70667736
AX-100010999    4   36427048
AX-100011005    26  4016045
AX-100011007    6   25439800
AX-100011008    2   147800617
AX-100011010    1   98919397
AX-100011011    Z   66652642
AX-100011012    7   28180218
AX-100011016    1A  33254907
AX-100011017    5   1918020'
map <- read.table(text=map, header=T)

这是sample_1基于Windows的结果的结果:

Name    Chr Position    sample_1.GType  sample_1.Log R Ratio    sample_1.B Allele Freq
AX-100010998    Z   70667736    BB  Infinity    0.675637419295063
AX-100010999    4   36427048    AB  0.101639462657534   0.531373516807123
AX-100011005    26  4016045 AA  -0.111910305454305  0
AX-100011007    6   25439800    BB  0.148781943283483   1
AX-100011008    2   147800617   AB  -0.293273363654622  0.609503132331127
AX-100011010    1   98919397    BB  -0.283993308525307  0.960031843823016
AX-100011011    Z   66652642    AA  Infinity    0.00579049667757003
AX-100011012    7   28180218    AA  0.0245684274744242  0.032174599843476
AX-100011016    1A  33254907    AA  -0.265925457515035  0
AX-100011017    5   1918020 AA  -0.0091211520536838 0

基于Windows的工具的值似乎是正确的,但在Linux输出中并非如此。我按照penncnv软件(http://penncnv.openbioinformatics.org/en/latest/user-guide/input/)中描述的步骤进行操作,然后我使用summary.txt对我limma进行了log2转换并使用normalizeBetweenArrays(x)包进行了分位数归一化,最后使用{{ 1}}:

corrsummary.txt

因此我申请了:

      corrsum <- 'probeset_id  sample_1  sample_2  sample_3
    AX-100010998-A  9.804932  9.285738  9.530882
    AX-100010998-B 10.249239 10.528922  9.804932
    AX-100010999-A 10.528922 10.641862 10.134816
    AX-100010999-B 10.641862  8.472829 10.249239
    AX-100011005-A 10.804446 10.249239  8.816931
    AX-100011005-B  8.835381 11.186266 12.045852
    AX-100011007-A  8.542343  8.835381 11.039756
    AX-100011007-B 12.045852 12.045852 11.186266
    AX-100011008-A  8.816931  8.816931  8.472829
    AX-100011008-B 10.134816  9.910173 10.592867
    AX-100011010-A  8.472829  8.542343  8.542343
    AX-100011010-B 10.374032 10.134816 10.641862
    AX-100011011-A 11.593784 11.593784 11.593784
    AX-100011011-B 10.012055 11.039756  9.910173
    AX-100011012-A 11.186266 10.012055 10.804446
    AX-100011012-B  9.530882 10.592867  9.285738
    AX-100011016-A 10.592867 10.374032 10.012055
    AX-100011016-B  9.285738  9.530882 10.528922
    AX-100011017-A 11.039756 10.804446  8.835381
    AX-100011017-B  9.910173  9.804932 10.374032'
corrsum <- read.table(text=corrsum, header=T)

 ./generate_affy_geno_cluster.pl calls.txt confidences.txt corrsummary.txt --locfile map.txt --sexfile gender.txt --output gencluster

我的基于linux的结果 ./normalize_affy_geno_cluster.pl --locfile map.txt gencluster calls.txt --output lrrbaf.txt )必须包含lrrbaf.txtLRR信息,如下所示:

BAF

如上所示,linux结果与基于Windows的结果完全不同(并且没有多大意义),另外不包含output <- 'Name Chr Position sample_1.LogRRatio sample_1.BAlleleFreq sample_2.LogRRatio sample_2.BAlleleFreq sample_3.LogRRatio sample_3.BAlleleFreq AX-100010999 4 36427048 -1952.0739 2 -1953.0739 2 -1952.0739 2 AX-100011005 26 4016045 -2245.1784 2 -2244.1784 2 -2243.1784 2 AX-100011007 6 25439800 -4433.4661 2 -4433.4661 2 -4434.4661 2 AX-100011008 2 147800617 -1493.2287 2 -1493.2287 2 -1492.2287 2 AX-100011011 Z 66652642 -4088.2311 2 -4087.2311 2 -4088.2311 2 AX-100011012 7 28180218 -2741.2623 2 -2740.2623 2 -2741.2623 2 AX-100011016 1A 33254907 -2117.7005 2 -2117.7005 2 -2116.7005 2 AX-100011017 5 1918020 -3067.4077 2 -3067.4077 2 -3065.4077 2' output <- read.table(text=output, header=T) 中的GType列。很抱歉撰写这么长的问题,但我的目的是让它尽可能重现。我很感激任何解决这个问题的方法,以及关于我可能忘记的这类数据的任何重要评论。

0 个答案:

没有答案