Merging two files in Linux (or R) on a common column that contains duplicates

Time: 2013-08-06 16:21:45

Tags: linux r duplicates

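I have two large files, each with many columns and rows. Both files contain a TAG column; in one file its values are unique, while in the other they are repeated.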

They look like this:

FILE1:

stat    stat    P-value tag
0.3049  7.464   1.875e-11   L2_None_chr1_-_109092036
0.2961  7.448   2.105e-11   L2_None_chr1_-_109092036
0.2934  7.347   3.389e-11   L2_None_chr1_-_109092036
0.2961  7.245   5.668e-11   L2_None_chr1_-_109092036
0.6682  7.284   4.664e-11   L2_None_chr1_-_109957962
0.6682  7.284   4.664e-11   L2_None_chr1_-_109957962
0.3933  7.363   3.127e-11   L2_None_chr1_-_159842839
0.3808  7.284   4.672e-11   L2_None_chr1_-_159842839
0.2993  7.17    8.278e-11   L2_None_chr1_-_169972458
0.3312  7.817   3.075e-12   L2_None_chr1_-_203626998
0.3312  7.817   3.075e-12   L2_None_chr1_-_203626998
0.614   7.616   9.742e-12   L2_None_chr1_-_569826
0.6411  7.58    1.037e-11   L2_None_chr1_-_569826
0.5755  7.275   4.871e-11   L2_None_chr1_-_569826
0.6893  7.26    5.255e-11   L2_None_chr1_-_6546011
0.3136  7.529   1.35e-11    L2_None_chr1_-_91180355
0.3262  7.449   2.023e-11   L2_None_chr1_-_91180355
0.298   7.151   9.129e-11   L2_None_chr1_-_91180355
0.2999  7.149   9.201e-11   L2_None_chr1_-_91182695
0.5383  7.189   7.534e-11   L2_None_chr1_-_91183491

FILE2:

L2_None_chr1_-_109092036    chr1    109092034
L2_None_chr1_-_109957962    chr1    109957879
L2_None_chr1_-_159842839    chr1    159842779
L2_None_chr1_-_169972458    chr1    169972444
L2_None_chr1_-_203626998    chr1    203626983
L2_None_chr1_-_569826   chr1    569802
L2_None_chr1_-_6546011  chr1    6545930
L2_None_chr1_-_91180355 chr1    91180310
L2_None_chr1_-_91182695 chr1    91182572
L2_None_chr1_-_91183491 chr1    91183389

What I want:

stat    P-value tag tag chr bp
7.464   1.875e-11   L2_None_chr1_-_109092036    L2_None_chr1_-_109092036    1   109092036
7.448   2.105e-11   L2_None_chr1_-_109092036    L2_None_chr1_-_109092036    1   109092036
7.347   3.389e-11   L2_None_chr1_-_109092036    L2_None_chr1_-_109092036    1   109092036
7.245   5.668e-11   L2_None_chr1_-_109092036    L2_None_chr1_-_109092036    1   109092036
7.284   4.664e-11   L2_None_chr1_-_109957962    L2_None_chr1_-_109957962    1   109957962
7.284   4.664e-11   L2_None_chr1_-_109957962    L2_None_chr1_-_109957962    1   109957962
7.363   3.127e-11   L2_None_chr1_-_159842839    L2_None_chr1_-_159842839    1   159842839
7.284   4.672e-11   L2_None_chr1_-_159842839    L2_None_chr1_-_159842839    1   159842839
7.17    8.278e-11   L2_None_chr1_-_169972458    L2_None_chr1_-_169972458    1   169972458
7.817   3.075e-12   L2_None_chr1_-_203626998    L2_None_chr1_-_203626998    1   203626998
7.817   3.075e-12   L2_None_chr1_-_203626998    L2_None_chr1_-_203626998    1   203626998
7.616   9.742e-12   L2_None_chr1_-_569826   L2_None_chr1_-_569826   1   569826
7.58    1.037e-11   L2_None_chr1_-_569826   L2_None_chr1_-_569826   1   569826
7.275   4.871e-11   L2_None_chr1_-_569826   L2_None_chr1_-_569826   1   569826
7.26    5.255e-11   L2_None_chr1_-_6546011  L2_None_chr1_-_6546011  1   6546011
7.529   1.35e-11    L2_None_chr1_-_91180355 L2_None_chr1_-_91180355 1   91180355
7.449   2.023e-11   L2_None_chr1_-_91180355 L2_None_chr1_-_91180355 1   91180355
7.151   9.129e-11   L2_None_chr1_-_91180355 L2_None_chr1_-_91180355 1   91180355
7.149   9.201e-11   L2_None_chr1_-_91182695 L2_None_chr1_-_91182695 1   91182695
7.189   7.534e-11   L2_None_chr1_-_91183491 L2_None_chr1_-_91183491 1   91183491

I tried the match function in R, but it didn't quite get me there...

2 Answers:

Answer 0: (score: 2)

This should do it (here dat holds FILE1 and dat1 holds FILE2, with its first column named tag):

 merge(dat,dat1,by.x='tag',by.y='tag')
                      tag   stat stat.1   P.value   V2        V3
1  L2_None_chr1_-_109092036 0.3049  7.464 1.875e-11 chr1 109092034
2  L2_None_chr1_-_109092036 0.2961  7.448 2.105e-11 chr1 109092034
3  L2_None_chr1_-_109092036 0.2934  7.347 3.389e-11 chr1 109092034
4  L2_None_chr1_-_109092036 0.2961  7.245 5.668e-11 chr1 109092034
5  L2_None_chr1_-_109957962 0.6682  7.284 4.664e-11 chr1 109957879
6  L2_None_chr1_-_109957962 0.6682  7.284 4.664e-11 chr1 109957879
7  L2_None_chr1_-_159842839 0.3933  7.363 3.127e-11 chr1 159842779
8  L2_None_chr1_-_159842839 0.3808  7.284 4.672e-11 chr1 159842779
9  L2_None_chr1_-_169972458 0.2993  7.170 8.278e-11 chr1 169972444
10 L2_None_chr1_-_203626998 0.3312  7.817 3.075e-12 chr1 203626983
11 L2_None_chr1_-_203626998 0.3312  7.817 3.075e-12 chr1 203626983
12    L2_None_chr1_-_569826 0.6140  7.616 9.742e-12 chr1    569802
13    L2_None_chr1_-_569826 0.6411  7.580 1.037e-11 chr1    569802
14    L2_None_chr1_-_569826 0.5755  7.275 4.871e-11 chr1    569802
15   L2_None_chr1_-_6546011 0.6893  7.260 5.255e-11 chr1   6545930
16  L2_None_chr1_-_91180355 0.3136  7.529 1.350e-11 chr1  91180310
17  L2_None_chr1_-_91180355 0.3262  7.449 2.023e-11 chr1  91180310
18  L2_None_chr1_-_91180355 0.2980  7.151 9.129e-11 chr1  91180310
19  L2_None_chr1_-_91182695 0.2999  7.149 9.201e-11 chr1  91182572
20  L2_None_chr1_-_91183491 0.5383  7.189 7.534e-11 chr1  91183389

Answer 1: (score: 1)

You are probably looking for the Linux join command. man join is a good place to start; your command would look something like this:

join -1 4 -2 1 <(sort -k4,4 FILE1) <(sort -k1,1 FILE2)

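-1 and -2 specify the field in each file that is used for matching. join expects both inputs to be sorted on that field; if the files are already sorted that way, the sort step is not needed.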
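As a rough, self-contained sketch of the join approach: the script below builds two tiny sample files (a shortened, space-separated subset of the data in the question, written to a temporary directory that is purely an assumption for illustration), removes FILE1's header line, sorts each file on its own join field, and joins them. It writes the sorted copies to intermediate files instead of using process substitution so that it also runs under a plain POSIX shell.

```shell
#!/bin/sh
# Sketch only: sample data and file locations are placeholders.
LC_ALL=C; export LC_ALL   # make sort and join agree on collation order

workdir=$(mktemp -d)

# FILE1: tag is the 4th field and repeats; first line is a header.
printf '%s\n' \
  'stat stat P-value tag' \
  '0.3049 7.464 1.875e-11 L2_None_chr1_-_109092036' \
  '0.2961 7.448 2.105e-11 L2_None_chr1_-_109092036' \
  '0.6893 7.26 5.255e-11 L2_None_chr1_-_6546011' > "$workdir/FILE1"

# FILE2: tag is the 1st field and is unique.
printf '%s\n' \
  'L2_None_chr1_-_109092036 chr1 109092034' \
  'L2_None_chr1_-_6546011 chr1 6545930' > "$workdir/FILE2"

# Drop the header, then sort each file on the field join will match on.
tail -n +2 "$workdir/FILE1" | sort -k4,4 > "$workdir/FILE1.sorted"
sort -k1,1 "$workdir/FILE2" > "$workdir/FILE2.sorted"

# Field 4 of the first file is matched against field 1 of the second.
joined=$(join -1 4 -2 1 "$workdir/FILE1.sorted" "$workdir/FILE2.sorted")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

Each output line starts with the tag, followed by the remaining FILE1 fields and then the remaining FILE2 fields. Every duplicated tag in FILE1 is paired with the single matching row of FILE2. Note that sorting changes the original row order of FILE1.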