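I am trying to combine partially matching strings from two files.
File 1 contains a list of unique strings. Each of these strings partially matches many strings in file 2. How can I merge the line from file 1 with the line from file 2 for every matching case?
File 1: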
mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660
File 2:
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
Desired output:
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
I tried using pmatch() in R, but could not get it to work. It looks like something Perl would handle?
Maybe something like this:
perl -ne'exec q;perl;, "-ne", q $print (/\Q$.$1.q;/?"$. YES":$. .q\; NO\;);, "file2" if m;^(.*)_pat1;' file1
Answer 0 (score: 4)
Here is a short Perl solution that stores all the data from file1 in a hash and then looks it up while scanning file2.
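use strict;
use warnings;
use autodie;

my @files = qw/ file1.txt file2.txt /;

# Build a hash from file1: text before the first underscore => rest of the line
my %file1 = do {
    open my $fh, '<', $files[0];
    map /([^_]+)_(\S+)/, <$fh>;
};

# Scan file2 and prefix each line with the matching file1 entry
open my $fh, '<', $files[1];
while (<$fh>) {
    my ($key) = /([^_]+)/;
    next unless defined $key and exists $file1{$key};  # skip lines with no file1 entry
    printf "%-32s%s", "${key}_$file1{$key}", $_;
}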
Output:
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
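Answer 1 (score: 3)
Of course you can do this in R. Indeed, calling pmatch on the whole strings will not give you the desired result; you have to match the appropriate substrings.
I am assuming that the first identifier in file 1 is 677 and not 667, as otherwise it would be hard to guess the matching scheme (I assume your example is just part of a larger database).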
file1 <- readLines(textConnection('mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660'))
file2 <- readLines(textConnection('mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC'))
library(stringi)
file1_id <- stri_extract_first_regex(file1, "^.*?(?=_)")
file2_id <- stri_extract_first_regex(file2, "^.*?(?=_)")
cbind(file1=file1[match(file2_id, file1_id)], file2=file2)
## file1 file2
## [1,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA"
## [2,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT"
## [3,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT"
## [4,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC"
## [5,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC"
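If installing stringi is not an option, the same prefix lookup can be sketched in base R. This is only a sketch, assuming the key is everything before the first underscore; key_of and pair_lines are hypothetical helper names, and the third file2 line is made up to show what happens without a partner. Unlike the cbind call above, which returns NA for such lines, this version drops them and prints space-separated pairs as in the desired output.

```r
# Hypothetical helper: the key is everything before the first underscore.
key_of <- function(x) sub("_.*$", "", x)

# Hypothetical helper: pair each file2 line with the file1 line sharing its key.
pair_lines <- function(file1, file2) {
  idx <- match(key_of(file2), key_of(file1))  # NA where file2 has no partner
  keep <- !is.na(idx)
  paste(file1[idx[keep]], file2[keep])
}

file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC",
           "mmu-miR-9999-5p_ACGT")  # made-up line with no partner in file1

writeLines(pair_lines(file1, file2))
```

Because match() returns only the first hit, this relies on the file1 keys being unique, which the question states.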
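Answer 2 (score: 2)
You can use agrep for fuzzy searching. You will need to experiment with the distance; here I fixed it manually at 11.
Basically, I do this to extract the file2 line numbers that match each string in file1:
sapply(file1, agrep, file2, max.distance = 11)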
$`mmu-miR-677-5p_MIMAT0017239`
[1] 1 2 3
$`mmu-miR-181a-1-3p_MIMAT0000660`
[1] 4 5
To get the result as a data.frame:
do.call(rbind,
        lapply(file1, function(x)
          data.frame(file1 = x,
                     file2 = agrep(x, file2, max.distance = 11, value = TRUE))))
file1 file2
1 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
2 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
3 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
4 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
5 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
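Since the right distance depends on the data, it may help to cross-check a fuzzy result against an exact prefix match. The following is a minimal sketch, assuming the shared part is everything up to and including the first underscore; prefix_join is a hypothetical helper name, and the mmu-miR-677-3p line is made up to stand in for a near neighbour that a loose threshold could conceivably admit. It uses startsWith(), which is available in base R from version 3.3.0.

```r
# Hypothetical helper: exact prefix join, returned as a data.frame.
prefix_join <- function(file1, file2) {
  # Keep the trailing underscore so that one key cannot match a longer key.
  prefix <- sub("_.*$", "_", file1)   # e.g. "mmu-miR-677-5p_"
  rows <- lapply(seq_along(file1), function(i) {
    hits <- file2[startsWith(file2, prefix[i])]
    data.frame(file1 = rep(file1[i], length(hits)),
               file2 = hits,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-677-3p_ACGTACGT",   # made-up near neighbour
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC")

res <- prefix_join(file1, file2)
print(res)
```

Comparing the rows of res with the agrep result is one way to spot pairs that only the fuzzy match produces.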