I want to join 2 files on the first column: file 1 contains 46395029 lines and file 2 contains 86510559.
FILE1.TXT
ID sequence
CJP75M1:393:C2T21ACXX:8:1101:2069:1997 1:N:0:_45 TAGTATTACGACG
CJP75M1:393:C2T21ACXX:8:1101:2711:1992 1:N:0:_65 TCCGAGGCCCTGTAATTGGAATGAGTAC
CJP75M1:393:C2T21ACXX:8:1101:3822:1989 1:N:0:_115 CCGGAGAGGGAGCCTGAGAAACGGCTACCAC
FILE2.TXT
ID Barcode
CJP75M1:393:C2T21ACXX:8:1101:2069:1997 1:N:0:_45 CTCG
CJP75M1:393:C2T21ACXX:8:1101:2711:1992 1:N:0:_65 CTAG
CJP75M1:393:C2T21ACXX:8:1101:3822:1989 1:N:0:_115 CTGG
I want to merge these two files on the first column:
TAGTATTACGACG CTCG
TCCGAGGCCCTGTAATTGGAATGAGTAC CTAG
CCGGAGAGGGAGCCTGAGAAACGGCTACCAC CTGG
I only want the lines from file 1, so the result file should contain "only" 46395029 lines. I did it with awk:
awk 'BEGIN {FS= "\t"; OFS="\t"} { while (getline < "file1.txt") { f[$1] = $2} {print $2, f[$1] }}' "file2.txt" | sed '1d' > result.txt
But it is really slow (it has been running for 2 days). I have a 64-bit Linux Debian (stable) server with 16 GB of RAM.
Any ideas? Thanks.
Answer 0 (score: 1)
Here is another way with awk:
awk 'FNR==1{next} NR==FNR{map[$1,$2]=$3; next} (($1,$2) in map){print $3, map[$1,$2]}' file2 file1
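Since file1 is the smaller file, a variant of the same idea is to keep file1 in memory and stream file2 past it once. This is only a sketch: merge_by_id is a made-up helper name, and it assumes whitespace-separated input where the ID is two tokens, as in the sample data. Only IDs present in both files are printed, in file2 order.

```shell
# Sketch: hold file1 (the smaller file) in memory, stream file2 once.
# merge_by_id is a hypothetical helper name, not part of any tool.
# Usage: merge_by_id file1.txt file2.txt > result.txt
merge_by_id() {
    awk '
        FNR == 1 { next }                               # skip the header of each file
        NR == FNR { seq_by_id[$1, $2] = $3; next }      # first file: ID -> sequence
        ($1, $2) in seq_by_id { print seq_by_id[$1, $2], $3 }  # second file: sequence, barcode
    ' "$1" "$2"
}
```

Whether 46 million entries fit in 16 GB depends on the awk implementation, so it is worth trying on a slice of the data first.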
Answer 1 (score: 1)
Here is a solution in Java (7+), since you asked for it :)
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import static java.nio.file.StandardOpenOption.*;
public final class Job
{
    private static final Pattern PATTERN
        = Pattern.compile("(\\S+\\s+\\S+)\\s+(.*)");

    public static void main(final String... args)
        throws IOException
    {
        final Map<String, String> fromFile1 = new HashMap<>();
        final Charset charset = StandardCharsets.US_ASCII;
        final Path file1 = Paths.get("/tmp/f1.txt");
        final Path file2 = Paths.get("/tmp/f2.txt");
        final Path dstfile = Paths.get("/tmp/dst.txt");
        Matcher m;
        String line, key, value;
        StringBuilder sb;

        // Lines from file 1
        try (
            final BufferedReader reader = Files.newBufferedReader(file1,
                charset);
        ) {
            reader.readLine();
            while ((line = reader.readLine()) != null) {
                m = PATTERN.matcher(line);
                if (m.matches())
                    fromFile1.put(m.group(1), m.group(2));
            }
        }

        // Write in destination file
        try (
            final BufferedReader reader = Files.newBufferedReader(file2,
                charset);
            final BufferedWriter writer = Files.newBufferedWriter(dstfile,
                charset, CREATE, TRUNCATE_EXISTING);
        ) {
            reader.readLine();
            while ((line = reader.readLine()) != null) {
                m = PATTERN.matcher(line);
                if (!m.matches())
                    continue;
                key = m.group(1);
                value = fromFile1.get(key);
                if (value == null)
                    continue;
                sb = new StringBuilder(value).append('\t')
                    .append(m.group(2)).append('\n');
                writer.write(sb.toString());
            }
            writer.flush();
        }
    }
}
Put this in a file named Job.java. To compile, you need JDK 7+, then:
javac Job.java
To run it, you need quite a lot of memory, so:
java -Xmx4G Job
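Of course, change the paths as appropriate!
Note that if you have to manipulate such files often, I suggest making the lines fixed-length if possible; processing will be faster. Or maybe even use a database engine?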
Answer 2 (score: 1)
The join command may be what you need. join requires the input files to be sorted on the join field:
join -o 1.3,2.3 -a 1 -e "??" <(sed 1d file1.txt | sort -k1,1) <(sed 1d file2.txt | sort -k1,1)
With your sample data, this produces:
CGGACGTGATCACTGTGACGCCTTGCGTGTTACGGTTGTT CNCG
TAGTATTACGACG AGGC
TCCGAGGCCCTGTAATTGGAATGAGTAC ??
CCGGAGAGGGAGCCTGAGAAACGGCTACCAC ??
TTGGAGGGC ??
TTGATGGTAGTATC ??
AATAAAACGATGCATTTATGTATTTTTGATT ??
TCCTCGATAGTATAGTGGTTAGTATCCCCGCC ??
TGATGGTAGTATC ??
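If the files are tab-separated (the awk attempt in the question sets FS to a tab, so I am assuming they are), a variant that treats the whole two-token ID as the join key might look like the sketch below. join_by_id is a made-up helper name. It uses temporary files instead of process substitution so it also runs under plain sh, and it forces LC_ALL=C so that sort and join agree on the ordering.

```shell
# Sketch: join on the full tab-delimited ID; assumes tab-separated input
# with one header line per file. join_by_id is a hypothetical helper name.
# Usage: join_by_id file1.txt file2.txt > result.txt
join_by_id() {
    tab=$(printf '\t')
    tmpdir=$(mktemp -d) || return 1
    # Drop the header line, then sort each file on the ID column.
    tail -n +2 "$1" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmpdir/a"
    tail -n +2 "$2" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmpdir/b"
    # Keep every line of file 1 (-a 1); fill a missing barcode with "??".
    LC_ALL=C join -t "$tab" -o 1.2,2.2 -a 1 -e '??' "$tmpdir/a" "$tmpdir/b"
    rm -rf "$tmpdir"
}
```

The output comes out sorted by ID rather than in the original file1 order.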
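With this much data, I think the best approach would be to import the data into a relational database (e.g. sqlite) and use SQL to generate the report.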
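As a rough sketch of the sqlite route: this assumes the sqlite3 command-line shell is installed and that both files are tab-separated, with header rows "ID, sequence" and "ID, Barcode". The table names f1 and f2, the index name and the sqlite_merge helper are placeholders of mine. When the target table does not exist yet, .import takes the column names from the first row.

```shell
# Sketch: load both files into SQLite and let SQL do the join.
# sqlite_merge is a hypothetical helper name.
# Usage: sqlite_merge file1.txt file2.txt merge.db > result.txt
sqlite_merge() {
    sqlite3 "$3" <<EOF
.mode tabs
.import "$1" f1
.import "$2" f2
CREATE INDEX f2_id ON f2(ID);
SELECT f1.sequence, COALESCE(f2.Barcode, '??')
FROM f1 LEFT JOIN f2 ON f2.ID = f1.ID;
EOF
}
```

The LEFT JOIN keeps every row of file 1, with "??" standing in for a missing barcode as in the join example. The index on f2 avoids scanning all of file 2 for each row of file 1.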