Merging/joining two large files

Time: 2014-03-28 11:23:18

Tags: java c join awk merge

I want to join two files on their first column. File 1 contains 46395029 lines and file 2 contains 86510559.

FILE1.TXT

>ID sequence
CJP75M1:393:C2T21ACXX:8:1101:2069:1997 1:N:0:_45    TAGTATTACGACG
CJP75M1:393:C2T21ACXX:8:1101:2711:1992 1:N:0:_65    TCCGAGGCCCTGTAATTGGAATGAGTAC
CJP75M1:393:C2T21ACXX:8:1101:3822:1989 1:N:0:_115   CCGGAGAGGGAGCCTGAGAAACGGCTACCAC

FILE2.TXT

>ID      Barcode
CJP75M1:393:C2T21ACXX:8:1101:2069:1997 1:N:0:_45    CTCG
CJP75M1:393:C2T21ACXX:8:1101:2711:1992 1:N:0:_65        CTAG
CJP75M1:393:C2T21ACXX:8:1101:3822:1989 1:N:0:_115       CTGG

I want to merge the two files on the first column so that the result looks like this:

>TAGTATTACGACG    CTCG
TCCGAGGCCCTGTAATTGGAATGAGTAC     CTAG
CCGGAGAGGGAGCCTGAGAAACGGCTACCAC     CTGG

I only want the lines from file 1, so the result file should contain "only" 46395029 lines. I did it with awk:

    awk 'BEGIN {FS= "\t"; OFS="\t"} { while (getline < "file1.txt") { f[$1] = $2} {print $2, f[$1] }}' "file2.txt" | sed '1d' > result.txt

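But it is really slow (it has been running for 2 days). I have a 64-bit Linux Debian (stable) server with 16 GB of RAM.

Any ideas? Thanks.

3 answers:

Answer 0 (score: 1)

Here is another way with awk: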

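awk 'FNR==1{next} NR==FNR{map[$1,$2]=$3; next} (($1,$2) in map){print $3, map[$1,$2]}' file2 file1
  • Skip the first line of both files
  • From file2, build an array indexed on columns 1 and 2
  • For each line of file1 whose key exists in map, print the sequence followed by the barcode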
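
The steps above can also be turned around so that only the smaller file is held in memory. The following is a minimal sketch, not a tuned solution. It assumes the files are tab-separated (as in the question's own awk command), have one header line each, and that IDs are unique. The short IDs (read1, read2, ...) and the directory name hashjoin_demo are made up for illustration.

```shell
#!/bin/sh
# Sketch: hash join with awk, loading only file1 (the smaller file) into memory.
# Assumptions: tab-separated input, one header line per file, unique IDs.
set -eu
dir=hashjoin_demo          # hypothetical scratch directory
mkdir -p "$dir"

# Tiny stand-ins for file1.txt (ID, sequence) and file2.txt (ID, barcode).
printf '>ID\tsequence\nread1 1:N:0:_45\tTAGTATTACGACG\nread2 1:N:0:_65\tTCCGAGG\n' > "$dir/file1.txt"
printf '>ID\tBarcode\nread1 1:N:0:_45\tCTCG\nread3 1:N:0:_99\tCTGG\nread2 1:N:0:_65\tCTAG\n' > "$dir/file2.txt"

awk -F '\t' -v OFS='\t' '
    FNR == 1  { next }                  # skip the header line of each file
    NR == FNR { seq[$1] = $2; next }    # first file: remember ID -> sequence
    $1 in seq { print seq[$1], $2 }     # second file: print sequence, barcode
' "$dir/file1.txt" "$dir/file2.txt" > "$dir/result.txt"

cat "$dir/result.txt"
```

On this toy input the result has two lines (read1 and read2); read3 exists only in file2 and is dropped. Testing with `$1 in seq` rather than `seq[$1]` matters at this scale, because merely referencing a missing key creates an empty array entry for it.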

Answer 1 (score: 1)

Here is a solution in Java (7+), since you asked for it :)

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.nio.file.StandardOpenOption.*;

public final class Job
{
    private static final Pattern PATTERN
        = Pattern.compile("(\\S+\\s+\\S+)\\s+(.*)");

    public static void main(final String... args)
        throws IOException
    {
        final Map<String, String> fromFile1 = new HashMap<>();

        final Charset charset = StandardCharsets.US_ASCII;
        final Path file1 = Paths.get("/tmp/f1.txt");
        final Path file2 = Paths.get("/tmp/f2.txt");
        final Path dstfile = Paths.get("/tmp/dst.txt");
        Matcher m;
        String line, key, value;
        StringBuilder sb;

        // Lines from file 1
        try (
            final BufferedReader reader = Files.newBufferedReader(file1,
                charset);
        ) {
            reader.readLine();
            while ((line = reader.readLine()) != null) {
                m = PATTERN.matcher(line);
                if (m.matches())
                    fromFile1.put(m.group(1), m.group(2));
            }
        }

        // Write in destination file
        try (
            final BufferedReader reader = Files.newBufferedReader(file2,
                charset);
            final BufferedWriter writer = Files.newBufferedWriter(dstfile,
                charset, CREATE, TRUNCATE_EXISTING);
        ) {
            reader.readLine();
            while ((line = reader.readLine()) != null) {
                m = PATTERN.matcher(line);
                if (!m.matches())
                    continue;
                key = m.group(1);
                value = fromFile1.get(key);
                if (value == null)
                    continue;
                sb = new StringBuilder(value).append('\t')
                    .append(m.group(2)).append('\n');
                writer.write(sb.toString());
            }
            writer.flush();
        }
    }
}

Put this in a file named Job.java. To compile it, you need JDK 7+, and then:

javac Job.java

To run it, you will need quite a lot of memory, so:

java -Xmx4G Job

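Of course, change the paths as appropriate!


Note that if you have to process files like this often, I recommend making the lines fixed-width if possible; processing will be faster. Or maybe even use a database engine?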

Answer 2 (score: 1)

The join command may be what you need. join requires the input files to be sorted on the join field:

join -o 1.3,2.3 -a 1 -e "??" <(sed 1d file1.txt | sort -k1,1) <(sed 1d file2.txt | sort -k1,1)

With your sample data, this produces:

CGGACGTGATCACTGTGACGCCTTGCGTGTTACGGTTGTT CNCG
TAGTATTACGACG AGGC
TCCGAGGCCCTGTAATTGGAATGAGTAC ??
CCGGAGAGGGAGCCTGAGAAACGGCTACCAC ??
TTGGAGGGC ??
TTGATGGTAGTATC ??
AATAAAACGATGCATTTATGTATTTTTGATT ??
TCCTCGATAGTATAGTGGTTAGTATCCCCGCC ??
TGATGGTAGTATC ??
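
For the case where the ID itself contains a space and the columns are separated by tabs (an assumption taken from the question's FS="\t"), the same sort-then-join idea might look roughly like this. It is only a sketch on made-up toy data: the IDs and the directory name sortjoin_demo are hypothetical, and temporary sorted files are used instead of <( ... ) so that it also runs in a plain POSIX sh.

```shell
#!/bin/sh
# Sketch: sort-merge join on a tab-separated ID that contains a space.
# Assumptions: tab-separated input, one header line per file.
set -eu
export LC_ALL=C            # same byte-wise collation for both sort and join
dir=sortjoin_demo          # hypothetical scratch directory
mkdir -p "$dir"
tab=$(printf '\t')

# Tiny stand-ins for file1.txt (ID, sequence) and file2.txt (ID, barcode).
printf '>ID\tsequence\nread2 1:N:0:_65\tTCCGAGG\nread1 1:N:0:_45\tTAGTATTACGACG\nread4 1:N:0:_70\tTTGGAGGGC\n' > "$dir/file1.txt"
printf '>ID\tBarcode\nread1 1:N:0:_45\tCTCG\nread3 1:N:0:_99\tCTGG\nread2 1:N:0:_65\tCTAG\n' > "$dir/file2.txt"

# Drop the header line, then sort each file on the first tab-separated field.
sed 1d "$dir/file1.txt" | sort -t "$tab" -k1,1 > "$dir/file1.sorted"
sed 1d "$dir/file2.txt" | sort -t "$tab" -k1,1 > "$dir/file2.sorted"

# -a 1 keeps file1 lines without a match; -e fills their barcode with "??".
join -t "$tab" -o 1.2,2.2 -a 1 -e '??' \
    "$dir/file1.sorted" "$dir/file2.sorted" > "$dir/result.txt"

cat "$dir/result.txt"
```

Here read4 has no barcode and comes out as "TTGGAGGGC ??", while read3 (only in file2) is dropped. Because sort spills to temporary files rather than holding everything in RAM, this approach does not need the whole input to fit in memory; with GNU sort, the -S (buffer size) and -T (temporary directory) options can be adjusted for files of this size.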

With this much data, I think the best approach is to import it into a relational database (e.g. sqlite) and use SQL to generate the report.
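
As a rough illustration of the database route, here is a sketch using the sqlite3 command-line shell. It assumes sqlite3 is installed, that the files are tab-separated with one header line, and that IDs are unique in file2. The table names (seqs, barcodes), the toy IDs and the directory name sqlite_demo are invented for the example.

```shell
#!/bin/sh
# Sketch: import both files into SQLite and produce the report with a LEFT JOIN.
# Assumptions: sqlite3 CLI available, tab-separated input, unique IDs in file2.
set -eu
command -v sqlite3 >/dev/null 2>&1 || { echo "sqlite3 CLI not found" >&2; exit 0; }
dir=sqlite_demo            # hypothetical scratch directory
mkdir -p "$dir"
rm -f "$dir/reads.db"

# Tiny stand-ins for file1.txt (ID, sequence) and file2.txt (ID, barcode).
printf '>ID\tsequence\nread1 1:N:0:_45\tTAGTATTACGACG\nread2 1:N:0:_65\tTCCGAGG\nread4 1:N:0:_70\tTTGGAGGGC\n' > "$dir/file1.txt"
printf '>ID\tBarcode\nread1 1:N:0:_45\tCTCG\nread3 1:N:0:_99\tCTGG\nread2 1:N:0:_65\tCTAG\n' > "$dir/file2.txt"

# .import into an existing table treats every line as data, so drop the headers.
sed 1d "$dir/file1.txt" > "$dir/file1.tsv"
sed 1d "$dir/file2.txt" > "$dir/file2.tsv"

sqlite3 "$dir/reads.db" <<'EOF'
CREATE TABLE seqs (id TEXT, sequence TEXT);
CREATE TABLE barcodes (id TEXT PRIMARY KEY, barcode TEXT);
.separator "\t"
.import sqlite_demo/file1.tsv seqs
.import sqlite_demo/file2.tsv barcodes
.output sqlite_demo/result.txt
SELECT s.sequence, COALESCE(b.barcode, '??')
FROM seqs AS s LEFT JOIN barcodes AS b ON b.id = s.id;
EOF

cat "$dir/result.txt"
```

The LEFT JOIN keeps every row of file1, which matches the requirement that the result has exactly as many lines as file1; the primary key on barcodes gives the join an index to look up each ID.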