I want to join two tab-separated files, but they are in a different order. I know it is doable with awk, but I don't know how. Here is equivalent toy Python code (without crazy workarounds, Python is too memory-inefficient for this task):
import pandas as pd
from random import shuffle

# Toy stand-in for "file one": identifiers plus a few data columns.
a = ['bar', 'qux', 'baz', 'foo', 'spam']
df = pd.DataFrame({'nam': a, 'asc': [1, 2, 3, 4, 5], 'desc': [5, 4, 3, 2, 1]})

# Toy stand-in for "file two": the same identifiers in a different order.
shuffle(a)
print(a)
dex = pd.DataFrame({'dex': a})
df_b = pd.DataFrame({'VAL1': [0, 1, 2, 3, 4, 5, 6]})

# Reorder df to match the identifier order in dex.
print(pd.merge(dex, df, left_on='dex', right_on='nam')[['asc', 'desc', 'nam']])
I have two files: In file one, column 2 contains an identifier for every line, there are 5 columns I don't need, and then there are roughly 3 million columns of data.
File two has 12 columns; its second column contains the same identifiers in a different order, along with other IDs.
I want file one sorted so that its identifiers match the order in file two, with the rest of its columns carried along accordingly.
File one may be several gigabytes.
Is this easier with awk and/or other GNU tools, or should I use Perl?
Answer 0 (score: 3)
If file1 is on the order of gigabytes in size and you have 3 million columns of data, then you have very few rows (200 or fewer). While you can't load all the rows themselves into memory, you can easily load the position of every row.
use strict;
use warnings;
use feature qw( say );

use Fcntl qw( SEEK_SET );

# File names are assumed to arrive as command-line arguments.
my ($qfn1, $qfn2) = @ARGV;

open(my $fh1, '<', $qfn1) or die("Can't open \"$qfn1\": $!\n");
open(my $fh2, '<', $qfn2) or die("Can't open \"$qfn2\": $!\n");

# First pass: record the byte offset of every row of file1,
# keyed by the identifier in column 2.
my %offsets;
while (1) {
    my $offset = tell($fh1);
    my $row1 = <$fh1>;
    last if !defined($row1);
    chomp($row1);
    my @fields1 = split(/\t/, $row1);
    my $key = $fields1[1];
    $offsets{$key} = $offset;
}

# Second pass: walk file2 in order, seeking into file1 for each key.
while (my $row2 = <$fh2>) {
    chomp($row2);
    my @fields2 = split(/\t/, $row2);
    my $key = $fields2[1];
    my $offset = $offsets{$key};
    if (!defined($offset)) {
        warn("Key $key not found.\n");
        next;
    }

    seek($fh1, $offset, SEEK_SET);
    my $row1 = <$fh1>;
    chomp($row1);
    my @fields1 = split(/\t/, $row1);
    # Skip file1's first 6 columns; the rest follow file2's columns.
    say join "\t", @fields2, @fields1[6..$#fields1];
}
This approach could also be used in Python.
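Here is a minimal sketch of the same offset-index idea in Python (not from the answer above; the file names file1.tsv and file2.tsv and the column layout are assumptions). Binary mode is used so tell() and seek() operate on plain byte offsets:
import sys

# Hypothetical file names; the identifier is in column 2 of each file.
offsets = {}
with open('file1.tsv', 'rb') as f1:
    while True:
        offset = f1.tell()
        row = f1.readline()
        if not row:
            break
        # Split only far enough to reach the ID column.
        offsets[row.split(b'\t', 2)[1]] = offset

with open('file1.tsv', 'rb') as f1, open('file2.tsv', 'rb') as f2:
    for row2 in f2:
        row2 = row2.rstrip(b'\n')
        key = row2.split(b'\t', 2)[1]
        offset = offsets.get(key)
        if offset is None:
            sys.stderr.write('Key %s not found.\n' % key.decode())
            continue
        f1.seek(offset)
        # Drop file1's first 6 columns; keep the data columns verbatim.
        rest = f1.readline().rstrip(b'\n').split(b'\t', 6)[6]
        sys.stdout.buffer.write(row2 + b'\t' + rest + b'\n')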
Note: A much simpler solution exists if the ordering requirement is more flexible (i.e., if you are fine with the output being ordered the way the records are ordered in file1). It assumes file2 fits easily into memory.
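A sketch of that simpler variant, again with assumed file names and the ID in column 2: load file2 into a dict up front, then stream file1 once, emitting joined rows in file1's order.
# Hypothetical file names; IDs live in column 2 of both files.
rows2 = {}
with open('file2.tsv') as f2:
    for line in f2:
        line = line.rstrip('\n')
        rows2[line.split('\t', 2)[1]] = line

with open('file1.tsv') as f1:
    for line in f1:
        line = line.rstrip('\n')
        key = line.split('\t', 2)[1]
        if key in rows2:
            # file2's columns first, then file1's data columns (7 onward).
            print(rows2[key] + '\t' + line.split('\t', 6)[6])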
Answer 1 (score: 2)
The important thing is not to split more than necessary. If you have enough memory, loading the smaller file into a hash and then streaming through the second file should work.
Consider the following example (note that the script's runtime includes the time needed to create the sample data):
#!/usr/bin/env perl

use strict;
use warnings;

# This is a string containing 10 lines corresponding to your "file one".
# The second column has the record ID.
# Normally, you'd be reading this from a file.
my $big_file = join "\n",
    map join("\t", 'x', $_, ('x') x 3_000_000),
    1 .. 10
;

# This is a string containing 10 lines corresponding to your "file two".
# The second column has the record ID.
my $small_file = join "\n",
    map join("\t", 'y', $_, ('y') x 10),
    1 .. 10
;

# You would normally pass file names as arguments; scalar references
# are opened as in-memory files here.
join_with_big_file(
    \$small_file,
    \$big_file,
);

sub join_with_big_file {
    my $small_records = load_small_file(shift);
    my $big_file = shift;

    open my $fh, '<', $big_file
        or die "Cannot open '$big_file': $!";

    while (my $line = <$fh>) {
        chomp $line;
        my ($first, $id, $rest) = split /\t/, $line, 3;
        print join("\t", $first, $id, $rest, $small_records->{$id}), "\n";
    }

    return;
}

sub load_small_file {
    my $file = shift;
    my %records;

    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!";

    while (my $line = <$fh>) {
        chomp $line;  # without this, the stored record keeps its newline
        # limit the split so the data columns stay in one chunk
        my ($first, $id, $rest) = split /\t/, $line, 3;
        # I drop the id field here so it is not duplicated in the joined
        # file. If that is not a problem, $records{$id} = $line
        # would be better.
        $records{$id} = join("\t", $first, $rest);
    }

    return \%records;
}
Answer 2 (score: 1)
3 million columns of data, eh? It sounds like you're doing some NLP work.
Assuming this is true and your matrix is sparse, Python can handle it just fine (just not with pandas). Take a look at scipy.sparse. For example:
from scipy.sparse import dok_matrix

# Two sparse 10x10 matrices with a single nonzero entry each.
A = dok_matrix((10, 10))
A[1, 1] = 1
B = dok_matrix((10, 10))
B[2, 2] = 2
print(A + B)
DOK stands for "dictionary of keys"; it is typically used to build a sparse matrix incrementally, which is then converted to CSR or another format depending on use. See the available sparse matrix types.
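For instance, once the matrix is built, converting it to CSR (better suited to arithmetic and row slicing) is a single method call:
from scipy.sparse import dok_matrix

A = dok_matrix((10, 10))
A[1, 1] = 1
# Convert the build-friendly DOK format to CSR for efficient
# arithmetic and row slicing.
C = A.tocsr()
print(C.nnz)  # 1 stored nonzero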