I want to join two tab-separated files, but they are in a different order. I know it is doable with awk, but I don't know how. Here is equivalent toy Python code (without crazy workarounds, Python is too memory-inefficient for this task):
import pandas as pd
from random import shuffle

# Toy stand-in for "file one": identifiers plus a few data columns.
a = ['bar', 'qux', 'baz', 'foo', 'spam']
df = pd.DataFrame({'nam': a, 'asc': [1, 2, 3, 4, 5], 'desc': [5, 4, 3, 2, 1]})

# Toy stand-in for "file two": the same identifiers in a different order.
shuffle(a)
print(a)
dex = pd.DataFrame({'dex': a})
df_b = pd.DataFrame({'VAL1': [0, 1, 2, 3, 4, 5, 6]})

# Reorder df to match the identifier order in dex.
print(pd.merge(dex, df, left_on='dex', right_on='nam')[['asc', 'desc', 'nam']])
I have two files: In file one, column 2 contains an identifier for every line, there are 5 columns I don't need, and then there are roughly 3 million columns of data.
File two has 12 columns; its second column contains the same identifiers in a different order, along with other IDs.
I want file one sorted so that its identifiers match the order in file two, with the rest of its columns carried along accordingly.
File one may be several gigabytes.
Is this easier with awk and/or other GNU tools, or should I use Perl?
Answer 0 (score: 3)
If file1 is on the order of gigabytes in size and you have 3 million columns of data, then you have very few rows (200 or fewer). While you can't load all the rows themselves into memory, you can easily load the position of every row.
use strict;
use warnings;
use feature qw( say );

use Fcntl qw( SEEK_SET );

# File names are assumed to arrive as command-line arguments.
my ($qfn1, $qfn2) = @ARGV;

open(my $fh1, '<', $qfn1) or die("Can't open \"$qfn1\": $!\n");
open(my $fh2, '<', $qfn2) or die("Can't open \"$qfn2\": $!\n");

# First pass: record the byte offset of every row of file1,
# keyed by the identifier in column 2.
my %offsets;
while (1) {
    my $offset = tell($fh1);
    my $row1 = <$fh1>;
    last if !defined($row1);
    chomp($row1);
    my @fields1 = split(/\t/, $row1);
    my $key = $fields1[1];
    $offsets{$key} = $offset;
}

# Second pass: walk file2 in order, seeking into file1 for each key.
while (my $row2 = <$fh2>) {
    chomp($row2);
    my @fields2 = split(/\t/, $row2);
    my $key = $fields2[1];
    my $offset = $offsets{$key};
    if (!defined($offset)) {
        warn("Key $key not found.\n");
        next;
    }

    seek($fh1, $offset, SEEK_SET);
    my $row1 = <$fh1>;
    chomp($row1);
    my @fields1 = split(/\t/, $row1);
    # Skip file1's first 6 columns; the rest follow file2's columns.
    say join "\t", @fields2, @fields1[6..$#fields1];
}
This approach could also be used in Python.
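Here is a minimal sketch of the same offset-index idea in Python (not from the answer above; the file names file1.tsv and file2.tsv and the column layout are assumptions). Binary mode is used so tell() and seek() operate on plain byte offsets:
import sys

# Hypothetical file names; the identifier is in column 2 of each file.
offsets = {}
with open('file1.tsv', 'rb') as f1:
    while True:
        offset = f1.tell()
        row = f1.readline()
        if not row:
            break
        # Split only far enough to reach the ID column.
        offsets[row.split(b'\t', 2)[1]] = offset

with open('file1.tsv', 'rb') as f1, open('file2.tsv', 'rb') as f2:
    for row2 in f2:
        row2 = row2.rstrip(b'\n')
        key = row2.split(b'\t', 2)[1]
        offset = offsets.get(key)
        if offset is None:
            sys.stderr.write('Key %s not found.\n' % key.decode())
            continue
        f1.seek(offset)
        # Drop file1's first 6 columns; keep the data columns verbatim.
        rest = f1.readline().rstrip(b'\n').split(b'\t', 6)[6]
        sys.stdout.buffer.write(row2 + b'\t' + rest + b'\n')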
Note: A much simpler solution exists if the ordering requirement is more flexible (i.e., if you are fine with the output being ordered the way the records are ordered in file1). It assumes file2 fits easily into memory.
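A sketch of that simpler variant, again with assumed file names and the ID in column 2: load file2 into a dict up front, then stream file1 once, emitting joined rows in file1's order.
# Hypothetical file names; IDs live in column 2 of both files.
rows2 = {}
with open('file2.tsv') as f2:
    for line in f2:
        line = line.rstrip('\n')
        rows2[line.split('\t', 2)[1]] = line

with open('file1.tsv') as f1:
    for line in f1:
        line = line.rstrip('\n')
        key = line.split('\t', 2)[1]
        if key in rows2:
            # file2's columns first, then file1's data columns (7 onward).
            print(rows2[key] + '\t' + line.split('\t', 6)[6])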
Answer 1 (score: 2)
The important thing is not to split more than necessary. If you have enough memory, loading the smaller file into a hash and then streaming through the second file should work.
Consider the following example (note that the script's runtime includes the time needed to create the sample data):
#!/usr/bin/env perl

use strict;
use warnings;

# This is a string containing 10 lines corresponding to your "file one".
# The second column has the record ID.
# Normally, you'd be reading this from a file.
my $big_file = join "\n",
    map join("\t", 'x', $_, ('x') x 3_000_000),
    1 .. 10
;

# This is a string containing 10 lines corresponding to your "file two".
# The second column has the record ID.
my $small_file = join "\n",
    map join("\t", 'y', $_, ('y') x 10),
    1 .. 10
;

# You would normally pass file names as arguments; scalar references
# are opened as in-memory files here.
join_with_big_file(
    \$small_file,
    \$big_file,
);

sub join_with_big_file {
    my $small_records = load_small_file(shift);
    my $big_file = shift;

    open my $fh, '<', $big_file
        or die "Cannot open '$big_file': $!";

    while (my $line = <$fh>) {
        chomp $line;
        my ($first, $id, $rest) = split /\t/, $line, 3;
        print join("\t", $first, $id, $rest, $small_records->{$id}), "\n";
    }

    return;
}

sub load_small_file {
    my $file = shift;
    my %records;

    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!";

    while (my $line = <$fh>) {
        chomp $line;  # without this, the stored record keeps its newline
        # limit the split so the data columns stay in one chunk
        my ($first, $id, $rest) = split /\t/, $line, 3;
        # I drop the id field here so it is not duplicated in the joined
        # file. If that is not a problem, $records{$id} = $line
        # would be better.
        $records{$id} = join("\t", $first, $rest);
    }

    return \%records;
}
Answer 2 (score: 1)
3 million columns of data, eh? It sounds like you're doing some NLP work.
Assuming this is true and your matrix is sparse, Python can handle it just fine (just not with pandas). Take a look at scipy.sparse. For example:
from scipy.sparse import dok_matrix

# Two sparse 10x10 matrices with a single nonzero entry each.
A = dok_matrix((10, 10))
A[1, 1] = 1
B = dok_matrix((10, 10))
B[2, 2] = 2
print(A + B)
DOK stands for "dictionary of keys"; it is typically used to build a sparse matrix incrementally, which is then converted to CSR or another format depending on use. See the available sparse matrix types.
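For instance, once the matrix is built, converting it to CSR (better suited to arithmetic and row slicing) is a single method call:
from scipy.sparse import dok_matrix

A = dok_matrix((10, 10))
A[1, 1] = 1
# Convert the build-friendly DOK format to CSR for efficient
# arithmetic and row slicing.
C = A.tocsr()
print(C.nnz)  # 1 stored nonzero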