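I have multiple files and need to merge them into a single file, adding the second column from each of the other files to the first file. My files look like this: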
Nur of input reads | 33
Ave input read length | 20
UNIQUE READS:
Uni number | 25
Uni reads % | 74.40%
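All the other files have the same format as above. I want to add the second column from every other file to the first file and produce a single file like this: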
sample_1 sample_2 ..... sample_n
Number 340 570 490
Average 201 201 201
niquely number 27096 29788 39870
% 79.60% 80.1% 70 %
I tried this in Unix:
`paste file_1 file_2 ....file_n`
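But the resulting file looks clumsy, and there is no header row with the file names. Any solution in Perl or Python would be appreciated. Thanks.
Answer 0 (score: 1)
In Perl, perhaps something like this: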
#!/usr/bin/perl
use strict;
use warnings;

my %data;

# Output row labels, paired with the keys as they appear in the input files.
my @rows = (
    [ "Number",         "Nur of input reads" ],
    [ "Average",        "Ave input read length" ],
    [ "niquely number", "Uni number" ],
    [ "%",              "Uni reads %" ],
);

# Iterate over the files called "sample_*.txt".
foreach my $filename ( glob "sample_*.txt" ) {
    # Open each one for reading.
    open( my $input, '<', $filename ) or die "Cannot open $filename: $!";
    my %stuff;
    while (<$input>) {
        chomp;    # strip the trailing newline
        # Split on "|"; skip lines without a value, such as "UNIQUE READS:".
        my ( $key, $value ) = split /\|/, $_, 2;
        next unless defined $value;
        # Strip leading/trailing whitespace from both the key and the value.
        s/^\s+|\s+$//g for $key, $value;
        $stuff{$key} = $value;
    }
    close($input);
    # Insert into the hash of hashes.
    $data{$filename} = \%stuff;
}

my @file_order = sort keys %data;
print join( "\t", "", @file_order ), "\n";
foreach my $row (@rows) {
    my ( $label, $key ) = @$row;
    print join( "\t", $label, map { $_->{$key} // "" } @data{@file_order} ), "\n";
}
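For comparison, the same dictionary-of-dictionaries idea can be sketched in Python using only the standard library. This is a minimal sketch, not part of the answer above: the names `ROWS`, `parse_file` and `merge_files` are hypothetical, the label-to-key pairs are taken from the sample files in the question, and the input files are assumed to match `sample_*.txt`.

```python
import glob
import os

# Output label -> key as it appears in the input files (from the samples).
ROWS = [
    ("Number", "Nur of input reads"),
    ("Average", "Ave input read length"),
    ("niquely number", "Uni number"),
    ("%", "Uni reads %"),
]


def parse_file(path):
    """Return {key: value} for every 'key | value' line in one file."""
    values = {}
    with open(path) as handle:
        for line in handle:
            key, sep, value = line.partition("|")
            if sep:  # skip lines without '|', such as 'UNIQUE READS:'
                values[key.strip()] = value.strip()
    return values


def merge_files(pattern):
    """Build the merged table as a list of rows (lists of strings)."""
    paths = sorted(glob.glob(pattern))
    data = {path: parse_file(path) for path in paths}
    # Header row: an empty corner cell, then one column per file name.
    table = [[""] + [os.path.basename(p) for p in paths]]
    for label, key in ROWS:
        table.append([label] + [data[p].get(key, "") for p in paths])
    return table


if __name__ == "__main__":
    # Assumption: the input files sit in the current directory.
    for row in merge_files("sample_*.txt"):
        print("\t".join(row))
```

Returning the table as a list of rows keeps parsing separate from printing, so the same rows could be written with the `csv` module instead if a different delimiter is wanted.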
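Answer 1 (score: 1)
Python pandas solution

The key is the function `read_csv`:

df1 = pd.read_csv(files, names=column, sep='|', header=None, usecols=[1])

It sets `names` to `column` (a one-element list built from the loop counter), does not treat the first row as a header (`header=None`), and reads only the second column (`usecols=[1]`). The separator is `'|'`.

The third row (`UNIQUE READS:`) has no second field, so its value is `NaN` and it is removed by `df1 = df1.dropna()`.

Then `df1` is appended to `df`, and finally the index of the output `df` is set from the list `idx`.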
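To make the single-file step concrete, here is a small sketch that parses the first sample from an in-memory string. `io.StringIO` stands in for a real file, and `skipinitialspace=True` is an addition that the answer does not use: it is assumed here so the values come back without the space that follows `|`.

```python
import io

import pandas as pd

# Contents of one input file, held in memory instead of on disk.
sample = """Nur of input reads | 330
Ave input read length | 201
UNIQUE READS:
Uni number | 25
Uni reads % | 74.40%
"""

df1 = pd.read_csv(
    io.StringIO(sample),
    names=["sample_1"],     # name for the single column that is kept
    sep="|",
    header=None,            # the first line is data, not a header
    usecols=[1],            # keep only the second column
    skipinitialspace=True,  # assumption: drop the space after '|'
)

# The 'UNIQUE READS:' line has no second field, so its value is NaN.
print(df1["sample_1"].isna().tolist())

# Dropping it leaves four values; the row labels of the kept lines stay as they were.
df1 = df1.dropna()
print(df1)
```

Because the column contains `74.40%`, pandas keeps every value as a string, which is why the merged table shows the numbers exactly as they were written in the files.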
import pandas as pd
import glob

idx = ['Number', 'Average', 'niquely number', '%']
df = pd.DataFrame()
i = 0
# sorted() keeps the sample numbering reproducible; glob order is arbitrary
for files in sorted(glob.glob('dir/*.txt')):
    i = i + 1
    column = ['sample_' + str(i)]
    df1 = pd.read_csv(files, names=column, sep='|', header=None, usecols=[1])
    # print(df1)
    # remove the NaN value (the "UNIQUE READS:" row) from df1
    df1 = df1.dropna()
    # concat df1 to df
    df = pd.concat([df, df1], axis=1)
# add column idx to df
df['idx'] = pd.Series(idx, index=df.index)
# set index from column idx
df = df.set_index('idx')
# remove index name
df.index.name = None
print(df)
Output:
sample_1 sample_2
Number 330 30
Average 201 201
niquely number 25 44
% 74.40% 54.40%
file 1.txt
Nur of input reads | 330
Ave input read length | 201
UNIQUE READS:
Uni number | 25
Uni reads % | 74.40%
file 2.txt
Nur of input reads | 30
Ave input read length | 201
UNIQUE READS:
Uni number | 44
Uni reads % | 54.40%