Merging columns from other files into a single file

Time: 2015-09-28 09:18:11

Tags: perl python-3.x numpy pandas

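I have multiple files that I need to merge into a single file, adding the second column from each of the other files to the first file. My files look like this: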

                      Nur of input reads    |   33
                    Ave input read length   |   20
                              UNIQUE READS:
                                Uni  number |   25
                               Uni  reads % |   74.40%

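All the other files have the same format as above. I want to add the second column from each of the other files to the first file and write the result as a single file, like this: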

               sample_1  sample_2  .....    sample_n
     Number      340        570      490
    Average        201       201      201
   niquely number  27096     29788    39870
       %           79.60%    80.1%     70 %     

I tried this in Unix:

`paste file_1 file_2 ....file_n`

But the resulting file looks clumsy, and no header row with the file names is added. I would appreciate any solution in Perl or Python. Thanks.

2 answers:

Answer 0 (score: 1)

In Perl, perhaps something like this:

#!/usr/bin/perl
use strict;
use warnings;

my %data;
my @headers = ( "Number", "Average", "niquely number", "%" );

#iterate files called "sample_*.txt"
foreach my $filename ( glob "sample_*.txt" ) {
    #open them for reading
    open( my $input, '<', $filename ) or die $!;

    my %stuff;
    while (<$input>) {
        chomp; # strip trailing linefeeds
        #split on the first "|"; lines without one (e.g. "UNIQUE READS:") are skipped
        my ( $key, $value ) = split /\|/, $_, 2;
        next unless defined $value;
        #strip leading/trailing whitespace from both the key and the value
        s/^\s+|\s+$//g for $key, $value;

        #insert into hash
        $stuff{$key} = $value;
    }
    close($input);

    #insert into hash of hashes
    $data{$filename} = \%stuff;
}

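#the hash keys are the labels as they appear in the sample input, so map
#each output header to the label its value is stored under
my %label_for = (
    "Number"         => "Nur of input reads",
    "Average"        => "Ave input read length",
    "niquely number" => "Uni  number",
    "%"              => "Uni  reads %",
);

my @file_order = sort keys %data;
print join( "\t", "", @file_order ), "\n";
foreach my $header (@headers) {
    my $label = $label_for{$header};
    print join( "\t", $header, map { $_->{$label} // "" } @data{@file_order} ), "\n";
}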
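
For anyone who would rather stay in Python without pandas, the same idea (parse each file into a label-to-value mapping, then print one row per label) could be sketched roughly as below. This is only an illustration using the standard library: the names parse_report and merge_reports are made up here, the sample text is copied from the example files shown in this thread, and in-memory strings stand in for real files.

```python
def parse_report(text):
    """Return {label: value} for every 'label | value' line in text."""
    fields = {}
    for line in text.splitlines():
        # Lines without a "|" (such as "UNIQUE READS:") carry no value.
        if "|" not in line:
            continue
        label, value = line.split("|", 1)
        fields[label.strip()] = value.strip()
    return fields


def merge_reports(reports):
    """Build a tab-separated table from {sample_name: report_text}.

    Rows follow the label order of the first report; a label missing
    from a later report yields an empty cell.
    """
    names = list(reports)
    if not names:
        return ""
    parsed = [parse_report(reports[name]) for name in names]
    rows = ["\t".join([""] + names)]
    for label in parsed[0]:
        cells = [fields.get(label, "") for fields in parsed]
        rows.append("\t".join([label] + cells))
    return "\n".join(rows)


# Sample text copied from the two example files in this thread.
file1 = """\
Nur of input reads    |   330
    Ave input read length   |   201
          UNIQUE READS:
            Uni  number |   25
           Uni  reads % |   74.40%
"""
file2 = """\
Nur of input reads    |   30
    Ave input read length   |   201
          UNIQUE READS:
            Uni  number |  44
           Uni  reads % |   54.40%
"""

table = merge_reports({"sample_1": file1, "sample_2": file2})
print(table)
```

To use it on real files, read each file's contents into the dictionary under whatever column name you want, for example the file name.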

Answer 1 (score: 1)

Python pandas solution

The key is the function read_csv:

df1 = pd.read_csv(files, names=column, sep='|', header=None, usecols=[1])

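names is set to column (a one-element list built from the loop counter), the first row is not read as a header (header=None), and only the second column is read (usecols=[1]). The separator is '|'.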
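
To see what that call returns for a single file, here is a small sketch that feeds the first sample file to read_csv from memory. The io.StringIO wrapper and the variable name sample are only stand-ins for a real file on disk; the read_csv arguments are the ones described above.

```python
import io

import pandas as pd

# Sample text copied from the first example file in this answer.
sample = """\
Nur of input reads    |   330
                    Ave input read length   |   201
                              UNIQUE READS:
                                Uni  number |   25
                               Uni  reads % |   74.40%
"""

# io.StringIO stands in for a file on disk.
df1 = pd.read_csv(
    io.StringIO(sample),
    names=["sample_1"],  # name given to the single column that is kept
    sep="|",
    header=None,         # the first line is data, not a header
    usecols=[1],         # keep only the second field of each line
)
print(df1)

# The "UNIQUE READS:" line has no "|", so its second field is missing.
print(df1["sample_1"].isna().tolist())
```

Note that the values keep the spaces that follow the '|' in the file, so they may need stripping if you want to compare them or convert them to numbers.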

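The value in the third row is NaN, so it is removed by df1 = df1.dropna(). Then df1 is concatenated to df, and finally the index of the output df is set from the list idx.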

import pandas as pd
import glob

idx = ['Number', 'Average', 'niquely number', '%']
df = pd.DataFrame()
i = 0

for files in sorted(glob.glob('dir/*.txt')):  # sorted so the sample numbering is reproducible

    i = i + 1
    column = ['sample_' + str(i)]

    df1 = pd.read_csv(files, names=column, sep='|', header=None, usecols=[1])
    #print df1   
    #remove NaN value from df1
    df1 = df1.dropna()
    #concat df1 to df
    df = pd.concat([df, df1], axis=1)

#add column idx to df
df['idx'] = pd.Series(idx, index=df.index)
#set index from column idx
df = df.set_index('idx')
#remove index name
df.index.name = None

print(df)

Output:

                 sample_1   sample_2
Number                330         30
Average               201        201
niquely number         25         44
%                  74.40%     54.40%

file1.txt

                      Nur of input reads    |   330
                    Ave input read length   |   201
                              UNIQUE READS:
                                Uni  number |   25
                               Uni  reads % |   74.40%

file2.txt

Nur of input reads    |   30
                    Ave input read length   |   201
                  UNIQUE READS:
                                Uni  number |  44
                               Uni  reads % |   54.40%