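Sort two files by header so that their fields are in matching order

Time: 2016-02-09 03:26:29

Tags: r bash perl awk

I have two files, each with 700 fields, where 699 of the 700 fields have matching headers. I want to reorder the fields so that they are in the same order in both files (which order does not matter). For example, given: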

File1:
FRUIT MSMC1 MSMC24 MSMC2 MSMC10
Apple 1 2 3 2
Pear 2 1 4 5

File2:
VEG MSMC24 MSMC1 MSMC2 MSMC10
Onion 2 1 3 2
Radish 0 3 9 3

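I would like both files to have, as their first field, the one field that differs between the two files, followed by the remaining fields in the same order in both files. One possible result would be: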

File1:
FRUIT MSMC1 MSMC2 MSMC10 MSMC24
Apple 1 3 2 2
Pear 2 4 5 1

File2:
VEG MSMC1 MSMC2 MSMC10 MSMC24
Onion 1 3 2 2
Radish 3 9 3 0

4 answers:

Answer 0 (score: 1)

Using data.table, this may help. First, read the files:

 library(data.table)
 dt1 <- fread("file1.csv")
 dt2 <- fread("file2.csv")

Then get the field names and the fields the two files have in common:

 ndt1 <- names(dt1)[-1]
 ndt2 <- names(dt2)[-1]
 common <- intersect(ndt1, ndt2)

Now you can apply the new column order:

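 # setcolorder() reorders columns by reference; setorder() would sort rows instead.
 # names(dt)[1] is the differing first column (FRUIT / VEG), which ndt1 and ndt2 exclude.
 setcolorder(dt1, c(names(dt1)[1], setdiff(ndt1, common), common))
 setcolorder(dt2, c(names(dt2)[1], setdiff(ndt2, common), common))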

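Answer 1 (score: 1)

A Perl solution that leaves the first file as is and writes out the second file with its columns arranged in the same order as the first file. It reads the two files supplied on the command line (following the script name).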

Update: added the map $_ // (), phrase to allow the second file to be a subset of the first, in answer to this comment: "How could these answers be modified if one file is a subset of the other (not all columns in file 1 are present in file 2)?" - theo4786

The script:

#!/usr/bin/perl
use strict;
use warnings;

# commandline: perl script_name.pl fruits.csv veg.csv

my (undef, @fruit_hdrs) = split ' ', <> and close ARGV;

my @veg_hdrs;

while (<>) {
    my ($name, @cols) = split;

    # only executes for the first line (header line) of second file
    @veg_hdrs = @cols unless @veg_hdrs;

    my %line;
    @line{ @veg_hdrs } = @cols;

    print join(" ", $name, map $_ // (), @line{ @fruit_hdrs } ), "\n";
}

Answer 2 (score: 0)

In Perl, the tool for this job is a hash slice.

You can access several values of a hash at once with @hash{@keys}.

Something like this:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

my @headers; 
my $type; 

my @rows; 

#iterate data - would do this with a normal 'open'
while ( <DATA> ) {
  #set headers if the leading word is all upper case 
  if ( m/^[A-Z]+\s/ ) { 
      #separate out type (VEG/FRUIT) from the other headings. 
      chomp ( ( $type, @headers ) = split ); 
      #print for debugging
      print Dumper \@headers;
  }
  else {
     #create a hash to store this row. 
     my %this_row;
     #split the row on whitespace, capturing name and ordered fields by header row. 
     ( my $name, @this_row{@headers} ) = split; 
     #insert name and type into the hash
     $this_row{name} = $name;
     $this_row{type} = $type;
     #print for debugging
     print Dumper \%this_row;
     #store it in @rows
     push ( @rows, \%this_row ); 
  }
}

#print output:
#header line
print join ("\t", "name", "type", @headers ),"\n";
#iterate rows, extract ordered by _last_ set of headers. 
foreach my $row ( @rows ) { 
    print join ( "\t", $row->{name}, $row->{type}, @{$row}{@headers} ),"\n";
}

__DATA__
FRUIT MSMC1 MSMC24 MSMC2 MSMC10
Apple 1 2 3 2
Pear 2 1 4 5
VEG MSMC24 MSMC1 MSMC2 MSMC10
Onion 2 1 3 2
Radish 0 3 9 3

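Note: I have used Data::Dumper for diagnostics. Those lines can be removed, but I have left them in because they illustrate what is going on. Likewise, the script reads from <DATA>; normally you would open a file handle, or just use while ( <> ) { to read STDIN or the files named on the command line.

The order of the output is based on the last header row "seen". You can of course sort it, or reorder it.

If you need to handle mismatched columns, this will complain about the missing columns (a missing key yields undef, which triggers an "uninitialized value" warning under use warnings). In that case we can bring in map to fill in any blanks, and use a hash of headers to make sure we capture all of them.

E.g.: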

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

my @headers; 
my %headers_combined; 
my $type; 

my @rows; 

#iterate data - would do this with a normal 'open'
while ( <DATA> ) {
  #set headers if the leading word is all upper case 
  if ( m/^[A-Z]+\s/ ) { 
      #separate out type (VEG/FRUIT) from the other headings. 
      chomp ( ( $type, @headers ) = split ); 
      #add to hash of headers, to preserve uniques
      $headers_combined{$_}++ for @headers; 
      #print for debugging
      print Dumper \@headers;
  }
  else {
     #create a hash to store this row. 
     my %this_row;
     #split the row on whitespace, capturing name and ordered fields by header row. 
     ( my $name, @this_row{@headers} ) = split; 
     #insert name and type into the hash
     $this_row{name} = $name;
     $this_row{type} = $type;
     #print for debugging
     print Dumper \%this_row;
     #store it in @rows
     push ( @rows, \%this_row ); 
  }
}

#print output:
#header line
#note - extract keys from hash, not the @headers array. 
#sort is needed to order them, because default is unordered. 
print join ("\t", "name", "type", sort keys %headers_combined ),"\n";
#iterate rows, extract ordered by _last_ set of headers. 
foreach my $row ( @rows ) { 
    print join ( "\t", $row->{name}, $row->{type}, map { $row->{$_} // '' } sort keys %headers_combined ),"\n";
}

__DATA__
FRUIT MSMC1 MSMC24 MSMC2 MSMC10 OTHER
Apple 1 2 3 2 x
Pear 2 1 4 5 y 
VEG MSMC24 MSMC1 MSMC2 MSMC10 NOTHING
Onion 2 1 3 2 p
Radish 0 3 9 3 z

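Here, map { $row->{$_} // '' } sort keys %headers_combined takes all the keys of the hash, returns them in sorted order, and then extracts the value for each key from the row, or supplies an empty string if that value is undefined (that is what // does).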
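For comparison, the same fill-with-blank idea can be sketched in POSIX sh and awk. This is only an illustration under stated assumptions, not part of the answer above: the function name merged_table is hypothetical, each input file is assumed to start with its own header line (detected with FNR == 1 rather than by the upper-case test used in the Perl script), fields are whitespace-separated, and header names are sorted bytewise via LC_ALL=C.

```shell
# merged_table FILE...: print one tab-separated table covering every file.
# Columns are the sorted union of all header names; a row gets an empty
# cell for any column that its own file does not have.
merged_table() {
    # Pass 1: collect the union of header names (fields 2..NF of each line 1).
    cols=$(awk 'FNR == 1 { for (i = 2; i <= NF; i++) print $i }' "$@" |
           LC_ALL=C sort -u | tr '\n' ' ')

    # Pass 2: re-read the files and emit each data row in union order.
    awk -v cols="$cols" '
    BEGIN {
        OFS = "\t"
        n = split(cols, want, " ")          # want[1..n] = output column names
        printf "name%stype", OFS
        for (i = 1; i <= n; i++) printf "%s%s", OFS, want[i]
        print ""
    }
    FNR == 1 {                              # header line of the current file
        for (k in pos) delete pos[k]        # forget the previous file layout
        type = $1                           # FRUIT, VEG, ...
        for (i = 2; i <= NF; i++) pos[$i] = i
        next
    }
    {
        printf "%s%s%s", $1, OFS, type
        for (i = 1; i <= n; i++)            # empty string when the column is absent
            printf "%s%s", OFS, ((want[i] in pos) ? $(pos[want[i]]) : "")
        print ""
    }' "$@"
}
```

It would be called as merged_table file1 file2. Reading the files twice keeps the awk program free of any sorting logic, at the cost of a second pass over the input.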

Answer 3 (score: 0)

This reorders the fields in file2 to match the order in file1:

$ cat tst.awk
FNR==1 {
    fileNr++
    for (i=2;i<=NF;i++) {
        name2nr[fileNr,$i] =  i
        nr2name[fileNr,i]  = $i
    }
}
fileNr==2 { 
    printf "%s", $1
    for (i=2;i<=NF;i++) {
        # column i of file1 is named nr2name[1,i]; print the file2 field with that name
        printf "%s%s", OFS, $(name2nr[2,nr2name[1,i]])
    }
    print ""
}

$ awk -f tst.awk file1 file2
VEG MSMC1 MSMC24 MSMC2 MSMC10
Onion 1 2 3 2
Radish 3 0 9 3

With GNU awk you can delete the fileNr++ line and use ARGIND in place of fileNr everywhere else.
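For the subset case raised in the comment under answer 1, the same lookup can be made tolerant of missing columns. The following is a minimal sketch in POSIX sh and awk, not the answerer's script: the function name reorder_like and the NA placeholder are assumptions chosen for illustration, and fields are assumed to be whitespace-separated with one header line per file.

```shell
# reorder_like REF TARGET: print TARGET with the columns after the first
# rearranged into the order of the header line of REF.
# Columns of REF that TARGET lacks are filled with the placeholder "NA"
# (an assumption of this sketch).
reorder_like() {
    awk '
    FNR == 1 { fileNr++ }                   # a new file starts at each header
    fileNr == 1 {
        if (FNR == 1) {
            n = NF
            for (i = 2; i <= NF; i++) want[i] = $i    # reference column order
        }
        next                                # nothing is printed for REF
    }
    FNR == 1 {
        for (i = 2; i <= NF; i++) pos[$i] = i         # name -> column in TARGET
    }
    {
        printf "%s", $1
        for (i = 2; i <= n; i++) {
            if (FNR == 1)            cell = want[i]   # header: keep the name
            else if (want[i] in pos) cell = $(pos[want[i]])
            else                     cell = "NA"
            printf "%s%s", OFS, cell
        }
        print ""
    }' "$1" "$2"
}
```

Called as reorder_like file1 file2 on the question's two files, it prints the same three lines as the sample run above. Looking columns up by name for each output position, rather than by position in file2, is what keeps the result correct for any permutation of the columns.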