Question

I am trying to sort a file that looks like this:

MarkerName      Allele1 Allele2 Weight  Zscore  P-value Direction^Mrs217377     t       c       6806    1.121   0.2625  +++^Mrs4668077
  a       g       6806    -0.038  0.9696  --+^Mrs16855496 a       g       4106    -0.092  0.9268  ??-^Mrs217386   a       g       6806
    0.814   0.4158  +++^Mrs2075070  a       g       6806    -0.699  0.4844  --+^Mrs10187002 a       t       4106    0.099   0.9208  ??+^Mrs12785983 t       c       6806    -1.092  0.2747  +--^Mrs1100405  t       c       6806    -0.872  0.3831  +--^Mrs12155014 t       c
       6806    0.081   0.9358  ++-^Mrs2287619  t       c       6806    -2.221  0.02632 ---^M

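After the seventh whitespace-separated field there is a ^M character instead of a plain newline. I'm not sure how to deal with it, or whether I can ignore it. I am trying to sort the lines by P-value (the sixth column).

Like this: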

MarkerName      Allele1 Allele2 Weight  Zscore  P-value Direction
rs2287619       t       c       6806    -2.221  0.02632 ---
rs217377        t       c       6806    1.121   0.2625  +++
rs12785983      t       c       6806    -1.092  0.2747  +--
rs1100405       t       c       6806    -0.872  0.3831  +--
rs217386        a       g       6806    0.814   0.4158  +++
rs2075070       a       g       6806    -0.699  0.4844  --+
rs10187002      a       t       4106    0.099   0.9208  ??+
rs16855496      a       g       4106    -0.092  0.9268  ??-
rs4668077       a       g       6806    -0.038  0.9696  --+

So far I have this Perl code:

use strict; 
use warnings;

die "Please specify a suitable text file\n" if (@ARGV != 1);
my ($infile) = @ARGV;

# create outputfile
my $outfile = "MetaAnalysis_Sorted.txt";

# create filehandles
open (my $in, " < $infile") or die "error reading $infile. $!";
open (my $out, " >> $outfile") or die "error creating $outfile. $!";


my @array;

while ( <$in> ) {
    chomp;  # removes newline
    push @array, $_;
    my @sorted = sort { (split '\s', $a)[5] <=> (split '\s', $b)[5] } @array;
    print $out join( "\n", @sorted )."\n\n";
}

close $in;
close $out;

I tried converting the original file with dos2unix, but it didn't help.

Answer 1

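The main problem is that you split on '\s', which matches a single whitespace character, rather than on a pattern for a run of whitespace. You probably want one or more whitespace characters, i.e. /\s+/.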
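As a rough illustration of the difference, the sketch below splits one row three ways. It assumes the columns are separated by runs of spaces (the real file may use tabs instead), and the variable names are made up for the example.

```perl
use strict;
use warnings;

# A sample row with runs of spaces between columns (assumed layout).
my $line = "rs217377        t       c       6806    1.121   0.2625  +++";

my @single = split /\s/,  $line;   # single whitespace: empty fields between columns
my @multi  = split /\s+/, $line;   # one or more whitespace characters
my @awk    = split ' ',   $line;   # special case: like /\s+/ but skips leading whitespace

print scalar(@single), " fields with /\\s/\n";
print scalar(@multi),  " fields with /\\s+/\n";
print "sixth field with /\\s+/: $multi[5]\n";
```

With a single-character separator, index 5 lands on one of the empty fields rather than on the P-value, which is why the sort key comes out wrong.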

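Another problem is caused by the non-numeric string "P-value" being passed to the <=> operator. I'd suggest removing the header from the array before calling sort.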
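A small sketch of that point, using a made-up in-memory list of rows instead of the real file: the header's sixth field does not look like a number, so it is set aside before the numeric sort.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical in-memory copy of the file: header first, then data rows.
my @rows = (
    "MarkerName Allele1 Allele2 Weight Zscore P-value Direction",
    "rs217377 t c 6806 1.121 0.2625 +++",
    "rs2287619 t c 6806 -2.221 0.02632 ---",
);

# The header's sixth field is the string "P-value", which <=> cannot
# compare numerically without a warning.
my $header_field = (split ' ', $rows[0])[5];
print "header field '$header_field' is ",
    (looks_like_number($header_field) ? "numeric" : "not numeric"), "\n";

# Set the header aside, then sort only the data rows.
my $head   = shift @rows;
my @sorted = sort { (split ' ', $a)[5] <=> (split ' ', $b)[5] } @rows;

print "$_\n" for $head, @sorted;
```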

Writing to the output file should be done outside the while (<$in>) loop.

Also, I'd suggest skipping empty lines:

while (<$in>) {
    chomp;  #removes new line
    push @array, $_ if $_;
}

Here is a fixed version:

use strict; use warnings;

die "Please specify a suitable text file\n" if (@ARGV != 1);
my ($infile) = @ARGV;

#create outputfile
my $outfile = "MetaAnalysis_Sorted.txt";

#create filehandles
open (my $in, " < $infile") or die "error reading $infile. $!";
open (my $out, " >> $outfile") or die "error creating $outfile. $!";


my @array;
while (<$in>) {
    chomp;  #removes new line
    push @array, $_ if $_;
}


my $head = shift @array;
print $out "$head\n";

my @sorted = sort {
  (split /\s+/, $a)[5] <=> (split /\s+/, $b)[5];
} @array;
print $out join( "\n", @sorted )."\n\n";

close $in;
close $out;

Answer 2

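Many editors and text utilities use the sequence ^M to indicate Ctrl-M, or carriage return. It looks like your file has been saved with just a carriage return (CR) at the end of each line. That is unusual. Linux uses just a line feed (LF), while Windows uses the two characters CR LF. Only very old Macintosh systems used CR alone.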
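To check which convention a file actually uses, something like the following sketch could count each kind of line ending. The helper name `count_line_endings` and the sample string are invented for illustration; in practice the text would come from the file itself. A file with only lone CRs would presumably also explain why dos2unix, which looks for CR LF pairs in its default mode, appeared to do nothing.

```perl
use strict;
use warnings;

# Count each line-ending style in a string (hypothetical helper).
sub count_line_endings {
    my ($text) = @_;
    my %count = (CRLF => 0, CR => 0, LF => 0);
    # Try CR LF first so the pair is not counted as a lone CR plus a lone LF.
    while ($text =~ /(\r\n|\r|\n)/g) {
        my $key = $1 eq "\r\n" ? 'CRLF' : $1 eq "\r" ? 'CR' : 'LF';
        $count{$key}++;
    }
    return %count;
}

# Stand-in for file contents saved with CR-only line endings.
my $sample = "MarkerName\tP-value\rrs217377\t0.2625\rrs2287619\t0.02632\r";
my %seen   = count_line_endings($sample);
print "$_: $seen{$_}\n" for sort keys %seen;
```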

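The regex character sequence \R is very useful for this sort of file. It matches any of LF, CR LF, or CR. Unfortunately you can't set the input record separator to a regex pattern; it must be a literal string, so you have to read the whole file into a single string and then use split.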

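This program shows the idea, but it's hard to tell where any blank lines may appear in your sample data, which has also been wrapped artificially in the middle of records. As long as you provide the unmodified data file as input, this should work fine.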

sort_by_col6.pl

use strict;
use warnings 'all';

my @infile = do {
    local $/;
    split /\R/, <>;
};

local $\ = "\n";

print shift @infile;    # Print header line
print for sort { (split ' ', $a)[5] <=> (split ' ', $b)[5] } @infile;

You need to run it from the command prompt so that you can redirect the output:

$ perl sort_by_col6.pl my_input.txt > MetaAnalysis_Sorted.txt
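If the file is known to use CR alone, the literal-string restriction on the input record separator is not a problem, because $/ can simply be set to "\r". The sketch below shows that alternative next to \R, using an in-memory filehandle and made-up two-column data as stand-ins for the real file.

```perl
use strict;
use warnings;

# Stand-in for a CR-terminated file, opened as an in-memory filehandle.
my $data = "MarkerName\tP-value\rrs217377\t0.2625\rrs2287619\t0.02632\r";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @lines;
{
    local $/ = "\r";          # a literal string, not a regex
    while (my $line = <$fh>) {
        chomp $line;          # chomp removes the current value of $/
        push @lines, $line;
    }
}
close $fh;

# \R, by contrast, copes with mixed endings when splitting a whole string.
my @mixed = split /\R/, "a\r\nb\rc\nd";

print scalar(@lines), " lines read with \$/ = CR\n";
print scalar(@mixed), " pieces from the mixed-ending string\n";
```

The \R approach is the safer default when the line-ending style is uncertain; setting $/ avoids holding the whole file in one string but only works for a single known terminator.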
