我有一个具有以下结构的文本文件
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
每个ID
对应一个由操作员多次分析的数据集。即J1
和J2
是运营商J的第一次和第二次尝试。措施a
,b
,c
和d
略微使用4用于测量值的唯一算法,其值的真值位于列true
我想要做的是创建3个新文本文件,比较J1
与J2
,S1
vs S2
和J1
vs的结果S1
。 J1
与J2
的示例输出:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
其中a1
是a
的衡量标准J1
等等。
另一个例子是S1
vs S2
:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3
ID不是按字母数字顺序排列的,也不会为相同的ID对运算符进行聚类。我不确定如何最好地完成这项任务 - 使用linux工具或像perl / python这样的脚本语言。
我最初尝试使用linux快速打砖墙
首先找到所有唯一ID(已排序)
awk -F, '/^WCBP/ {print $1}' file | uniq | sort -k 1.5n > unique_ids
循环浏览这些ID并对J1
,J2
:
foreach i (`more unique_ids`)
grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
这给了我排序的数据
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,80.4
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,75.3
WCBP12239,J2,66.6,72.9,79.5,76.6,75.3
我不确定如何重新排列这些数据以获得所需的结构。我尝试在awk
循环foreach
awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'
添加其他管道
有什么想法吗?我确信这可以使用awk
以不那么麻烦的方式完成,尽管使用适当的脚本语言可能会更好。
答案 0 :(得分:4)
您可以使用Perl csv模块Text::CSV提取字段,然后将它们存储在哈希中,其中ID是主键,第二个字段是辅助键,所有字段都存储为值。然后,做任何你想要的比较都应该是微不足道的。如果要保留行的原始顺序,可以在第一个循环中使用数组。
use strict;
use warnings;
use Text::CSV;
my %data;
my $csv = Text::CSV->new({
binary => 1, # safety precaution
eol => $/, # important when using $csv->print()
});
while ( my $row = $csv->getline(*ARGV) ) {
my ($id, $J) = @$row; # first two fields
$data{$id}{$J} = $row; # store line
}
答案 1 :(得分:1)
Python方式:
import os,sys, re, itertools
info=["WCBP12236,J1,75.7,80.6,65.9,83.2,82.1",
"WCBP12236,J2,76.3,79.6,61.7,81.9,82.1",
"WCBP12236,S1,77.2,81.5,69.4,84.1,82.1",
"WCBP12236,S2,68.0,68.0,53.2,68.5,82.1",
"WCBP12234,J1,63.7,67.7,72.2,71.6,75.3",
"WCBP12234,J2,68.6,68.4,41.4,68.9,80.4",
"WCBP12234,S1,81.8,82.7,67.0,87.5,75.3",
"WCBP12234,S2,66.6,67.9,53.0,70.7,72.7",
"WCBP12238,J1,78.6,79.0,56.2,82.1,82.1",
"WCBP12239,J2,66.6,72.9,79.5,76.6,75.3",
"WCBP12239,S1,86.6,87.8,23.0,23.0,82.1",
"WCBP12239,S2,86.0,86.9,62.3,89.7,82.1",
"WCBP12239,J1,70.9,71.3,66.0,73.7,75.3",
"WCBP12238,J2,75.1,75.2,54.3,76.4,82.1",
"WCBP12238,S1,65.9,66.0,40.2,66.5,80.4",
"WCBP12238,S2,72.7,73.2,52.6,73.9,72.7" ]
def extract_data(operator_1, operator_2):
operator_index=1
id_index=0
data={}
result=[]
ret=[]
for line in info:
conv_list=line.split(",")
if len(conv_list) > operator_index and ((operator_1.strip().upper() == conv_list[operator_index].strip().upper()) or (operator_2.strip().upper() == conv_list[operator_index].strip().upper()) ):
if data.has_key(conv_list[id_index]):
iters = [iter(conv_list[int(operator_index)+1:]), iter(data[conv_list[id_index]])]
data[conv_list[id_index]]=list(it.next() for it in itertools.cycle(iters))
continue
data[conv_list[id_index]]=conv_list[int(operator_index)+1:]
return data
ret=extract_data("j1", "s2")
print ret
O / P:
{'WCBP12239':['70 .9','86 .0','71 .3','86 .9','66 .0','62 .3','73 .7','89 .7','75 .3','82 .1'], 'WCBP12238':['72 .7','78 .6','73 .2','79 .0','52 .6','56 .2','73 .9','82 .1','72 .7','82 .1'],'WCBP12234': ['66 .6','63 .7','67 .9','67 .7','53 .0','72 .2','70 .7','71 .6','72 .7','75 .3'],'WCBP12236':['68 .0' ,'75 .7','68 .0','80 .6','53 .2','65 .9','68 .5','83 .2','82 .1','82 .1']}
答案 2 :(得分:1)
我没有像TLP那样使用Text :: CSV。如果你需要,但是对于这个例子,我想因为字段中没有嵌入的逗号,我对','进行了简单的拆分。此外,列出了两个运算符的真实字段(而不仅仅是1),因为我认为最后一个值的特殊情况使解决方案复杂化。
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/ mesh /;
my %data;
while (<DATA>) {
chomp;
my ($id, $op, @vals) = split /,/;
$data{$id}{$op} = \@vals;
}
my @ops = ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]);
for my $id (sort keys %data) {
for my $comb (@ops) {
open my $fh, ">>", "@$comb.txt" or die $!;
my $a1 = $data{$id}{ $comb->[0] };
my $a2 = $data{$id}{ $comb->[1] };
print $fh join(",", $id, mesh(@$a1, @$a2)), "\n";
close $fh or die $!;
}
}
__DATA__
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12239,J1,78.6,79.0,56.2,82.1,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12238,J1,70.9,71.3,66.0,73.7,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
生成的输出文件位于
之下J1 J2.txt
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3,75.3
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1,82.1
WCBP12238,70.9,75.1,71.3,75.2,66.0,54.3,73.7,76.4,84.1,84.1
WCBP12239,78.6,66.6,79.0,72.9,56.2,79.5,82.1,76.6,82.1,82.1
S1 S2.txt
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3,75.3
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1,82.1
WCBP12238,65.9,72.7,66.0,73.2,40.2,52.6,66.5,73.9,84.1,84.1
WCBP12239,86.6,86.0,87.8,86.9,23.0,62.3,23.0,89.7,82.1,82.1
J1 S1.txt
WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3,75.3
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1,82.1
WCBP12238,70.9,65.9,71.3,66.0,66.0,40.2,73.7,66.5,84.1,84.1
WCBP12239,78.6,86.6,79.0,87.8,56.2,23.0,82.1,23.0,82.1,82.1
更新:要获得1个真值,for循环可以这样写:
for my $id (sort keys %data) {
for my $comb (@ops) {
local $" = '';
open my $fh, ">>", "@$comb.txt" or die $!;
my $a1 = $data{$id}{ $comb->[0] };
my $a2 = $data{$id}{ $comb->[1] };
pop @$a2;
my @mesh = grep defined, mesh(@$a1, @$a2);
print $fh join(",", $id, @mesh), "\n";
close $fh or die $!;
}
}
更新:为grep expr中的test添加了'defined'。因为它是正确的方式(而不仅仅是测试'$ _',它可能是0并且被grep错误地排除在列表中)。