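Hello, I would like to use AWK or Perl to produce an output file in the format below. My input file is a space-delimited text file. This is similar to an earlier question of mine, but in this case neither the input nor the output is formatted. My column positions may change, so I would appreciate a technique that does not refer to column numbers.
Input file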
id quantity colour shape size colour shape size colour shape size
1 10 blue square 10 red triangle 12 pink circle 20
2 12 yellow pentagon 3 orange rectangle 4 purple oval 6
Desired output
id colour shape size
1 blue square 10
1 red triangle 12
1 pink circle 20
2 yellow pentagon 3
2 orange rectangle 4
2 purple oval 6
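I am using this code by Dennis Williamson. The only problem is that the output I get has no space between the transposed fields. I need them separated by a space.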
#!/usr/bin/awk -f
BEGIN {
    col_list = "quantity colour shape"
    # Use a B ("blank") to add spaces in the output before or
    # after a format string (e.g. %6dB), but generally use the numeric argument

    # columns to be repeated on multiple lines may appear anywhere in
    # the input, but they will be output together at the beginning of the line
    repeat_fields["id"]
    # since these are individually set we won't use B
    repeat_fmt["id"] = "%-1s "
    # additional fields to repeat on each line
    ncols = split(col_list, cols)
    for (i = 1; i <= ncols; i++) {
        col_names[cols[i]]
        forms[cols[i]] = "%-1s"
    }
}

# save the positions of the columns using the header line
FNR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i in repeat_fields) {
            repeat[++nrepeats] = i
            repeat_look[i] = i
            rformats[i] = repeat_fmt[$i]
        }
        if ($i in col_names) {
            col_nums[++n] = i
            col_look[i] = i
            formats[i] = forms[$i]
        }
    }
    # print the header line
    for (i = 1; i <= nrepeats; i++) {
        f = rformats[repeat[i]]
        sub("d", "s", f)
        gsub("B", " ", f)
        printf f, $repeat[i]
    }
    for (i = 1; i <= ncols; i++) {
        f = formats[col_nums[i]]
        sub("d", "s", f)
        gsub("B", " ", f)
        printf f, $col_nums[i]
    }
    printf "\n"
    next
}

{
    for (i = 1; i <= NF; i++) {
        if (i in repeat_look) {
            f = rformats[i]
            gsub("B", " ", f)
            repeat_out = repeat_out sprintf(f, $i)
        }
        if (i in col_look) {
            f = formats[i]
            gsub("B", " ", f)
            out = out sprintf(f, $i)
            coln++
        }
        if (coln == ncols) {
            print repeat_out out
            out = ""
            coln = 0
        }
    }
    repeat_out = ""
}
Output
id quantitycolourshape
1 10bluesquare
1 redtrianglepink
2 circle12yellow
2 pentagonorangerectangle
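I apologize for not giving all the information about my actual file. I left it out only to keep things simple, but the example did not capture all of my requirements.
In my actual file, I want to transpose the fields n_cell and n_bsc for each NODE SITE CHILD: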
NODE SITE CHILD n_cell n_bsc
Answer 0 (score: 3)
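# Assumes each row is the id followed directly by groups of three fields;
# any extra leading column (such as quantity) would have to be dropped first.
<>;    # read and discard the header line
print("id colour shape size\n");
while (<>) {
    my @combined_fields = split;
    my $id = shift(@combined_fields);
    while (@combined_fields) {
        # take the next colour/shape/size group off the front of the row
        my @fields = ( $id, splice(@combined_fields, 0, 3) );
        print(join(' ', @fields), "\n");
    }
}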
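Answer 1 (score: 0)
You have told us that your real data has more than 5,000 columns and that its column positions may change, and I'm afraid that is not enough information.
So, lacking any proper information, I have written this. It uses the header line to work out the number and size of the data sets, which column is the id column, and at which column the first set starts.
It works fine with your sample data, but I can only guess whether it will work with your live file.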
use strict;
use warnings;

# Read the header line and count how often each column name appears.
my @headers = split ' ', <>;
my %headers;
$headers{$_}++ for @headers;

# Parentheses are needed here: == binds more tightly than //.
die "Expected exactly one 'id' column" unless ($headers{id} // 0) == 1;

my $id_index = 0;
$id_index++ while $headers[$id_index] ne 'id';

# Repeated names, kept in header order (keys %headers has no defined order).
my %seen;
my @labels = grep { $headers{$_} > 1 && !$seen{$_}++ } @headers;
my $set_size = @labels;
my $num_sets = $headers{$labels[0]};

my $start_index = 0;
$start_index++ while $headers[$start_index] ne $labels[0];

my @reformat;

while (<>) {
    my @fields = split;
    next unless @fields;
    my $id = $fields[$id_index];
    for (my $i = $start_index; $i < @fields; $i += $set_size) {
        push @reformat, [ $id, @fields[$i .. $i + $set_size - 1] ];
    }
}

unshift @labels, 'id';
print "@labels\n";
print "@$_\n" for @reformat;
Output
id colour shape size
1 blue square 10
1 red triangle 12
1 pink circle 20
2 yellow pentagon 3
2 orange rectangle 4
2 purple oval 6
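
Since the question also asks about AWK, here is a minimal sketch of the same header-driven idea in awk, wrapped in a shell function. It is not from the original thread: the function name reshape and the variables key and wanted are hypothetical. It assumes a single input stream whose first line is the header, and that every name listed in wanted appears the same number of times in that header.

```shell
# reshape: hypothetical helper name, not part of the original thread.
# Usage: reshape [file]   (reads standard input when no file is given)
reshape() {
    awk -v key="id" -v wanted="colour shape size" '
    BEGIN { nw = split(wanted, w, " ") }
    NR == 1 {
        # Header line: record the key column and, for each wanted name,
        # every column position at which that name occurs.
        for (i = 1; i <= NF; i++) {
            if ($i == key) keycol = i
            for (j = 1; j <= nw; j++)
                if ($i == w[j]) pos[j, ++cnt[j]] = i
        }
        print key, wanted
        next
    }
    NF {
        # Data line: print one output line per group, built from the
        # k-th occurrence of each wanted name, separated by single spaces.
        for (k = 1; k <= cnt[1]; k++) {
            line = $keycol
            for (j = 1; j <= nw; j++) line = line " " $(pos[j, k])
            print line
        }
    }' "$@"
}
```

Because the columns are located by name, unrelated columns such as quantity are skipped wherever they appear. For the real file, key and wanted would need to be changed to the actual column names.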