Question

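The title is a bit long.

What I want is a way to pick certain lines out of a file. The lines should:

Match a pattern, e.g. not start with 'X' and contain the string 'CH'.
Fall into categories based on the first few characters after the 'CH' pattern, e.g. 1, 2, 3.
Be selected once per category: I need only one line for each (probably the first).

The file looks more like this (closer to what I actually need):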

    X1 CH<1>
      N_CH<1> \
    X2 CH<2>
      N_CH<2>xx1 \
    X3 CH<2>
      N_CH<2>xx2 \
    X4 CH<3>
      N_CH<2>xx3 \
      N_CH<3>xx4 \
    X5 CH<4>
      N_CH<3>xx5

Output:

N_CH<1> \
N_CH<2>xx1 \
N_CH<3>xx4 \

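Three categories: 1, 2, 3.

Note that the xx2 and xx3 lines for CH2 and the xx5 line for CH3 are ignored.

I tried sed, but I can only handle one category at a time, like this: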

sed -n '0,/CH2/ {/CH2/p}' file

and it cannot exclude lines that start with 'X'.

Thanks in advance.

Edit:

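There is no well-defined column or field to sort on. All I know is that the number after 'CH' sorts the lines into different categories, and I need only one line from each category.

I understand the 'sed' and 'sort' solutions, but the other, more sophisticated solutions may use advanced features, and I need to study further before I understand all the details of how they work. Thanks for all the answers, though!

So this line works: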

 sed -ne '/^[^X]/ {/N_CH/ p}' file | sort -t'>' -uk1,1
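For readers less familiar with sort's key options, the pipeline above can be mirrored step by step in Python. This is only a sketch, assuming that `sort -u` keeps the earliest input line for each key (as GNU sort does) and that the file has no extra indentation beyond what the sample shows; `SAMPLE` and `first_per_category` are names made up for this illustration.

```python
# Sample input from the question, with the display indentation removed.
SAMPLE = r"""X1 CH<1>
  N_CH<1> \
X2 CH<2>
  N_CH<2>xx1 \
X3 CH<2>
  N_CH<2>xx2 \
X4 CH<3>
  N_CH<2>xx3 \
  N_CH<3>xx4 \
X5 CH<4>
  N_CH<3>xx5
"""


def first_per_category(text):
    """Keep the first non-X line containing N_CH for each text-before-'>' key."""
    first = {}
    for line in text.splitlines():
        # Same filter as the sed step: first character is not X, line has N_CH.
        if not line or line.startswith("X") or "N_CH" not in line:
            continue
        key = line.split(">", 1)[0]  # field 1 when '>' is the delimiter
        first.setdefault(key, line)  # later lines with the same key are ignored
    # Like sort, emit the surviving lines ordered by key.
    return [first[key] for key in sorted(first)]


result = first_per_category(SAMPLE)
print("\n".join(result))
```

The key is everything before the first '>', so `N_CH<2>xx1`, `N_CH<2>xx2` and `N_CH<2>xx3` all collapse into one category.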

Answer 1

This might work for you:

sed '/^X/d' file | sort -uk1,1

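The sed command deletes the lines that start with X. The remaining lines are then sorted by the first field (-k1,1), and all but the first of each set of duplicates are removed (-u).

N.B. Duplicates are determined by the sort key, not by the whole line.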

Answer 2

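I would tackle it like this:

Iterate over your file, pattern matching the records you are looking for.
Insert the matches into a hash, keyed on category.
Print from the hash.

Something like this: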

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

my %categories;

#use 'magic' filehandle to read from either STDIN or 
#file specified on command line as arg. 
while (<>) {
    #skip lines where the first (non whitespace) character is an X. 
    next if m/^\s*X/;
    #capture two 'chunks' -the 'category id' and line - strips leading whitespace
    #only proceed if capture works. 
    if ( my ( $item, $category ) = m/(\w+CH(\d+).*)/ ) {
        #insert the captured "item" into the hash. 
        push( @{ $categories{$category} }, $item );
    }
}

#debugging
print Dumper \%categories;

foreach my $category ( sort keys %categories ) {
    #print first match in each category
    print $categories{$category}->[0], "\n";
    #could instead:
    #print join ( "\n", @{$categories{$category}}),"\n";
    #to print all
}

This prints the debugging output (comment out the Dumper line to remove it):

$VAR1 = {
          '3' => [
                   'N_CH3 xx4 \\',
                   'N_CH3 xx5 \\'
                 ],
          '1' => [
                   'N_CH1 \\'
                 ],
          '2' => [
                   'N_CH2 xx1 \\',
                   'N_CH2 xx2 \\',
                   'N_CH2 xx3 \\'
                 ]
        };

And the 'output':

N_CH1 \
N_CH2 xx1 \
N_CH3 xx4 \

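Which I think is the desired result?

Note: it is not entirely clear to me what 'category match' you want, so this captures all of them and groups them.

You could instead do: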

while (<>) {
    next if m/^\s*X/;
    #test and assign regex matches
    if ( my ( $item, $category ) = m/(\w+CH(\d+).*)/ ) {
         #add "item" to category ONLY if it isn't already defined. 
         # //= is defined-equals assignment. 
         $categories{$category} //= $item;
    }
}
#print categories in order. 
foreach my $category ( sort keys %categories ) {
    print $categories{$category}, "\n";
}
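One detail of the loops above is worth spelling out: Perl's `sort keys` compares the keys as strings, so a category such as 10 would be printed before 2. The Python sketch below illustrates the difference using the same regular expression. The sample data, including category 10, is hypothetical and chosen only to show the ordering; `first_by_category` is a name made up for this illustration.

```python
import re

# Hypothetical sample: category 10 is added purely to show the ordering issue.
SAMPLE = r"""X1 CH1
  N_CH1 \
X2 CH2
  N_CH2 xx1 \
X9 CH10
  N_CH10 xx9 \
X3 CH2
  N_CH2 xx2 \
"""

# Group 1 is the item (leading whitespace dropped), group 2 the category digits.
PATTERN = re.compile(r"(\w+CH(\d+).*)")


def first_by_category(text):
    """Map each numeric category to the first matching non-X line."""
    first = {}
    for line in text.splitlines():
        if line.lstrip().startswith("X"):
            continue  # skip lines whose first non-blank character is X
        match = PATTERN.search(line)
        if match:
            # Store the category as an int so it can be sorted numerically.
            first.setdefault(int(match.group(2)), match.group(1))
    return first


first = first_by_category(SAMPLE)
ordered = [first[c] for c in sorted(first)]   # numeric order: 1, 2, 10
as_strings = sorted(str(c) for c in first)    # string order: '1', '10', '2'
print("\n".join(ordered))
```

In the Perl version, the equivalent change would be a numeric comparison in the sort, e.g. `sort { $a <=> $b } keys %categories`.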

Answer 3

In Python, you can do this easily with dictionaries.

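import re

x = r"""X1 CH1
   N_CH1 \
X2 CH2
   N_CH2 xx1 \
X3 CH2
  N_CH2 xx2 \
X4 CH3
   N_CH2 xx3 \
  N_CH3 xx4 \
X5 CH4
   N_CH3 xx5 \
"""

# Group 1 is the whole line, group 2 the category key (e.g. "N_CH2").
first_by_category = {}
for line, key in re.findall(r"(^\s*([^X].*?CH\S+).*$)", x, flags=re.M):
    # setdefault stores a value only for a new key, so the first line wins
    first_by_category.setdefault(key, line.strip())
print(list(first_by_category.values()))

Output: ['N_CH1 \\', 'N_CH2 xx1 \\', 'N_CH3 xx4 \\']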

Answer 4

I would use something like this:

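#!/bin/bash

# Associative array mapping each category (the first word of a line) to its first line
declare -A categories

agg() {
  # Store the line only if this category has not been seen yet
  if [[ -z "${categories[$1]}" ]]; then
    categories[$1]="$*"
  fi
}

# Loop over the file, skipping lines that start with X and stripping leading whitespace.
while read -r line; do
  echo "Elem: $line"
  # Unquoted on purpose: word splitting makes the first word available as $1
  agg $line
done < <(grep -v '^X' test.so | sed 's/^\s\+//')

# Print out the array (associative arrays are not iterated in insertion order)
for k in "${!categories[@]}"; do
  echo "$k -> ${categories[$k]}"
done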

Answer 5

Here is a solution using awk:

#!/usr/bin/awk -f

/^[^X].*CH/ {
    split(substr($0, index($0, "CH")+2), a, " ");
    if (!(a[1] in lines)) {
        lines[a[1]]=$0
    }
}

END {
    for (k in lines){
        print lines[k]
    }
}

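The idea is to store, in an array, the first line found for each category, and then print all the stored lines at the end.

Output with your sample file: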

$ awk -f so.awk file 
  N_CH1 \
  N_CH2 xx1 \
  N_CH3 xx4 \

Answer 6

Just write it the way you described it:

$ awk '$1!~/^X/ && /CH/ && !seen[$1]++' file
  N_CH1 \
  N_CH2 xx1 \
  N_CH3 xx4 \

Note that, given the sample input you posted, you can get the same output with something simpler:

$ awk '/^ / && !seen[$1]++' file
  N_CH1 \
  N_CH2 xx1 \
  N_CH3 xx4 \

So if you really do need the first solution, you may want to put more thought into posting input that better represents your real data.
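The `!seen[$1]++` idiom used in both commands prints a line only the first time its first field appears. A rough Python equivalent of the first command, written as a streaming generator, might look like the sketch below; `first_seen` and `SAMPLE` are illustrative names, and the sample uses the plain `CH1` style of input shown in the answers.

```python
def first_seen(lines):
    """Yield non-X lines containing CH whose first field has not appeared yet."""
    seen = set()
    for line in lines:
        fields = line.split()
        # Mirrors: $1 !~ /^X/ && /CH/
        if not fields or fields[0].startswith("X") or "CH" not in line:
            continue
        # Mirrors: !seen[$1]++ (true only on the first occurrence of the key)
        if fields[0] not in seen:
            seen.add(fields[0])
            yield line


SAMPLE = [
    "X1 CH1",
    "  N_CH1 \\",
    "X2 CH2",
    "  N_CH2 xx1 \\",
    "X3 CH2",
    "  N_CH2 xx2 \\",
    "X4 CH3",
    "  N_CH2 xx3 \\",
    "  N_CH3 xx4 \\",
    "X5 CH4",
    "  N_CH3 xx5 \\",
]

result = list(first_seen(SAMPLE))
print("\n".join(result))
```

Because only the keys are remembered, not the lines, this works on input of any length and preserves the original line order.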

Find lines matching a pattern with a parameterized sub-pattern, but keep only the first match for each sub-pattern

6 answers: