Question

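I need help with sed / awk / grep / whatever can solve my task. I have a large file from which I need to extract multiple blocks of consecutive lines.

I have a start pattern: <DN>

and an end pattern: </GR>

with several lines in between, like this: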

<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>

<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>

I tried this:

sed -n '/\<DN\>/,/\<\/GR\>/p'

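and several others (using awk and sed). It works fine, but the problem is that the source file may contain a block that starts with <DN> and never reaches a closing </GR>, followed later by another block that starts on its own line and ends properly: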

<DN>234</DN> - unneeded DN
<AB>sdfsd</AB>
<DC>456456</DC>
<EF>6575675 sdfsd</EF>
....really large piece of unwanted text here....

<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>

<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>
<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>

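How can I extract only the needed lines and ignore the junk fragments that start with <DN> but do not end with </GR>?

Next, I need to convert each multiline block from <DN> to </GR> into a file of single lines, each starting with <DN> and ending with </GR>. Any help would be appreciated. I'm stuck.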

Answer 1

This might work for you (GNU sed):

sed -n '/<DN>/{h;b};x;/./G;x;/<\/GR/{x;/./p;z;x}' file

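This uses the hold space to accumulate the lines from <DN> to </GR>. Each new <DN> line overwrites whatever was stored, so a block that never reaches </GR> is dropped.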
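For the second part of the question (one output line per block), here is a minimal sketch, assuming GNU sed. It is a variant of the hold-space approach, not the answerer's exact command: it matches the full `</GR>` tag and replaces the embedded newlines with spaces just before printing. The function name `join_blocks` is hypothetical.

```shell
# Hypothetical helper: prints each <DN>...</GR> block as a single line.
# Assumes GNU sed (the z command and the one-line brace syntax are GNU-specific).
join_blocks() {
  sed -n '/<DN>/{h;b};x;/./G;x;/<\/GR>/{x;/./{s/\n/ /g;p};z;x}' "$@"
}

# Usage: join_blocks file
#    or: some_command | join_blocks
```

A blank line inside a block shows up as a doubled space in the joined output, since every newline becomes one space.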

Answer 2

awk '
# Lines that start with <DN> start our matching.
/^<DN>/ {
    # If we saw a start without a matching end, discard everything saved so far.
    if (dn) {
        d=""
    }
    # Mark being in a <DN> element.
    dn=1
    # Save the current line.
    d=$0
    next
}

# Lines that end with </GR> end our matching (but only if we are currently in a match).
dn && /<\/GR>$/ {
    # We are no longer in a <DN> element.
    dn=0
    # Print the saved lines and the current line as one output line.
    printf "%s%s%s\n", d, OFS, $0
    # Reset our saved contents.
    d=""
    next
}

# If we are in a <DN> element and have saved contents, append the current line to the contents (separated by OFS).
dn && d {
    d=d OFS $0
}
' file

Answer 3

awk '
  /^<DN>/ { n = 1 }

  n { lines[n++] = $0 }

  n && /<\/GR>$/ {
    for (i=1; i<n; i++) printf "%s", lines[i]
    print ""
    n = 0
  }
' file
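The program above buffers lines from the most recent `<DN>` and prints them, joined with no separator, when a line ending in `</GR>` arrives. As a rough sketch of a variant (not the answerer's code), the version below joins with a single space and skips blank lines. The function name `extract_blocks` and the sample data are made up for illustration.

```shell
# Hypothetical helper: one output line per complete <DN>...</GR> block.
extract_blocks() {
  awk '
    /^<DN>/ { n = 1 }               # a new <DN> restarts the buffer
    n && NF { lines[n++] = $0 }     # buffer non-blank lines while inside a block
    n && /<\/GR>$/ {
      for (i = 1; i < n; i++) printf "%s%s", (i > 1 ? " " : ""), lines[i]
      print ""
      n = 0                         # ignore everything until the next <DN>
    }
  ' "$@"
}

# Sample input: an unterminated block, a complete block, then stray lines.
printf '%s\n' \
  '<DN>1</DN> junk' '<AB>x</AB>' '' \
  '<DN>2</DN>' '<DD>y</DD>' '' '<GR>z</GR>' \
  '<RAC>q</RAC>' '<GR>w</GR>' | extract_blocks
# prints: <DN>2</DN> <DD>y</DD> <GR>z</GR>
```

The trailing `<RAC>`/`<GR>` pair is ignored because no `<DN>` precedes it.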

Answer 4

Using bash:

fun () 
{ 
    local line output;
    while IFS= read -r line; do
        if [[ $line =~ ^'<DN>' ]]; then
            output=$line;
        else
            if [[ -n $output ]]; then
                output=$output$'\n'$line;
                if [[ $line =~ '</GR>'$ ]]; then
                    echo "$output";
                    output=;
                fi;
            fi;
        fi;
    done
}

fun <file

Answer 5

You can use the pcregrep tool.

$ pcregrep -o -M '(?s)(?<=^|\s)<DN>(?:(?!<DN>).)*?</GR>(?=\n|$)' file
<DN>234</DN>
<DD>sdfsd</DD>
<BR>456456</BR>
<COL>6575675 sdfsd</COL>

<RAC>456464</RAC>
<GR>sdfsdfsFFFDd</GR>

Print several lines between patterns (first pattern is not unique)
