In the terminal

Time: 2018-12-18 15:12:57

Tags: bash split terminal

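I have several text files (UTF-8) that I want to process in a shell script. They don't all have exactly the same format, but if I could only break them up into edible chunks I could handle the rest. This could be programmed in C or Python, but I'd prefer not to.

Edit: I wrote a solution in C; see my own answer. I think this may be the simplest approach after all. If you think I'm wrong, please test your solution against the more complicated example input in my answer below.

- jcxz100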

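For clarity (and to be able to debug more easily) I would like to save the chunks as separate text files in a subfolder.

All types of input files consist of:

  1. junk lines
  2. lines with junk text followed by a start bracket or parenthesis, i.e. '[', '{', '<' or '(', possibly followed by payload
  3. payload lines
  4. lines with brackets or parentheses nested inside the top-level pair; these are also treated as payload
  5. payload lines with an end bracket or parenthesis, i.e. ']', '}', '>' or ')', possibly followed by something (junk text and/or the start of a new payload)

I want to break up the input only according to the matching pairs of top-level brackets/parentheses. The payload inside these pairs must not be altered (including newlines and whitespace). Everything outside the top-level pairs should be discarded as junk.

Any junk or payload inside double quotes must be considered atomic (handled as raw text, so any brackets or parentheses inside should also be treated as text).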
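
The splitting rule above can be sketched in C as a single pass that tracks nesting depth and skips quoted text. This is only a minimal illustration, not a full solution: the function name `find_brace_chunk` is hypothetical, it assumes the input is already in memory as a NUL-terminated string, and it handles only `{}` pairs and double quotes (no backslash escapes).

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): locates the first
 * top-level {...} chunk in text. Double-quoted runs are skipped whole, so
 * braces inside them never change the nesting depth.
 * Returns 1 and sets *start / *len (braces included) on success,
 * 0 if no complete chunk is found. */
int find_brace_chunk(const char *text, size_t *start, size_t *len)
{
    int depth = 0;      /* current nesting level of {} */
    int in_quote = 0;   /* nonzero while inside "..." */
    size_t i;

    for (i = 0; text[i] != '\0'; i++) {
        char c = text[i];
        if (in_quote) {
            if (c == '"') in_quote = 0;   /* closing quote ends the atom */
        } else if (c == '"') {
            in_quote = 1;
        } else if (c == '{') {
            if (depth == 0) *start = i;   /* top-level chunk opens here */
            depth++;
        } else if (c == '}' && depth > 0) {
            depth--;
            if (depth == 0) {             /* matching top-level close */
                *len = i - *start + 1;
                return 1;
            }
        }
    }
    return 0;
}
```

Calling it repeatedly on the text that follows each returned chunk would yield the chunks one by one; a stray `}` in junk is ignored because the depth is already zero.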

Here is an example (using only {} pairs):

junk text
"atomic junk"

some junk text followed by a start bracket { here is the actual payload
   more payload
   "atomic payload"
   nested start bracket { - all of this line is untouchable payload too
      here is more payload
      "yet more atomic payload; this one's got a smiley ;-)"
   end of nested bracket pair } - all of this line is untouchable payload too
   this is payload too
} trailing junk
intermittent junk
{
   payload that goes in second output file    }
end junk

... sorry: some of the input files really are that messy.

The first output file should be:

{ here is the actual payload
   more payload
   "atomic payload"
   nested start bracket { - all of this line is untouchable payload too
      here is more payload
      "yet more atomic payload; this one's got a smiley ;-)"
   end of nested bracket pair } - all of this line is untouchable payload too
   this is payload too
}

... and the second output file:

{
   payload that goes in second output file    }

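Note:

  • I haven't decided whether the start/end character pair needs to be kept in the output, or whether the pair itself should be discarded as junk. I think a solution that keeps it is the more general one.

  • The types of top-level bracket/parenthesis pairs can be mixed within the same input file.

  • Beware: the input files contain * and $ characters, so please avoid confusing bash ;-)

  • I prefer readability over brevity, but not at the cost of an exponential slowdown.

Nice-to-haves:

  • There are backslash-escaped double quotes in the text; preferably these should be handled (I have a hack, but it isn't pretty).

  • The script shouldn't break on mismatched bracket/parenthesis pairs in junk and/or payload (note: inside atomics they must be allowed!).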
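
The first nice-to-have can be sketched as a small scanner that measures one quoted atom while treating a backslash as "the next character is literal". This is a minimal sketch under assumptions: the name `quoted_length` is hypothetical, and the caller is assumed to pass a pointer to the opening quote character.

```c
#include <stddef.h>

/* Hypothetical helper: `s` must point at an opening quote character.
 * Returns the number of characters in the quoted atom, including both
 * quotes. A backslash makes the next character literal, so \" does not
 * end the atom. If the atom is never closed, the count runs to the end
 * of the string. */
size_t quoted_length(const char *s)
{
    char quote = s[0];
    size_t i = 1;

    while (s[i] != '\0') {
        if (s[i] == '\\' && s[i + 1] != '\0') {
            i += 2;            /* skip the backslash and the escaped char */
        } else if (s[i] == quote) {
            return i + 1;      /* include the closing quote */
        } else {
            i++;
        }
    }
    return i;                  /* unterminated atom */
}
```

Because everything between the quotes is skipped without inspection, mismatched brackets inside an atom can never affect bracket matching. The closing character is whatever character sits at `s[0]`, so the same function would also measure a single-quoted atom.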

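More-far-out-nice-to-haves:

  • I haven't seen it yet, but one could speculate that some input might use single quotes rather than double quotes to denote atomic content... or even a mix of both.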

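  • It would be nice if the script could easily be modified to parse input of similar structure but with different start/end characters or strings.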
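
One way to keep the start/end characters easy to change is to hold them in a small lookup table rather than in a `switch`. The following is a hedged sketch only: the names `delim_pair`, `PAIRS` and `closing_for` are hypothetical, and it covers single-character delimiters, not multi-character strings.

```c
#include <stddef.h>

/* Hypothetical delimiter table: edit this array to support other pairs. */
struct delim_pair { char open; char close; };

static const struct delim_pair PAIRS[] = {
    { '{', '}' }, { '[', ']' }, { '(', ')' }, { '<', '>' }
};

/* Returns the closing character that matches `open`, or '\0' if `open`
 * is not a start delimiter in the table. */
char closing_for(char open)
{
    size_t i;
    for (i = 0; i < sizeof PAIRS / sizeof PAIRS[0]; i++) {
        if (PAIRS[i].open == open) return PAIRS[i].close;
    }
    return '\0';
}
```

A scanner would call `closing_for` on each character seen outside a chunk; a nonzero result marks the start of a chunk and tells it which end character to wait for.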

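I can see this is quite a mouthful, but I don't think breaking it down into simpler questions would lead to a robust solution.

The main problem is splitting the input correctly. Everything else can be ignored or "solved" with hacks, so feel free to ignore the nice-to-haves and the more-far-out-nice-to-haves.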

2 answers:

Answer 0: (score: 1)

Given:

$ cat file
junk text
"atomic junk"

some junk text followed by a start bracket { here is the actual payload
   more payload
   "atomic payload"
   nested start bracket { - all of this line is untouchable payload too
      here is more payload
      "yet more atomic payload; this one's got a smiley ;-)"
   end of nested bracket pair } - all of this line is untouchable payload too
   this is payload too
} trailing junk
intermittent junk
{
   payload that goes in second output file    }
end junk

This Perl script will extract the blocks you describe into files named block_1, block_2, etc.:

#!/usr/bin/perl
use v5.10;
use warnings;
use strict;

use Text::Balanced qw(extract_multiple extract_bracketed);

my $txt;

while (<>){$txt.=$_;}  # slurp the file

my @blocks = extract_multiple(
    $txt,
    [
        # Extract {...}
        sub { extract_bracketed($_[0], '{}') },
    ],
    # Return all the fields
    undef,
    # Throw out anything which does not match
    1
);
chdir "/tmp";
my $base="block_";
my $cnt=1;
for my $block (@blocks) {
    my $fn = "$base$cnt";
    say "writing $fn";
    open(my $fh, '>', $fn) or die "Could not open file '$fn' $!";
    print $fh "$block\n";
    close $fh;
    $cnt++;
}

Now the files:

$ cat block_1
{ here is the actual payload
   more payload
   "atomic payload"
   nested start bracket { - all of this line is untouchable payload too
      here is more payload
      "yet more atomic payload; this one's got a smiley ;-)"
   end of nested bracket pair } - all of this line is untouchable payload too
   this is payload too
}

$ cat block_2
{
   payload that goes in second output file    }

Using Text::Balanced is robust and probably the best solution.

You can also extract the blocks with a single Perl regex:

$ perl -0777 -nlE 'while (/(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {if ($1) {$cnt++; say "block $cnt:== start:\n$1\n== end";}}' file
block 1:== start:
{ here is the actual payload
   more payload
   "atomic payload"
   nested start bracket { - all of this line is untouchable payload too
      here is more payload
      "yet more atomic payload; this one's got a smiley ;-)"
   end of nested bracket pair } - all of this line is untouchable payload too
   this is payload too
}
== end
block 2:== start:
{
   payload that goes in second output file    }
== end

But that is much more fragile than using a proper parser such as Text::Balanced...

Answer 1: (score: 0)

I have a solution in C. It would seem there's too much complexity for this to be easily achieved in shell script. The program isn't overly complicated but nevertheless has more than 200 lines of code, which include error checking, some speed optimization, and other niceties.

Source file split-brackets-to-chunks.c:

#include <stdio.h>

/* Example code by jcxz100 - your problem if you use it! */

#define BUFF_IN_MAX 255
#define BUFF_IN_SIZE (BUFF_IN_MAX+1)

#define OUT_NAME_MAX 31
#define OUT_NAME_SIZE (OUT_NAME_MAX+1)

#define NO_CHAR '\0'

int main()
{
    char pcBuff[BUFF_IN_SIZE];
    size_t iReadActual;
    FILE *pFileIn, *pFileOut;
    int iNumberOfOutputFiles;
    char pszOutName[OUT_NAME_SIZE];
    char cLiteralChar, cAtomicChar, cChunkStartChar, cChunkEndChar;
    int iChunkNesting;
    char *pcOutputStart;
    size_t iOutputLen;

    pcBuff[BUFF_IN_MAX] = '\0';  /* ... just to be sure. */
    iReadActual = 0;
    pFileIn = pFileOut = NULL;
    iNumberOfOutputFiles = 0;
    pszOutName[OUT_NAME_MAX] = '\0';  /* ... just to be sure. */
    cLiteralChar = cAtomicChar = cChunkStartChar = cChunkEndChar = NO_CHAR;
    iChunkNesting = 0;
    pcOutputStart = (char*)pcBuff;
    iOutputLen = 0;

    if ((pFileIn = fopen("input-utf-8.txt", "r")) == NULL)
    {
        printf("What? Where?\n");
        return 1;
    }

    while ((iReadActual = fread(pcBuff, sizeof(char), BUFF_IN_MAX, pFileIn)) > 0)
    {
        char *pcPivot, *pcStop;

        pcBuff[iReadActual] = '\0'; /* ... just to be sure. */
        pcPivot = (char*)pcBuff;
        pcStop = (char*)pcBuff + iReadActual;

        while (pcPivot < pcStop)
        {
            if (cLiteralChar != NO_CHAR) /* Ignore this char? */
            {
                /* Yes, ignore this char. */

                if (cChunkStartChar != NO_CHAR)
                {
                    /* ... just write it out: */
                    fprintf(pFileOut, "%c", *pcPivot);
                }
                pcPivot++;
                cLiteralChar = NO_CHAR;

                /* End of "Yes, ignore this char." */
            }
            else if (cAtomicChar != NO_CHAR) /* Are we inside an atomic string? */
            {
                /* Yup; we are inside an atomic string. */

                int bBreakInnerWhile;
                bBreakInnerWhile = 0;

                pcOutputStart = pcPivot;
                while (bBreakInnerWhile == 0)
                {
                    if (*pcPivot == '\\') /* Treat next char as literal? */
                    {
                        cLiteralChar = '\\'; /* Yes. */
                        bBreakInnerWhile = 1;
                    }
                    else if (*pcPivot == cAtomicChar) /* End of atomic? */
                    {
                        cAtomicChar = NO_CHAR; /* Yes. */
                        bBreakInnerWhile = 1;
                    }
                    if (++pcPivot == pcStop) bBreakInnerWhile = 1;
                }
                if (cChunkStartChar != NO_CHAR)
                {
                    /* The atomic string is part of a chunk. */
                    iOutputLen = (size_t)(pcPivot-pcOutputStart);
                    fprintf(pFileOut, "%.*s", (int)iOutputLen, pcOutputStart); /* precision for %.*s must be an int */
                }

                /* End of "Yup; we are inside an atomic string." */
            }
            else if (cChunkStartChar == NO_CHAR) /* Are we inside a chunk? */
            {
                /* No, we are outside a chunk. */

                int bBreakInnerWhile;
                bBreakInnerWhile = 0;
                while (bBreakInnerWhile == 0)
                {
                    /* Detect start of anything interesting: */
                    switch (*pcPivot)
                    {
                        /* Start of atomic? */
                        case '"':
                        case '\'':
                            cAtomicChar = *pcPivot;
                            bBreakInnerWhile = 1;
                            break;

                        /* Start of chunk? */
                        case '{':
                            cChunkStartChar = *pcPivot;
                            cChunkEndChar = '}';
                            break;
                        case '[':
                            cChunkStartChar = *pcPivot;
                            cChunkEndChar = ']';
                            break;
                        case '(':
                            cChunkStartChar = *pcPivot;
                            cChunkEndChar = ')';
                            break;
                        case '<':
                            cChunkStartChar = *pcPivot;
                            cChunkEndChar = '>';
                            break;
                    }
                    if (cChunkStartChar != NO_CHAR)
                    {
                        iNumberOfOutputFiles++;
                        printf("Start '%c' '%c' chunk (file %04d.txt)\n", *pcPivot, cChunkEndChar, iNumberOfOutputFiles);
                        sprintf((char*)pszOutName, "output/%04d.txt", iNumberOfOutputFiles);
                        if ((pFileOut = fopen(pszOutName, "w")) == NULL)
                        {
                            printf("What? How?\n");
                            fclose(pFileIn);
                            return 2;
                        }
                        bBreakInnerWhile = 1;
                    }
                    else if (++pcPivot == pcStop)
                    {
                        bBreakInnerWhile = 1;
                    }
                }

                /* End of "No, we are outside a chunk." */
            }
            else
            {
                /* Yes, we are inside a chunk. */

                int bBreakInnerWhile;
                bBreakInnerWhile = 0;

                pcOutputStart = pcPivot;
                while (bBreakInnerWhile == 0)
                {
                    if (*pcPivot == cChunkStartChar)
                    {
                        /* Increase level of brackets/parentheses: */
                        iChunkNesting++;
                    }
                    else if (*pcPivot == cChunkEndChar)
                    {
                        /* Decrease level of brackets/parentheses: */
                        iChunkNesting--;
                        if (iChunkNesting == 0)
                        {
                            /* We are now outside chunk. */
                            bBreakInnerWhile = 1;
                        }
                    }
                    else
                    {
                        /* Detect atomic start: */
                        switch (*pcPivot)
                        {
                            case '"':
                            case '\'':
                                cAtomicChar = *pcPivot;
                                bBreakInnerWhile = 1;
                                break;
                        }
                    }
                    if (++pcPivot == pcStop) bBreakInnerWhile = 1;
                }
                iOutputLen = (size_t)(pcPivot-pcOutputStart);
                fprintf(pFileOut, "%.*s", (int)iOutputLen, pcOutputStart); /* precision for %.*s must be an int */
                if (iChunkNesting == 0)
                {
                    printf("File done.\n");
                    cChunkStartChar = cChunkEndChar = NO_CHAR;
                    fclose(pFileOut);
                    pFileOut = NULL;
                }

                /* End of "Yes, we are inside a chunk." */
            }
        }
    }
    if (cChunkStartChar != NO_CHAR)
    {
        printf("Chunk exceeds end-of-file. Exiting gracefully.\n");
        fclose(pFileOut);
        pFileOut = NULL;
    }

    if (iNumberOfOutputFiles == 0) printf("Nothing to do...\n");
    else printf("All done.\n");
    fclose(pFileIn);
    return 0;
}
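
The listing writes each slice of the read buffer by length rather than as a NUL-terminated string. As an alternative sketch of that one step, `fwrite` takes a `size_t` length directly and would also cope with embedded NUL bytes. The helper name `write_slice` is hypothetical and not part of the program above.

```c
#include <stdio.h>

/* Hypothetical helper: writes exactly `len` bytes starting at `start`.
 * Returns 1 if all bytes were written, 0 otherwise. */
int write_slice(FILE *out, const char *start, size_t len)
{
    if (len == 0) return 1;   /* nothing to write counts as success */
    return fwrite(start, 1, len, out) == len;
}
```

In the program it could stand in for the `fprintf(pFileOut, "%.*s", ...)` calls, with the return value checked to detect a failed write.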

I've solved the nice-to-haves and one of the more-far-out-nice-to-haves. To show this the input is a little more complex than the example in the question:

junk text
"atomic junk"

some junk text followed by a start bracket { here is the actual payload
   more payload
   'atomic payload { with start bracket that should be ignored'
   nested start bracket { - all of this line is untouchable payload too
      here is more payload
"this atomic has a literal double-quote \" inside"
      "yet more atomic payload; this one's got a smiley ;-) and a heart <3"
   end of nested bracket pair } - all of this line is untouchable payload too
   this is payload too
   "here's a totally unprovoked $ sign and an * asterisk"
} trailing junk
intermittent junk
<
   payload that goes in second output file } mismatched end bracket should be ignored     >
end junk

Resulting file output/0001.txt:

{ here is the actual payload
   more payload
   'atomic payload { with start bracket that should be ignored'
   nested start bracket { - all of this line is untouchable payload too
      here is more payload
"this atomic has a literal double-quote \" inside"
      "yet more atomic payload; this one's got a smiley ;-) and a heart <3"
   end of nested bracket pair } - all of this line is untouchable payload too
   this is payload too
   "here's a totally unprovoked $ sign and an * asterisk"
}

... and resulting file output/0002.txt:

<
   payload that goes in second output file } mismatched end bracket should be ignored     >

Thanks @dawg for your help :)