Question

I have a file (possibly binary) that contains mostly non-printable ASCII characters, as shown in the octal dump output below.

od  -a MyFile.log 
0000000  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
0000020 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
0000040 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0000100 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000140 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
0000160 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000200 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
0000220 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
0000240 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
0000260 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx
0000300 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul

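I would like to do the following:

Parse or break the file into paragraph-like sections that start with the fields esc, fs, gs and us (ASCII codes 27, 28, 29 and 31).
Have the output file contain human-readable ASCII names, as in the octal dump.
Store the result in a file.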

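What is the best way to do this? I would prefer to use UNIX/Linux shell utilities such as grep for this task rather than a C program.

Thanks.

Edit: I used the octal dump command od -A n -a -v MyFile.log to strip the offsets from the output, as shown below: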

  cr  nl esc   a soh nul esc   * soh   L soh nul nul nul nul nul
 nul soh etx etx etx soh nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul soh etx etx etx nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul soh etx etx etx soh nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul soh etx etx etx etx
 etx soh etx etx etx etx etx etx etx soh etx etx etx etx etx etx
 etx soh etx etx etx etx etx soh soh soh soh soh nul nul nul nul
 nul nul nul nul nul nul nul nul nul nul nul nul nul nul etx etx

I would like to pipe or send this output to another utility, such as AWK.

Answer 1

If you have access to an awk that supports regular expressions in RS (gawk, for example), you can do:

awk 'BEGIN{ ORS = ""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" }
    { print | cmd; close( cmd ) }' MyFile.log > output

This dumps all the output into a single file. If you want each "paragraph" in a separate output file, you can do:

awk 'BEGIN{ ORS=""; RS = "\x1b|\x1c|\x1d|\x1f"; cmd = "od -a" }
    { out = cmd " > output" NR; print | out; close( out ) }' MyFile.log

to write to the files output1, output2, and so on.

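Note that the awk standard says the behavior is unspecified if RS contains more than one character, but many awk implementations support a regular expression like this one.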
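For comparison, the same splitting can be sketched in Python, which sidesteps the question of whether a given awk accepts a regex in RS. This is only an illustration, not part of the answer above: the name split_sections is made up for the example, it assumes Python 3.7 or later (where re.split accepts a zero-width pattern), and unlike the awk version, which consumes the separator, it keeps each separator byte at the start of its section.

```python
import re

# Zero-width lookahead for the separator bytes from the question:
# esc (27), fs (28), gs (29) and us (31).
SEPARATOR_AHEAD = re.compile(rb"(?=[\x1b\x1c\x1d\x1f])")


def split_sections(data: bytes) -> list:
    """Split data so that every separator byte starts a new section.

    Hypothetical helper for illustration. Empty pieces (for example
    when the data starts with a separator) are dropped.
    """
    return [part for part in SEPARATOR_AHEAD.split(data) if part]


if __name__ == "__main__":
    sample = b"\r\n\x1ba\x01\x00\x1b*\x01L"
    for section in split_sections(sample):
        print(section)
```

Each returned section could then be handed to od -a or to any other formatter.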

Answer 2

od -a -An -v file | perl -0777ne 's/\n//g,print "$_\n " for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs'

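od -a -An -v file → octal dump of the file with named characters (-a), no addresses (-An), and without suppressing repeated lines (-v).
-0777 → slurp the whole file (the record separator is the nonexistent character 0777).
-n → read the input with an implicit loop (here a single iteration over the whole file).
for /(?:esc| fs| gs| us)?(?:(?!esc| fs| gs| us).)*/gs → for each section (/g) that optionally starts with esc, fs, gs or us and contains a maximal sequence of characters (including newlines: /s) with no esc, fs, gs or us in it.
s/\n//g → remove the newlines that od inserted.
print "$_\n " → print the section followed by a newline (and a space, to match od's format).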
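The grouping that this regex performs on od's text output can also be sketched token by token. The following is a rough Python illustration rather than a translation of the one-liner: group_tokens is a hypothetical name, and it assumes the input is the text produced by od -a -An -v, where every byte becomes one whitespace-separated token (a space byte is shown as sp, so splitting on whitespace should not lose anything).

```python
# Token names that start a new section: esc, fs, gs and us.
SEPARATOR_NAMES = {"esc", "fs", "gs", "us"}


def group_tokens(od_text: str) -> list:
    """Group od -a tokens into sections, one string per section.

    Hypothetical helper: a new section begins at every separator name.
    """
    sections = []
    current = []
    for token in od_text.split():
        # Close the running section when the next separator shows up.
        if token in SEPARATOR_NAMES and current:
            sections.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        sections.append(" ".join(current))
    return sections


if __name__ == "__main__":
    import sys
    # Example: od -a -An -v MyFile.log | python group.py > output
    for line in group_tokens(sys.stdin.read()):
        print(line)
```

The column alignment of od is not preserved here; tokens are joined with a single space.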

Answer 3

I think the easiest thing to do would be a flex program:

/*
 * This file is part of flex.
 * 
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 
 * Neither the name of the University nor the names of its contributors
 * may be used to endorse or promote products derived from this software
 * without specific prior written permission.
 * 
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE.
 */

    /************************************************** 
        start of definitions section

    ***************************************************/

%{
/* A template scanner file to build "scanner.c". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
/*#include "parser.h" */

//put your variables here
char FileName[256];
FILE *outfile;
char inputName[256];


// flags for command line options
static int output_flag = 0;
static int help_flag = 0;

%}


%option 8bit 
%option nounput nomain noyywrap 
%option warn

%%
    /************************************************ 
        start of rules section

    *************************************************/


    /* these flex patterns will eat all input */ 
\x1B { fprintf(yyout, "\n\n"); }
\x1C { fprintf(yyout, "\n\n"); }
\x1D { fprintf(yyout, "\n\n"); }
\x1F { fprintf(yyout, "\n\n"); }
[[:alnum:]] { ECHO; }
.  { }
\n { ECHO; }


%%
    /**************************************************** 
        start of code section


    *****************************************************/

int main(int argc, char **argv)
{
    /****************************************************
        The main method drives the program. It gets the filename from the
        command line, and opens the initial files to write to. Then it calls the lexer.
        After the lexer returns, the main method finishes out the report file,
        closes all of the open files, and prints out to the command line to let the
        user know it is finished.
    ****************************************************/

    int c;

    // the gnu getopt library is used to parse the command line for flags
    // afterwards, the final option is assumed to be the input file

    while (1) {
        static struct option long_options[] = {
            /* These options set a flag. */
            {"help",   no_argument,     &help_flag, 1},
            /* These options don't set a flag. We distinguish them by their indices. */

            {"useStdOut", no_argument,       0, 'o'},
            {0, 0, 0, 0}
        };
           /* getopt_long stores the option index here. */
        int option_index = 0;
        c = getopt_long (argc, argv, "ho",
            long_options, &option_index);

        /* Detect the end of the options. */
        if (c == -1)
            break;

        switch (c) {
            case 0:
               /* If this option set a flag, do nothing else now. */
               if (long_options[option_index].flag != 0)
                 break;
               printf ("option %s", long_options[option_index].name);
               if (optarg)
                 printf (" with arg %s", optarg);
               printf ("\n");
               break;

            case 'h':
                help_flag = 1;
                break;

            case 'o':
               output_flag = 1;
               break;

            case '?':
               /* getopt_long already printed an error message. */
               break;

            default:
               abort ();
            }
    }

    if (help_flag == 1) {
        printf("proper syntax is: cleaner [OPTIONS]... INFILE OUTFILE\n");
        printf("Strips non printable chars from input, adds line breaks on esc fs gs and us\n\n");
        printf("Option list: \n");
        printf("-o                      sets output to stdout\n");
        printf("--help                  print help to screen\n");
        printf("\n");
        printf("If infile is left out, then stdin is used for input.\n");
        printf("If outfile is a filename, then that file is used.\n");
        printf("If there is no outfile, then infile-EDIT is used.\n");
        printf("There cannot be an outfile without an infile.\n");
        return 0;
    }

    //get the filename off the command line and redirect it to input
    //if there is no filename then use stdin


    if (optind < argc) {
        FILE *file;

        file = fopen(argv[optind], "rb");
        if (!file) {
            fprintf(stderr, "Flex could not open %s\n",argv[optind]);
            exit(1);
        }
        yyin = file;
        strcpy(inputName, argv[optind]);
    }
    else {
        fprintf(stderr, "no input file set, using stdin. Press ctrl-c to quit\n");
        yyin = stdin;
        strcpy(inputName, "\b\b\b\b\bagainst stdin");
    }

    //increment current place in argument list
    optind++;


    /********************************************
        if no input name, then output set to stdout
        if no output name then copy input name, drop its 4-character extension and add -EDIT
        otherwise use output name

    *********************************************/
    if (optind > argc) {
        yyout = stdout;
    }   
    else if (output_flag == 1) {
        yyout = stdout;
    }
    else if (optind < argc){
        outfile = fopen(argv[optind], "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",argv[optind]);
                exit(1);
            }
        yyout = outfile;
    }
    else {
        strncpy(FileName, argv[optind-1], strlen(argv[optind-1])-4);
        FileName[strlen(argv[optind-1])-4] = '\0';
        strcat(FileName, "-EDIT");
        outfile = fopen(FileName, "wb");
        if (!outfile) {
                fprintf(stderr, "Flex could not open %s\n",FileName);
                exit(1);
            }
        yyout = outfile;
    }

    yylex();
    if (output_flag == 0) {
        fclose(yyout);
    }
    fprintf(stderr, "Flex program finished running file %s\n", inputName);
    return 0;
}

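To compile for Windows or Linux, use a Linux box with flex and mingw. Then run this makefile in the same directory as the scanner.l file above (the recipe lines must be indented with a tab).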

TARGET = cleaner.exe
TESTBUILD = cleaner
LEX = flex
LFLAGS = -Cf
CC = i586-mingw32msvc-gcc
CFLAGS = -O -Wall 
INSTALLDIR = 

.PHONY: default all clean install uninstall cleanall

default: $(TARGET)

all: default install

OBJECTS = $(patsubst %.l, %.c, $(wildcard *.l))

%.c: %.l
    $(LEX) $(LFLAGS) -o $@ $<

.PRECIOUS: $(TARGET) $(OBJECTS)

$(TARGET): $(OBJECTS)
    $(CC) $(OBJECTS) $(CFLAGS) -o $@

linux: $(OBJECTS)
    gcc $(OBJECTS) $(CFLAGS) -o $(TESTBUILD)

cleanall: clean uninstall

clean:
    -rm -f *.c
    -rm -f $(TARGET)
    -rm -f $(TESTBUILD)

uninstall:
    -rm -f $(INSTALLDIR)/$(TARGET)

install:
    cp -f $(TARGET) $(INSTALLDIR)

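After compiling it and placing it on your PATH, simply use od -A n -a -v MyFile.log | cleaner. Note that the separator rules match the raw bytes \x1B, \x1C, \x1D and \x1F, so they only fire when the scanner reads the binary file itself, for example cleaner -o MyFile.log.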

Answer 4

I wrote a simple program, main.c:

#include <stdio.h>

char *human_ch[] = { "NILL", "EOL" };
char code_buf[3];

// you can implement whatever you want for conversion to human-readable format
const char *human_readable(int ch_code)
{
    switch (ch_code) {
    case 0:
        return human_ch[0];
    case '\n':
        return human_ch[1];
    default:
        sprintf(code_buf, "%02x", (0xFF & ch_code));
        return code_buf;
    }
}

int main(int argc, char **argv)
{
    int ch = 0;
    FILE *ofile;

    if (argc < 2)
        return -1;
    ofile = fopen(argv[1], "w+");
    if (!ofile)
        return -1;
    while (EOF != (ch = fgetc(stdin))) {
        fprintf(ofile, "%s", human_readable(ch));
        switch (ch) {
        case 27:
        case 28:
        case 29:
        case 31:
            fputc('\n', ofile); // paragraph separator
            break;
        default:
            fputc(' ', ofile);  // character separator
            break;
        }
    }
    fclose(ofile);
    return 0;
}

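The program reads stdin byte by byte and uses the human_readable() function to convert each byte to a user-specified value. In my example I have implemented just the EOL and NILL values; in all other cases the program writes the character's hex code to the output file.
Compilation: gcc main.c
Usage: ./a.out outfile <infile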
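The conversion function is the part meant to be customized. As one possible variation, here is a small Python sketch of a lookup that uses the same control-character names as od -a instead of NILL and EOL. It is an assumption-laden illustration, not the program above: the names NAMES and human_readable are chosen for this example, and bytes above 127 fall back to two-digit hex as in the C program, whereas od -a itself ignores the high-order bit.

```python
# Names for bytes 0..32 as printed by od -a (nul ... us, then sp for space).
NAMES = (
    "nul soh stx etx eot enq ack bel bs ht nl vt ff cr so si "
    "dle dc1 dc2 dc3 dc4 nak syn etb can em sub esc fs gs rs us sp"
).split()


def human_readable(byte: int) -> str:
    """Return a printable name for one byte value (0..255)."""
    if byte < len(NAMES):
        return NAMES[byte]        # control characters and space
    if byte == 127:
        return "del"
    if byte < 127:
        return chr(byte)          # printable ASCII is shown as itself
    return "%02x" % byte          # assumption: hex fallback for high bytes


if __name__ == "__main__":
    print(" ".join(human_readable(b) for b in b"\r\n\x1ba\x01\x00"))
```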

Answer 5

Here is a small program that does what you want (at least the splitting bit):

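#!/usr/bin/env python3

import os
import sys

def main():
    if len(sys.argv) < 3:
        sys.exit('usage: split.py FILE CODES')

    name = sys.argv[1]
    # Separator bytes; os.fsencode recovers the raw bytes given on the command line.
    codes = os.fsencode(sys.argv[2])

    p = '%s.out.%%.4d' % name
    i = 1

    # Binary mode, so that control bytes are read and written unchanged.
    fIn = open(name, 'rb')
    fOut = open(p % i, 'wb')

    c = fIn.read(1)
    while c != b'':
        fOut.write(c)
        c = fIn.read(1)

        # Start a new output file whenever the next byte is a separator.
        if c != b'' and c in codes:
            fOut.close()
            i = i + 1
            fOut = open(p % i, 'wb')

    fOut.close()
    fIn.close()

if __name__ == '__main__':
    main()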

Usage:

python split.py file codes

For example, on the bash command line:

python split.py input.txt $'\x1B'$'\x1C'

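This will produce the files input.txt.out.0001, input.txt.out.0002, ... after splitting input.txt on any of the specified codes (in this example, 27 and 28).

You can then iterate over those files and convert them to a printable format by passing them through od.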

for f in input.txt.out.[0-9][0-9][0-9][0-9]; do od "$f" > "$f.od"; done

Parsing a file containing non-printable ASCII characters
