Parsing a text file (unsolved problem)

Time: 2018-07-22 16:52:42

Tags: c parsing while-loop strtok

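This is a problem I thought I had solved, but apparently I still have a few small bugs here and there. The code below is what I use to parse a text file written in a specific language I developed for a prototype microcontroller. Basically, whenever I reach a semicolon, I treat any text after it as a comment and ignore it: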

    //Get characters from .j FILE
    while (fgets(line, 1000, IN) != NULL)
    {
        //Get each line of .j file


        //Compute length of each line
        len = strlen(line);

        //If the line is non-empty and ends with a newline
        if (len > 0 && line[len-1] == '\n')
        {
            //Replace with null
            line[len-1] = '\0';
        }

        //Search for semicolons in .J FILE
        semi_token = strpbrk(line, ";\r\n\t");

        //Replace with null terminator
        if (semi_token) 
        {
            *semi_token = '\0';
        }
        printf("line is %s\n",line );

        //Copy each line
        assign = line;

        // printf("line is %s\n",line );

        // len = strlen(line);

        // printf("line length is %d\n",len );

        // parse_tok = strtok(line, "\r ");

    }   

The code above is the while loop that reads each line from the text file. If I have a file in the following format, everything works fine:

;;
;; Basic
;;

defun test arg3 arg2 arg1 min return 
;defun love arg2 arg1 * return
;defun func_1 6 6 eq return
;defun func_2 20 100 / return

defun main
0 -200 55 test printnum endl
;1 2 3 test printnum endl
;38 23 8 test printnum endl
;5 6 7 love printnum endl
;love printnum endl
;func_1 printnum endl
;func_2 printnum endl
return

Observe the output:

line is 
line is 
line is 
line is 
line is defun test arg3 arg2 arg1 min return 
line is 
line is 
line is 
line is 
line is defun main
line is 0 -200 55 test printnum endl
line is 
line is 
line is 
line is 
line is 
line is 
line is return

The problem appears when my text file contains tabs, as it does with nested statements:

;;
;; program to test nested ifs
;;

defun testIfs ;; called with one parameter n

arg1           ; get n to the top of the stack

dup 16 gt
if   ; 16 > n

    dup 8 gt
    if  ; 8 > n

        dup 4 gt
    if  ; 4 > n
        0
    else        ; 4 <= n
        1
    endif

    else        ; 8 <= n
       2
    endif

else        ; 16 <= n

     dup 24 gt
     if ; 24 > n

        dup 20 gt
    if  ; 20 > n
           3
        else        ; 20 <= n
           4
    endif

     else           ; 24 <= n

        dup 32 gt
        if  ; 32 > n
           5
    else
        -10
        endif

     endif

endif

return


defun main 
5 testIfs printnum endl
11 testIfs printnum endl
28 testIfs printnum endl
35 testIfs printnum endl
return

Observe the output:

line is 
line is 
line is 
line is 
line is defun testIfs 
line is 
line is arg1           
line is 
line is dup 16 gt
line is if   
line is 
line is     dup 8 gt
line is     if
line is 
line is     
line is 
line is 
line is 
line is 
line is 
line is 
line is     else
line is        2
line is     endif
line is 
line is else
line is 
line is      dup 24 gt
line is      if
line is 
line is      
line is 
line is            3
line is         else   
line is            4
line is 
line is 
line is      else   
line is 
line is         dup 32 gt
line is         if
line is            5
line is 
line is 
line is         endif
line is 
line is      endif
line is 
line is endif
line is 
line is return
line is 
line is 
line is defun main 
line is 5 testIfs printnum endl
line is 11 testIfs printnum endl
line is 28 testIfs printnum endl
line is 35 testIfs printnum endl
line is return

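As you can see, it skips (seemingly at random) some of the tab-indented lines, and I don't know why. What do I need to change in my code so that it doesn't randomly skip tabbed lines? Any help is appreciated!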

2 Answers:

Answer 0: (score: 4)

Here is the part that looks for semicolons:

    //Search for semicolons in .J FILE
    semi_token = strpbrk(line, ";\r\n\t");

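It explicitly treats a tab the same as a semicolon, i.e. as the start of a comment. As for why the bug doesn't always occur: I'd guess that your editor sometimes converts the tabs you type into spaces in the file (*.J), so only the lines that still contain a real \t are cut off.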

Answer 1: (score: 1)

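As others have pointed out, your use of strpbrk (line, ";\r\n\t"); returns a pointer to the first occurrence in line of any of ';', '\r', '\n' or '\t'. If your file contains tab characters used for indentation (which it shouldn't, unless it is a Makefile), then you may nul-terminate the line right at its beginning. That is not what you want.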

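However, your choice of strpbrk is ideal for the task. If you remove '\t' from the accept character set, you will be much closer to your intended goal. (You can also drop '\r' when line endings are converted to '\n' on read, as happens for a file opened in text mode on Windows. A "\r\n" file read on a Unix-like system keeps its '\r', so leave it in the set in that case.)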

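In a very simple version of the code, where you don't care about trimming any trailing whitespace between the last non-whitespace character and the start of the comment (or the end of the line), you can simply nul-terminate the line at the pointer returned by strpbrk, e.g.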

#include <stdio.h>
#include <string.h>

#define MAXC 1024

int main (void) {

    char line[MAXC] = "";
    size_t lineno = 0;

    /* read each line from stdin (e.g. redirect file, ./prog <file) */
    while (fgets(line, MAXC, stdin) != NULL)
    {
        char *p = NULL;         /* pointer for strpbrk return */

        /* Search for semicolons in line or newline */
        if ((p = strpbrk (line, ";\n")))
            *p = 0;             /* nul-terminate at ';' or '\n' */

        /* output line (single-quotes simply show trim of whitespace) */
        printf ("%3zu: '%s'\n", ++lineno, line);
    }

    return 0;
}

Example Use/Output

Note: the single quotes in the output are there to show the trailing whitespace that is left behind.

$ ./bin/parsesemisimple <dat/semicmtfile.txt
  1: ''
  2: ''
  3: ''
  4: ''
  5: 'defun testIfs '
  6: ''
  7: 'arg1           '
  ...

Note how in "arg1           ; get n to the top of the stack" there are about 10 spaces between arg1 and the comment character. Leaving whitespace dangling like that is never a good idea.

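To remove the trailing whitespace, you can include ctype.h and use its isspace function to test whether the characters before the comment are whitespace. If they are, just keep backing up until you find the last non-whitespace character. Once you have found it, you nul-terminate right after it.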

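You can do this by adding a few lines inside your strpbrk conditional. Note: when backing up, always make sure that (p > line) so that you don't back up past the beginning of line. You also know that if p is not greater than line, then either the comment starts at the beginning of the line or the line is empty. You can do something like the following: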

#include <ctype.h>
...
        /* Search for semicolons in line or newline */
        if ((p = strpbrk (line, ";\n"))) {
            if (p > line) {         /* test characters in line */
                /* remove all trailing whitespace */
                while (p > line && isspace ((unsigned char)*--p)) {}
                *++p = 0;   /* nul-terminate after last non-whitespace char */
            }               /* before ';' or end of line */
            else
                *p = 0;     /* otherwise nul-terminate at ';' */
        }

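(If you aren't familiar with C Operator Precedence, now would be a good time to make friends with it. Pay attention to the column that says whether associativity is right to left or left to right; it makes a difference.)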

Example Use/Output

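Now you can check the full output and confirm that the comments and all trailing whitespace have been removed. (You can drop the single quotes once you are satisfied.)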

$ ./bin/parsesemicmt <dat/semicmtfile.txt
  1: ''
  2: ''
  3: ''
  4: ''
  5: 'defun testIfs'
  6: ''
  7: 'arg1'
  8: ''
  9: 'dup 16 gt'
 10: 'if'
 11: ''
 12: '    dup 8 gt'
 13: '    if'
 14: ''
 15: '        dup 4 gt'
 16: '    if'
 17: '        0'
 18: '    else'
 19: '        1'
 20: '    endif'
 21: ''
 22: '    else'
 23: '       2'
 24: '    endif'
 25: ''
 26: 'else'
 27: ''
 28: '     dup 24 gt'
 29: '     if'
 30: ''
 31: '        dup 20 gt'
 32: '    if'
 33: '           3'
 34: '        else'
 35: '           4'
 36: '    endif'
 37: ''
 38: '     else'
 39: ''
 40: '        dup 32 gt'
 41: '        if'
 42: '           5'
 43: '    else'
 44: '        -10'
 45: '        endif'
 46: ''
 47: '     endif'
 48: ''
 49: 'endif'
 50: ''
 51: 'return'
 52: ''
 53: ''
 54: 'defun main'
 55: '5 testIfs printnum endl'
 56: '11 testIfs printnum endl'
 57: '28 testIfs printnum endl'
 58: '35 testIfs printnum endl'
 59: 'return'

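Note: as your commented-out code suggests, if you are going to call strtok, you don't need to remove the trailing whitespace at all. If you include a space as one of the delimiters when tokenizing line, any run of consecutive delimiters is treated as a single delimiter and discarded there.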

Look things over and let me know if you have any questions. If I misunderstood your question, let me know and I'm happy to look into it further.