How can lexing efficiency be improved?

Date: 2019-01-18 18:35:43

Tags: performance prolog tokenize lexical-analysis

When parsing a large 3 gigabyte file with a DCG, efficiency is important.

The current version of my lexer mostly uses the or-predicate ;/2, but I have read that indexing can help.

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.
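
To illustrate the quoted mechanism (this sketch is mine, not from the question or the manual): given clauses whose first argument is a distinct atom, a call with that argument bound is dispatched by hashing straight to the matching clause instead of trying each clause in turn.

% Illustrative sketch of first-argument indexing.
% The call month_days(feb, N) hashes on feb, jumps directly to the
% matching clause, and leaves no choice point behind.
month_days(jan, 31).
month_days(feb, 28).
month_days(mar, 31).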

Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?


Details

Note: I changed some of the names before copying the source code into this question. If you find a mistake, feel free to edit it here or leave me a comment and I will gladly fix it.

Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this:

% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If the double_quotes flag is set to `codes`, values written with "" will not be matched.

:- use_module(library(pio)). 
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).

lexer(Tokens) -->
   white_space,
   (
       (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

white_space -->
   [Char], { code_type(Char, space) }, !, white_space.
white_space -->
    "--", whole_line, !, white_space.
white_space -->
   [].

whole_line --> "\n", !.
whole_line --> [_], whole_line.

digit(D) -->
   [D],
      { code_type(D, digit) }.

digits([D|T]) -->
   digit(D),
   !,
   digits(T).
digits([]) -->
   [].

number(D, N) -->
   digits(Ds),
      { number_chars(N, [D|Ds]) }.

letter(L) -->
   [L], { code_type(L, alpha) }.

alphanum([A|T]) -->
   [A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
   [].

% Note: these two alphanum/1 clauses are ordinary predicates, not DCG rules;
% nothing above calls them, so they appear to be leftovers from editing.
alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).

identifier(L, Id) -->
   alphanum(As),
      { atom_codes(Id, [L|As]) }.

Here are some helper predicates used for development and testing.

read_file_for_lexing_and_user_review(Path) :-
    open(Path,read,Input),
    read_input_for_user_review(Input), !,
    close(Input).

read_file_for_lexing_and_performance(Path,Limit) :-
    open(Path,read,Input),
    read_input_for_performance(Input,0,Limit), !,
    close(Input).

read_input(Input) :-
    at_end_of_stream(Input).

read_input(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    read_input(Input).

read_input_for_user_review(Input) :-
    at_end_of_stream(Input).

read_input_for_user_review(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line_for_user_review(Line),
    nl,
    print('Press spacebar to continue or any other key to exit: '),
    get_single_char(Key),
    process_user_continue_or_exit_key(Key,Input).

read_input_for_performance(Input,Count,Limit) :-
    Count >= Limit.

read_input_for_performance(Input,_,_) :-
    at_end_of_stream(Input).

read_input_for_performance(Input,Count0,Limit) :-
    % print(Count0),
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    Count is Count0 + 1,
    read_input_for_performance(Input,Count,Limit).

process_user_continue_or_exit_key(32,Input) :-  % space bar
    nl, nl,
    read_input_for_user_review(Input).

process_user_continue_or_exit_key(Key,_Input) :-
    Key \= 32.

lex_line_for_user_review(Line) :-
    lex_line(Line,TokList),
    print(Line),
    nl,
    print(TokList),
    nl.

lex_line(Line,TokList) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

lex_line(Line) :-
    string_chars(Line,Code_line),
    phrase(lexer(_TokList),Code_line).

read_user_input_for_lexing_and_user_review :-
    print('Enter a line to parse or just Enter to exit: '),
    nl,
    read_string(user, "\n", "\r", _, String),
    nl,
    lex_line_for_user_review(String),
    nl,
    continue_user_input_for_lexing_and_user_review(String).

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,N),
    N > 0,
    read_user_input_for_lexing_and_user_review.

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,0).

read_user_input_for_lexing_and_user_review/0 lets a user enter a line at the terminal for lexing and then review the resulting tokens.

read_file_for_lexing_and_user_review/1 reads a file for lexing and lets the user review the tokens for each line, one line at a time.

read_file_for_lexing_and_performance/2 reads a file for lexing, with a limit on the number of lines to lex. It is used to gather basic performance statistics for measuring efficiency, and is meant to be used with time/1.
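
For example, a timing run over the first 10,000 lines of a file might look like this (the path and limit here are placeholders):

% The path and line limit are hypothetical, for illustration only.
?- time(read_file_for_lexing_and_performance('path/to/data.txt', 10000)).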

2 Answers:

Answer 0 (score: 3)

Solution:

You should replace this:

lexer(Tokens) -->
   white_space,
   (
      (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

with this:

lexer(Tokens) -->
   white_space,
   (
      (
         op_token(Token), ! % replaces the long, blindly searched ;/2 chain with a call to the new predicate op_token//1, whose clauses get indexed access on the first argument in the standard Prolog way
      ;
         digit(D),  !, number(D, N),
         { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
         {  member((Id, Token), [ (div, tokDiv),
                                 (mod, tokMod),
                                 (where, tokWhere)]),
            !
      ;  Token = tokVar(Id)
         }
      ;  [_],
         { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;
      [],
      { Tokens = [] }
   ).

%%%
op_token(tokColon)        --> ":".
op_token(tokLParen)       --> "(".
op_token(tokRParen)       --> ")".
op_token(tokLMusta)       --> "{".
op_token(tokRMusta)       --> "}".
op_token(tokSlash)        --> "\\".
op_token(tokImpl)         --> "->".
op_token(tokPlus)         --> "+".
op_token(tokMinus)        --> "-".
op_token(tokTimes)        --> "*".
op_token(tokEqual)        --> "=".
op_token(tokLt)           --> "<".
op_token(tokGt)           --> ">".
op_token(tokUnderscore)   --> "_".
op_token(tokPeriod)       --> ".".
op_token(tokForwardSlash) --> "/".
op_token(tokComma)        --> ",".
op_token(tokSemicolon)    --> ";".

Edit by Guy Coder

I tested this using the example data posted in the question, loaded into a list where each item in the list is one line of the data converted to character codes. Then lexer was called with time/1 on each item in the list, and the test over the list was repeated 10,000 times. The data was loaded into a list and converted to character codes before time/1 so that those steps would not skew the results. Each of these runs was repeated 5 times to get a consistent set of data.
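
A minimal sketch of that harness (my reconstruction, not the author's actual test code; Lines is assumed to already hold one pre-converted character list per line of data):

% Time 10,000 passes of the lexer over the pre-converted lines.
benchmark(Lines) :-
    time(repeat_lex(10000, Lines)).

repeat_lex(0, _) :- !.
repeat_lex(N, Lines) :-
    N > 0,
    forall(member(Line, Lines), phrase(lexer(_Tokens), Line)),
    N1 is N - 1,
    repeat_lex(N1, Lines).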

For the runs below, all of the different versions of the lexer were extended to cover all 7-bit ASCII characters, which significantly increased the number of cases for the special characters.

The version of Prolog used for the following was SWI-Prolog 8.0.

For the version given in the question:

Version: 1

:- set_prolog_flag(double_quotes,chars).

% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)

For the version posted above in this answer:

Version: 2

:- set_prolog_flag(double_quotes,chars).

% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)

Version 2 is a significant improvement over version 1, gained by using indexing.

Studying the code further, and viewing the DCG with listing/1, which exposes the two hidden arguments used to implicitly pass the state representation, op_token shows:

op_token(tokUnderscore,['_'|A], A).

Notice that the first argument is not the character being searched for, whereas in this answer the indexed code is written as

c_digit(0'0,0).

where the first argument is the character being searched for and the second argument is the result.

So changing this

op_token(Token), !

to this

[S], { special_character_indexed(S,Token) }

with indexed clauses such as

special_character_indexed( ';' ,tokSemicolon).
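
Filled out for the single-character tokens of this lexer, the table would look something like this (my sketch, abbreviated; the timed version was extended to cover all 7-bit ASCII special characters, and the two-character token "->" still needs separate handling):

% Sketch of the single-character token table (abbreviated).
special_character_indexed(':', tokColon).
special_character_indexed('(', tokLParen).
special_character_indexed(')', tokRParen).
special_character_indexed('{', tokLMusta).
special_character_indexed('}', tokRMusta).
special_character_indexed('\\', tokSlash).
special_character_indexed('+', tokPlus).
special_character_indexed('-', tokMinus).
special_character_indexed('*', tokTimes).
special_character_indexed('=', tokEqual).
special_character_indexed('<', tokLt).
special_character_indexed('>', tokGt).
special_character_indexed('_', tokUnderscore).
special_character_indexed('.', tokPeriod).
special_character_indexed('/', tokForwardSlash).
special_character_indexed(',', tokComma).
special_character_indexed(';', tokSemicolon).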


Version: 3

:- set_prolog_flag(double_quotes,chars).

% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)

The results of version 3 are slightly, but consistently, better than those of version 2.

Finally, as noted in a comment by AntonDanilov, changing the double_quotes flag to atom:

Version: 4

:- set_prolog_flag(double_quotes,atom).

% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)

Version 4 is almost the same as version 3.

Looking at just the CPU numbers, using indexing is faster, e.g. (version 1) 151.875 vs. (version 3) 74.547.

Answer 1 (score: 3)

First of all, this is dumb code:

token(T) -->
    ( "1", !, { T = one }
    ; "2", !, { T = two }
    ; "3", !, { T = three }
    ).

This is not as dumb:

token(T) --> one_two_three(T).

one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".

But it is still not very good. Maybe better:

token(T) --> [X], { one_two_three(X, T) }.

one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).

The last example also looks dumb, but remember that now you have indexing on the first argument. You read once, there is no choice point, no backtracking.
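
You can see this at the top level (an illustrative session of mine; it assumes double_quotes is set to codes so that "2" denotes a code list): the query succeeds deterministically, with no prompt for alternatives.

% Illustrative session; with double_quotes set to codes, "2" is [0'2].
?- set_prolog_flag(double_quotes, codes).
true.

?- phrase(token(T), "2").
T = two.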

But if you really want to know how to write efficient code, you need to measure where the time and space goes. Have you measured?
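
In SWI-Prolog, for instance, time/1 (and profile/1 for a per-predicate breakdown) can be used for this; an illustrative query against the lexer from the question:

% Illustrative only; assumes the lexer from the question is loaded
% with the double_quotes flag set to chars.
?- string_chars("x -> y", Chars), time(phrase(lexer(Tokens), Chars)).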

But if you really want to know how to solve it, maybe read "Craft of Prolog"; I don't understand all of that book, but I remember it has a big section on DCGs.

But if you really want to parse large files in this format, maybe find an existing library in another language; it could be much faster than the fastest Prolog.