Extracting text from a large file in Prolog

Date: 2017-09-06 13:50:47

Tags: file prolog

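I want to use SWI-Prolog to extract the text between a start string and an end string, for example all titles from a Wikipedia dump. I don't want to use an XML parser, because I want to handle different file types the same way. I got it working for small files, but I run into problems with large ones.

For large files (e.g. the Romanian Wikipedia), Prolog runs out of memory (prolog -G1G -L1G -T1G -s main.pl -t main; see the contents of main.pl below):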

Welcome to SWI-Prolog (threaded, 64 bits, version 7.4.2)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

found: 'Rocarta' 
found: 'Muzică' 
found: 'Iris (formație românească)' 
found: 'Pagina principală'
...[removed hundreds of lines]
found: 'Zadar' 
found: 'Australia' 
found: 'Slovenia' 
found: 'Croația'
ERROR: Out of global stack
   Exception: (5,861) between([60, 116, 105, 116, 108, 101, 62], [60, 47, 116, 105, 116, 108, 101, 62], _264890370, [10, 32, 32, 32, 32, 60, 110, 115|...], []) ?

How can I accomplish this with large input files?

MWE (main.pl):

:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
last_call_optimisation(true).

main :- 
    phrase_from_file(between(`<title>`, `</title>`, _), `wiki.xml`).

between(Start, End, Found) --> 
    string(_), string(Start), string(Found), string(End), 
    { format("found: '~s' \n", [Found]) }, 
    between(Start, End, _).
between(_, _, []) --> 
    remainder(_), 
    { format("finished parsing") }.

Sample input (wiki.xml):

<mediawiki>
    >< Don't use an XML parser! ><
    <page><title>Albert Einstein</title></page>
    <page><title>Elvis Presley</title></page>
</mediawiki>

Sample output (expected):

found: 'Albert Einstein' 
found: 'Elvis Presley' 
finished parsing

Edit: If we remove the recursive call from between/3, the output changes and no longer matches what I expect:

 found: 'Albert Einstein' 
 found: 'Albert Einstein</title></page>
     <page><title>Elvis Presley' 
 found: 'Elvis Presley' 
 finished parsing

1 answer:

Answer 0 (score: 1)

This construct

..., string(_), string(Start),  ...

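is very inefficient: it turns a linear parse into an exponential one. There is a very simple fix, though, because string literals in a DCG body match exactly: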

:- use_module(library(dcg/basics)).

main(Titles) :-
  %phrase_from_file(between(`<title>`, `</title>`, Titles),`wiki.xml`).
  phrase(between(`<title>`, `</title>`, Titles), `
<mediawiki>
    >< Don't use an XML parser! ><
    <page><title>Albert Einstein</title></page>
    <page><title>Elvis Presley</title></page>
</mediawiki>
  `).


between(_Start, _End, []) --> [].
between(Start, End, [Found|Rest]) -->
    Start, string(String), End,
    { atom_codes(Found, String) },
    !, between(Start, End, Rest).
between(Start, End, List) --> [_], between(Start, End, List).
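For the large-file case from the question, the same idea can be pointed at a file through phrase_from_file/2, as the commented-out line in main/1 hints. The sketch below prints each match as soon as it is found instead of collecting a list, and cuts after every step so that no choice points are left behind. The names print_between//2 and print_file/1 are hypothetical, not part of the original answer, and whether garbage collection then reclaims the consumed part of the lazy list on a multi-gigabyte dump is an assumption that should be checked on the real data.

```prolog
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).

% print_between//2 is a hypothetical name, not from the original answer.
% Clause 1: a Start ... End pair begins here; print the text between them.
print_between(Start, End) -->
    Start, string(String), End,
    !,
    { format("found: '~s'~n", [String]) },
    print_between(Start, End).
% Clause 2: no match at this position; skip one code and commit to that.
print_between(Start, End) --> [_], !, print_between(Start, End).
% Clause 3: end of input.
print_between(_, _) --> [].

% print_file/1 is a hypothetical helper: scan File lazily for titles.
print_file(File) :-
    phrase_from_file(print_between(`<title>`, `</title>`), File).
```

With the sample wiki.xml from the question, `?- print_file('wiki.xml').` should print one `found:` line per title.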

I would simplify the code, though:

...
phrase(tag(`title`, Titles), `
...

tag(_Tag, []) --> [].
tag(Tag, [Found|Rest]) -->
    "<", Tag, ">", string(String), "</", Tag, ">",
    { atom_codes(Found, String) },
    !, tag(Tag, Rest).
tag(Tag, List) --> [_], tag(Tag, List).
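To try this in isolation, here is a self-contained sketch that repeats tag//2 and runs it on the sample input from the question. The wrapper sample_titles/1 is a hypothetical name used only for illustration.

```prolog
:- use_module(library(dcg/basics)).

% tag//2 as defined above, repeated so the sketch loads on its own.
tag(_Tag, []) --> [].
tag(Tag, [Found|Rest]) -->
    "<", Tag, ">", string(String), "</", Tag, ">",
    { atom_codes(Found, String) },
    !, tag(Tag, Rest).
tag(Tag, List) --> [_], tag(Tag, List).

% sample_titles/1 is a hypothetical helper, not from the original answer.
sample_titles(Titles) :-
    phrase(tag(`title`, Titles), `
<mediawiki>
    >< Don't use an XML parser! ><
    <page><title>Albert Einstein</title></page>
    <page><title>Elvis Presley</title></page>
</mediawiki>
`).
```

The query `?- sample_titles(T).` yields `T = ['Albert Einstein', 'Elvis Presley']`.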

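I'd bet that on large files the simplified version is slightly more efficient. It is also easy to generalize:

...
phrase(tags([`title`, `footnote`], Contents), `
...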

tags(_Tags, []) --> [].
tags(Tags, [Key-Found|Rest]) -->
    "<", {member(Tag, Tags)}, Tag, ">", string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found,Key], [String,Tag]) },
    !, tags(Tags, Rest).
tags(Tags, List) --> [_], tags(Tags, List).

but it is not very efficient. This might be better (though profiling should confirm it):

...
"<", string(Tag), ">", {memberchk(Tag, Tags)}, string(String), "</", Tag, ">",
...

Edit: at least on a small piece of code, "<", {member(Tag, Tags)}, Tag, ">" seems to require fewer inferences than "<", string(Tag), ">", {memberchk(Tag, Tags)}.
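A comparison like this can be reproduced with statistics/2. The sketch below is only one way to set it up: the names tags_member//2, tags_memberchk//2, inferences/2 and compare_variants/3 are hypothetical, and the counts include the small overhead of the measuring helper itself. One plausible reason for the difference is that string(Tag) with an unbound Tag can keep extending past the first ">" on backtracking whenever memberchk/2 fails, but that is a guess rather than something measured here.

```prolog
:- use_module(library(dcg/basics)).
:- use_module(library(lists)).
:- use_module(library(apply)).

% Variant 1: pick a tag from the list first, then match it literally.
tags_member(_Tags, []) --> [].
tags_member(Tags, [Key-Found|Rest]) -->
    "<", { member(Tag, Tags) }, Tag, ">", string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found,Key], [String,Tag]) },
    !, tags_member(Tags, Rest).
tags_member(Tags, List) --> [_], tags_member(Tags, List).

% Variant 2: read any tag name first, then check it against the list.
tags_memberchk(_Tags, []) --> [].
tags_memberchk(Tags, [Key-Found|Rest]) -->
    "<", string(Tag), ">", { memberchk(Tag, Tags) },
    string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found,Key], [String,Tag]) },
    !, tags_memberchk(Tags, Rest).
tags_memberchk(Tags, List) --> [_], tags_memberchk(Tags, List).

% inferences/2 (hypothetical helper): inferences used by the first
% solution of Goal, including the helper's own small overhead.
inferences(Goal, Count) :-
    statistics(inferences, I0),
    once(Goal),
    statistics(inferences, I1),
    Count is I1 - I0.

% compare_variants/3 (hypothetical helper): run both variants on the same
% code list and require that they extract the same Key-Found pairs.
compare_variants(Input, N1, N2) :-
    Tags = [`title`, `footnote`],
    inferences(phrase(tags_member(Tags, L1), Input), N1),
    inferences(phrase(tags_memberchk(Tags, L2), Input), N2),
    L1 == L2.
```

For example, `?- compare_variants(`<page><title>A</title></page>`, N1, N2).` binds N1 and N2 to the two inference counts for that input.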