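I want to use SWI-Prolog to extract the text between a start string and an end string, for example all titles from a Wikipedia dump. I don't want to use an XML parser, because I want to handle different file types in the same way. I got it working for small files, but I run into problems with large files.
For large files (e.g., the Romanian Wikipedia), Prolog runs out of memory (prolog -G1G -L1G -T1G -s main.pl -t main; see below for the contents of main.pl):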
Welcome to SWI-Prolog (threaded, 64 bits, version 7.4.2)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.
For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).
found: 'Rocarta'
found: 'Muzică'
found: 'Iris (formație românească)'
found: 'Pagina principală'
...[removed hundreds of lines]
found: 'Zadar'
found: 'Australia'
found: 'Slovenia'
found: 'Croația'
ERROR: Out of global stack
Exception: (5,861) between([60, 116, 105, 116, 108, 101, 62], [60, 47, 116, 105, 116, 108, 101, 62], _264890370, [10, 32, 32, 32, 32, 60, 110, 115|...], []) ?
How can I accomplish this task with large input files?
MWE (main.pl):
:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
last_call_optimisation(true).
main :-
    phrase_from_file(between(`<title>`, `</title>`, _), `wiki.xml`).
between(Start, End, Found) -->
    string(_), string(Start), string(Found), string(End),
    { format("found: '~s' \n", [Found]) },
    between(Start, End, _).
between(_, _, []) -->
    remainder(_),
    { format("finished parsing") }.
Example input (wiki.xml):
<mediawiki>
>< Don't use an XML parser! ><
<page><title>Albert Einstein</title></page>
<page><title>Elvis Presley</title></page>
</mediawiki>
Example output (expected):
found: 'Albert Einstein'
found: 'Elvis Presley'
finished parsing
Edit: If we remove the recursive call from between/3, the output changes and no longer matches what I expect:
found: 'Albert Einstein'
found: 'Albert Einstein</title></page>
<page><title>Elvis Presley'
found: 'Elvis Presley'
finished parsing
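Answer 0 (score: 1)
This construct
..., string(_), string(Start), ...
is very inefficient. It turns a linear parse into an exponential one. But there is a very simple solution, because string literals in a DCG perform exact matching: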
:- use_module(library(dcg/basics)).
main(Titles) :-
    %phrase_from_file(between(`<title>`, `</title>`, Titles),`wiki.xml`).
    phrase(between(`<title>`, `</title>`, Titles), `
<mediawiki>
>< Don't use an XML parser! ><
<page><title>Albert Einstein</title></page>
<page><title>Elvis Presley</title></page>
</mediawiki>
`).
between(_Start, _End, []) --> [].
between(Start, End, [Found|Rest]) -->
    Start, string(String), End,
    { atom_codes(Found, String) },
    !, between(Start, End, Rest).
between(Start, End, List) --> [_], between(Start, End, List).
I would simplify the code, though:
...
phrase(tag(`title`, Titles), `
...
tag(_Tag, []) --> [].
tag(Tag, [Found|Rest]) -->
    "<", Tag, ">", string(String), "</", Tag, ">",
    { atom_codes(Found, String) },
    !, tag(Tag, Rest).
tag(Tag, List) --> [_], tag(Tag, List).
I bet this is slightly more efficient on large files. It is also easy to generalize:
...
phrase(tags([`title`, `footnote`], Contents), `
...
tags(_Tags, []) --> [].
tags(Tags, [Key-Found|Rest]) -->
    "<", {member(Tag, Tags)}, Tag, ">", string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found,Key], [String,Tag]) },
    !, tags(Tags, Rest).
tags(Tags, List) --> [_], tags(Tags, List).
But it is not efficient. This should be better (though profiling should prove it):
...
"<", string(Tag), ">", {memberchk(Tag, Tags)}, string(String), "</", Tag, ">",
...
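One rough way to check this without a full profiler is to count inferences for both variants on the same input. The following is only a sketch: the names tags_a//2, tags_b//2, sample/1 and inferences/2 are made up for illustration, and the sample input is a shortened version of the example file. It relies on statistics/2 with the inferences key.

```prolog
:- use_module(library(dcg/basics)).   % string//1
:- use_module(library(lists)).        % member/2, memberchk/2
:- use_module(library(apply)).        % maplist/3

% Variant A: pick a candidate tag first, then match it literally.
tags_a(_Tags, []) --> [].
tags_a(Tags, [Key-Found|Rest]) -->
    "<", { member(Tag, Tags) }, Tag, ">", string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found,Key], [String,Tag]) },
    !, tags_a(Tags, Rest).
tags_a(Tags, List) --> [_], tags_a(Tags, List).

% Variant B: read whatever stands between "<" and ">", then check it.
tags_b(_Tags, []) --> [].
tags_b(Tags, [Key-Found|Rest]) -->
    "<", string(Tag), ">", { memberchk(Tag, Tags) }, string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found,Key], [String,Tag]) },
    !, tags_b(Tags, Rest).
tags_b(Tags, List) --> [_], tags_b(Tags, List).

% Hypothetical sample input as a code list (backquoted text).
sample(`<page><title>Albert Einstein</title></page><page><title>Elvis Presley</title></page>`).

% inferences(+Goal, -Count): inferences spent on the first solution of Goal.
inferences(Goal, Count) :-
    statistics(inferences, Before),
    once(Goal),
    statistics(inferences, After),
    Count is After - Before.
```

A query such as ?- sample(Cs), inferences(phrase(tags_a([`title`], _), Cs), A), inferences(phrase(tags_b([`title`], _), Cs), B). then gives two counts to compare. Note that in variant B, when memberchk/2 fails, string(Tag) is retried with ever longer candidates up to each later ">", which may account for part of the difference.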
Edit: at least on a small piece of code, "<", {member(Tag, Tags)}, Tag, ">" seems to require fewer inferences than "<", string(Tag), ">", {memberchk(Tag, Tags)}.
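Coming back to the original memory problem: the versions above collect every match in a list, which itself grows with the file. If the titles only need to be printed, as in the question, a variant that reports each match and then forgets it might be easier on memory. This is an untested sketch on large dumps, not a measured result; titles//0 and print_titles/1 are names invented here. The idea is that every clause commits with a cut, so no choice points are left behind and the lazy list built by phrase_from_file/2 should become eligible for garbage collection as it is consumed.

```prolog
:- use_module(library(pio)).          % phrase_from_file/2
:- use_module(library(dcg/basics)).   % string//1

% print_titles(+File): print every <title>...</title> found in File.
print_titles(File) :-
    phrase_from_file(titles, File).

% A title at the current position: report it, commit, and continue.
titles -->
    "<title>", string(Codes), "</title>", !,
    { format("found: '~s'~n", [Codes]) },
    titles.
% Otherwise skip one code and commit, so nothing is kept for backtracking.
titles --> [_], !, titles.
% End of input.
titles --> [].
```

It would be called as ?- print_titles('wiki.xml'). One caveat: an opening <title> with no matching close makes string//1 scan to the end of the input before the clause fails, so malformed files could still be expensive.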