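Basically, I want to go from each input line of the form foo\tbar\tbaz
to 'bar'('foo', 'baz').
If any token contains a single quote, it needs to be escaped with a backslash:
don't
-> 'don\'t'
In detail:
I have a file containing sentence constituents in a semi-structured, tabular form: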
the grand hall of the hong kong convention attend by some # guests
principal representatives of both countries seat on the central dais
representing china be mr jiang
britain be hrh
the principal representatives be more than # distinguished guests
hong kong end with the playing of the british national anthem
this follow at the stroke of midnight
both countries take part in the ceremony
the ceremony start at about # pm
the ceremony end about # am
# royal hong kong police officers lower the british hong kong flag
another # raise the sar flag
the # leave for the royal yacht britannia
the handover of hong kong hold by the chinese and british governments
the world cast eye on hong kong
the # governments hold on schedule
this be festival for the chinese nation
july # , # go in the annals of history
the hong kong compatriots become master of this chinese land
hong kong enter era of development
history remember mr deng xiaoping
it be along the course
we resolve the hong kong question
i wish to express thanks to all the personages
both china and britain contribute to the settlement of the hong kong
the world support hong kong 's return
i wish to extend my cordial greetings and best wishes
As you can see, the fields are tab-separated. What I want to do is create proper definite clauses from this data, rendering them as:
'attend by'('some # guests','the grand hall of the hong kong convention').
'take part in'('the ceremony','both countries').
be('representing china', 'mr jiang').
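So in the current data there is a verb phrase in the middle, which should become the basis of the new construct. The entity being acted upon should be the first argument, followed by the main participant.
I would like these to be usable in Prolog eventually.
I guess not all of the data is well formed, so maybe I can just throw those lines away.
I imagine some fancy Perl script or a regex/sed-type operation could do this most efficiently. I need to run this on a large file, so I would like to optimize for efficiency, which is why I am asking here.
Answer 0 (score: 1)
Using sed: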
sed "s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\3', '\1')/" filename
To leave tokens that contain no spaces unquoted, awk is simpler:
awk -F\\t -vq="'" 'function quote(token) { if(index(token, " ")) { return q token q }; return token } { print quote($2) "(" quote($3) ", " quote($1) ")" }' filename
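As for performance, I suspect the bottleneck will be I/O rather than this program. If it does become a problem, though, you will want to stop messing around with scripting languages and knock together 20 lines of C++.
Edit: in response to the comment (what do I know about Prolog, eh? :P), to always quote and to escape apostrophes inside the quotes, awk is again easier: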
awk -F\\t -vq="'" 'function quote(token) { gsub(q, "\\"q, token); return q token q } { print quote($2) "(" quote($3) ", " quote($1) ")" }' filename
But it can also be done with sed:
sed "s/'/\\\\'/g;s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\3', '\1')/" filename
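This replaces ' with \' before performing the original substitution. Shell quoting is the reason it needs so many backslashes.
Note that the sed solution requires every line to contain exactly two tabs. Looking at the test input, I am not entirely sure that is the case, so awk may be the better choice for you.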
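If lines without exactly two tabs need to be dropped explicitly rather than passed through, the same transformation can be sketched in Python instead of sed or awk. This is only a minimal sketch under a few assumptions: the names quote_atom, to_clause and convert are made up for illustration, the argument order follows the commands above (third field, then first field), a trailing period is appended so each line is a complete Prolog fact, and backslashes are doubled as well as apostrophes escaped, which goes beyond what the question asked for.

```python
def quote_atom(token):
    """Wrap a token in single quotes, escaping backslashes and apostrophes.

    Doubling backslashes is an assumption added here; the question only
    asks for apostrophes to be escaped.
    """
    escaped = token.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def to_clause(line):
    """Return a Prolog fact for a line with exactly three non-empty
    tab-separated fields, or None if the line is malformed."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 3 or not all(fields):
        return None
    first, verb, third = fields
    # Middle field becomes the functor; third field comes first, as in
    # the sed and awk commands.
    return "{}({}, {}).".format(
        quote_atom(verb), quote_atom(third), quote_atom(first)
    )


def convert(lines):
    """Lazily yield one fact per well-formed input line, skipping the rest."""
    for line in lines:
        clause = to_clause(line)
        if clause is not None:
            yield clause
```

Passing an open file object to convert processes the input one line at a time, so a large file never has to fit in memory.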
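Answer 1 (score: 0)
In SWI-Prolog, consider using tokenize_atom/2 (you need a recent version to be able to put arbitrarily long text constants in the source, and you have to escape the ')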
t :- Text = '
the grand hall of the hong kong convention attend by some # guests
principal representatives of both countries seat on the central dais
... rest of text...
the world support hong kong \'s return
i wish to extend my cordial greetings and best wishes',
tokenize_atom(Text,L), maplist(writeln,L).
yields
?- t.
the
grand
hall
of
the
hong
kong
...
So you could use a DCG to 'understand' the text. It is much easier than going through external tools, I guess...
Edit: let's code Boris' comment:
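file_2_statements(File) :-
    atom_codes('\t', Tab),
    open(File, read, S),
    repeat,
    read_line_to_codes(S, L),
    (   L \= end_of_file
    ->  % split the line at its tabs; a line that does not fit is skipped
        append([H,Tab,A1,Tab,A2], L),
        maplist(atom_codes, [Hc,Ac1,Ac2], [H,A1,A2]),
        P =.. [Hc,Ac1,Ac2], assert(P),
        fail    % force backtracking so that repeat reads the next line
    ;   true
    ),
    !,          % drop the choice point left by repeat before closing
    close(S).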