Take three tab-separated tokens and make a Prolog "fact"

Time: 2015-01-22 10:14:54

Tags: regex perl sed prolog nlp

Basically, go from foo\tbar\tbaz on each line of input to 'bar'('foo', 'baz').

If any token contains a single quote, it needs to be escaped with a backslash:

don't -> 'don\'t'
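The escaping rule above can be sketched as a small helper. This is only an illustration in Python, not part of the original post; the name `quote_atom` is hypothetical, and escaping literal backslashes as well is an added assumption (the post only mentions single quotes).

```python
def quote_atom(token: str) -> str:
    """Render a token as a single-quoted Prolog atom."""
    # Assumption: backslashes are escaped too, and they are handled first
    # so the backslashes added for quotes are not doubled afterwards.
    escaped = token.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


if __name__ == "__main__":
    print(quote_atom("don't"))
    print(quote_atom("hong kong 's return"))
```

Always quoting, even when a token has no spaces, keeps the helper simple and still gives valid atoms.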

In detail:

I have a file containing sentence constituents in a semi-structured, table-like form:

the grand hall of the hong kong convention  attend by   some # guests
principal representatives of both countries seat on the central dais
representing china  be  mr jiang
britain be  hrh
the principal representatives   be more than    # distinguished guests
hong kong   end with    the playing of the british national anthem
this    follow at   the stroke of midnight
both countries  take part in    the ceremony
the ceremony    start at about  # pm
the ceremony    end about   # am
# royal hong kong police officers   lower   the british hong kong flag
another #   raise   the sar flag
the #   leave for   the royal yacht britannia
the handover of hong kong   hold by the chinese and british governments
the world   cast eye on hong kong
the # governments   hold on schedule
this    be festival for the chinese nation
july # , #  go in   the annals of history
the hong kong compatriots   become master of    this chinese land
hong kong   enter era of    development
history remember    mr deng xiaoping
it  be along    the course
we  resolve the hong kong question
i   wish to express thanks to   all the personages
both china and britain  contribute to   the settlement of the hong kong
the world   support hong kong 's return
i   wish to extend  my cordial greetings and best wishes

As you can see, the columns are separated by tabs. What I want to do is create normal definite clauses from this data, rendering them as:

'attend by'('some # guests','the grand hall of the hong kong convention').
'take part in'('the ceremony','both countries').
be('representing china', 'mr jiang').

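So in the data as it stands, the verb phrase in the middle should become the functor of the new structure, and the entity being acted upon should be the first argument, followed by the principal actor.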
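To make the column mapping concrete, here is a minimal sketch in Python. It assumes each line has exactly three tab-separated fields, uses the argument order of the 'attend by' and 'take part in' examples (third column first, first column second), and always quotes every atom. The names `to_fact` and `q` are hypothetical.

```python
def to_fact(line: str) -> str:
    """Turn 'subject<TAB>verb phrase<TAB>object' into verb(object, subject)."""
    # Unpacking raises ValueError unless the line has exactly three fields.
    subject, verb, obj = line.rstrip("\n").split("\t")

    def q(token: str) -> str:
        # Quote as a Prolog atom, escaping backslashes and single quotes.
        return "'" + token.replace("\\", "\\\\").replace("'", "\\'") + "'"

    return f"{q(verb)}({q(obj)}, {q(subject)})."


if __name__ == "__main__":
    print(to_fact("both countries\ttake part in\tthe ceremony\n"))
```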

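I want these to end up usable in Prolog.

I suspect not all of the data is well-formed, so perhaps I can simply throw those lines away.

I imagine some fancy Perl script or a regex/sed-type operation could do this most efficiently. I need to run this on a large file, so I would like to optimize for efficiency, which is why I am asking here.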

2 Answers:

Answer 0: (score: 1)

Using sed:

sed "s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\3', '\1')/" filename

To leave tokens that contain no spaces unquoted, awk is simpler:

awk -F\\t -vq="'" 'function quote(token) { if(index(token, " ")) { return q token q }; return token } { print quote($2) "(" quote($3) ", " quote($1) ")" }' filename

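As for performance, I suspect the bottleneck will be I/O rather than this program. If it does become a problem, though, you will want to stop messing around with scripting languages and hack together 20 lines of C++.

Edit: In response to the comments (what do I know about Prolog, eh? :P), to always quote and to escape apostrophes inside the quotes, awk is again easier: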

awk -F\\t -vq="'" 'function quote(token) { gsub(q, "\\"q, token); return q token q } { print quote($2) "(" quote($3) ", " quote($1) ")" }' filename

But it can also be done with sed:

sed "s/'/\\\\'/g;s/\(.*\)\t\(.*\)\t\(.*\)/'\2'('\3', '\1')/" filename

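This replaces ' with \' before performing the original substitution. Shell quoting is the reason it needs so many backslashes.

Note that the sed solution requires every line to contain two tabs. Looking at the test input, I am not entirely sure that is the case, so awk may be the better choice for you.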
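For input where some lines do not have two tabs, a filter that drops malformed lines could look like the following. This is a sketch in Python rather than sed or awk; the function name `facts` and the rule "exactly three non-empty fields" are assumptions, since the question only says that malformed data may be thrown away.

```python
import sys
from typing import Iterable, Iterator


def facts(lines: Iterable[str]) -> Iterator[str]:
    """Yield one Prolog fact per well-formed line, skipping the rest."""
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        if len(fields) != 3 or not all(fields):
            continue  # assumption: anything but three non-empty fields is malformed
        # Quote each field as a Prolog atom, escaping backslashes and quotes.
        subject, verb, obj = (
            "'" + f.replace("\\", "\\\\").replace("'", "\\'") + "'" for f in fields
        )
        yield f"{verb}({obj}, {subject})."


if __name__ == "__main__" and len(sys.argv) == 2:
    # Stream the file line by line so large inputs are never held in memory.
    with open(sys.argv[1], encoding="utf-8") as handle:
        sys.stdout.writelines(fact + "\n" for fact in facts(handle))
```

Saved under a hypothetical name such as facts.py, it would be run as `python facts.py filename > facts.pl`.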

Answer 1: (score: 0)

In SWI-Prolog, consider using tokenize_atom/2 (you need a recent version to be able to enter arbitrarily long text constants in source, with ' escaped)

t :- Text = '
the grand hall of the hong kong convention  attend by   some # guests
principal representatives of both countries seat on the central dais
... rest of text...
the world   support hong kong \'s return
i   wish to extend  my cordial greetings and best wishes',
tokenize_atom(Text,L), maplist(writeln,L).

yields

?- t.
the
grand
hall
of
the
hong
kong
...

So you could use a DCG to make sense of the text. It is much easier than going through external tools, I guess...

Edit: let's code Boris' comment:

file_2_statements(File) :-
  atom_codes('\t', Tab),
  open(File, read, S),
  repeat,
   read_line_to_codes(S, L),
   (  L \= end_of_file
   -> % columns: subject <tab> verb phrase <tab> object
      append([Subj,Tab,Verb,Tab,Obj], L),
      maplist(atom_codes, [F,A1,A2], [Verb,Obj,Subj]),
      P =.. [F,A1,A2], assertz(P),
      fail   % back to repeat for the next line
   ;  true   % end of file: leave the loop
   ),
  !,
  close(S).