I am working on a POS tagger, and I have to read from a file that looks like this:
(TOP (S (NP (DT The)
(NNP September-October)
(NN term)
(NN jury))
(AUX (VBD had))
(VP (VBN been)
(VP (VBN charged)
(PP (IN by)
(NP (NNP Fulton)
(NNP Superior)
(NNP Court)
(NNP Judge)
(NP (NNP Durwood)
(NNP Pye))))
(S (NP (-NONE- *))
(AUX (TO to))
(VP (VB investigate)
(NP (NNS reports)
(PP (IN of)
(NP (JJ possible)
(`` ``)
(NNS irregularities)
('' '')
(PP (IN in)
(NP (NP (DT the)
(ADJP (JJ hard-fought))
(NN primary))
(SBAR (WHNP (WDT which))
(S (NP (-NONE- T))
(AUX (VBD was))
(VP (VBN won)
(PP (IN by)
(NP (NNP Mayor-nominate)
(NP (NNP Ivan)
(NNP Allen)
(NNP Jr)
(. .)))))))))))))))))
(. .))
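That is in tree form, though, and I need to convert it into the following form:
DT The NNP September-October NN term...
using Python. The file contains several sentence trees, and each sentence tree must be printed on its own line.
I tried something like this: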
re.split(r'(?!END_OF_TEXT_UNIT)\[TOP]\w+(. .)$', text)
It gives me everything I need, but each POS tag and word ends up on its own line. Please help!
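For reference, here is a minimal sketch of one way the conversion described above might be done. It assumes every tree is a balanced set of parentheses, that each leaf has the shape (TAG word), and that tokens never contain a literal parenthesis (Penn Treebank style files normally write those as -LRB- and -RRB-). The names split_trees and flatten are made up for illustration and are not from any library. Trace leaves such as (-NONE- *) are kept; they would need filtering if they are not wanted.

```python
import re

# A leaf is "(TAG word)" with no nested parentheses inside it.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def split_trees(text):
    """Yield each top-level balanced parenthesised tree found in text."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]


def flatten(tree):
    """Return the 'TAG word TAG word ...' string for one tree."""
    return " ".join(f"{tag} {word}" for tag, word in LEAF.findall(tree))


if __name__ == "__main__":
    # Hypothetical two-tree sample standing in for the real file contents.
    sample = "(TOP (S (NP (DT The)\n(NN jury))\n(. .)))\n(TOP (S (VP (VB go))))"
    for tree in split_trees(sample):
        print(flatten(tree))  # one sentence per output line
```

To run it on a real file, the sample string could be replaced with the file's contents, for example the result of open(path).read(), where path is whatever the input file is called. Counting parentheses rather than splitting on a regex is what keeps each sentence on a single line, since the line breaks inside a tree are simply ignored.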