Question

I have data in the following format:

(TOP (S (PP-LOC (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review)) (PP (IN of) (NP (`` ``) (NP-TTL (DT The) (NN Misanthrope)) ('' '') (PP-LOC (IN at) (NP (NP (NNP Chicago) (POS 's)) (NNP Goodman) (NNP Theatre))))) (PRN (-LRB- -LRB-) (`` ``) (S-HLN (NP-SBJ (VBN Revitalized) (NNS Classics)) (VP (VBP Take) (NP (DT the) (NN Stage)) (PP-LOC (IN in) (NP (NNP Windy) (NNP City))))) (, ,) ('' '') (NP-TMP (NN Leisure) (CC &) (NNS Arts)) (-RRB- -RRB-)))) (, ,) (NP-SBJ-2 (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene)))) (, ,) (VP (VBN played) (NP (-NONE- *)) (PP (IN by) (NP-LGS (NNP Kim) (NNP Cattrall)))) (, ,)) (VP (VBD was) (VP (ADVP-MNR (RB mistakenly)) (VBN attributed) (NP (-NONE- *-2)) (PP-CLR (TO to) (NP (NNP Christina) (NNP Haag))))) (. .)))

(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))

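..... (and more than 7000 others ...)

The data comes from a newspaper. Each new line is a new sentence (starting with 'TOP'). From this data I need only the innermost (tag word) pairs of each sentence (shown in bold in the original post), without the enclosing brackets: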

(IN In)(DT an) (NNP Oct.) (CD 19) (NN review) (IN of) (`` ``) (DT The) (NN Misanthrope) ('' '') (IN at) (NNP Chicago) (POS 's) (NNP Goodman) (NNP Theatre)(-LRB- -LRB-) (`` ``) (VBN Revitalized) (NNS Classics) (VBP Take) (DT the) (NN Stage) (IN in) (NNP Windy) (NNP City) (, ,) ('' '') (NN Leisure) (CC &) (NNS Arts) (-RRB- -RRB-)(, ,) (DT the) (NN role)(IN of) (NNP Celimene) (, ,) (VBN played) (-NONE- *)(IN by)(NNP Kim) (NNP Cattrall) (, ,) (VBD was) (RB mistakenly)(VBN attributed) (-NONE- *-2) (TO to)(NNP Christina) (NNP Haag) (. .) (NNP Ms.) (NNP Haag) (VBZ plays)(NNP Elianti)(. .)

I tried the following:

f = open('filename')
data = f.readlines()
f.close()

This part builds an array of tuples for each line (using a regular expression):

tag_word_train = numpy.empty((5000), dtype='object')
for i in range(0, 5000):
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)', data[i])

This takes a very long time, so I can't tell whether it is correct.

Do you know how to do this efficiently?

Thanks,

Hada

Answer 1

NLTK has a Tree class that may fit your needs. In particular, you will want the class method nltk.tree.Tree.fromstring:

>>> import nltk.tree
>>> nltk.tree.Tree.fromstring("(S (NP (DT The) (N cat)) (VP (V ran)))")
Tree('S', [Tree('NP', [Tree('DT', ['The']), Tree('N', ['cat'])]), Tree('VP', [Tree('V', ['ran'])])])
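Once a line has been parsed into a Tree, the bracketed (tag word) leaves from the question can be collected by keeping only the subtrees of height 2, that is, the nodes whose children are plain words. The following is a minimal sketch, assuming NLTK is installed; the helper name leaf_strings is made up for illustration and is not part of NLTK.

```python
from nltk.tree import Tree

def leaf_strings(line):
    """Return the '(TAG word)' leaves of one bracketed parse as a single string."""
    tree = Tree.fromstring(line)
    # A subtree of height 2 is a tag node sitting directly above a word.
    leaves = tree.subtrees(lambda t: t.height() == 2)
    return " ".join(str(leaf) for leaf in leaves)

line = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
print(leaf_strings(line))
# (NNP Ms.) (NNP Haag) (VBZ plays) (NNP Elianti) (. .)
```

Because subtrees() walks the tree from left to right, the leaves come out in sentence order.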

Answer 2

Try this:

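import re
import numpy

# Read one bracketed parse per line.
with open('filename') as f:
    data = f.readlines()

# One slot per input line.
tag_word_train = numpy.empty(len(data), dtype='object')

# Match innermost parenthesized groups such as "(NNP Haag)".
exp = re.compile(r"\([^()]*\)")

for i, line in enumerate(data):
    # Search the current line, not the whole list of lines.
    tag_word_train[i] = exp.findall(line)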

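Breaking down the regular expression:

\( matches an opening parenthesis

[^()]* matches zero or more characters that are neither an opening nor a closing parenthesis

\) matches a closing parenthesis

(I am assuming that what you want are the terms that do not themselves contain parenthesized terms. If I am wrong in that assumption, the regular expression will not do what you want.)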
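If (tag, word) tuples are wanted rather than raw "(TAG word)" strings, the same idea can be written with two capture groups. The sketch below uses only the standard library; the names LEAF and leaf_pairs are made up for illustration. Unlike a character class such as [\w.-], the class [^\s()] also accepts tokens like `` , 's, & and *-2, all of which occur in the sample data.

```python
import re

# An innermost group: "(" + tag + whitespace + word + ")", where neither
# token contains whitespace or parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def leaf_pairs(line):
    """Return the (tag, word) tuples found in one bracketed parse line."""
    return LEAF.findall(line)

sample = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
print(leaf_pairs(sample))
# [('NNP', 'Ms.'), ('NNP', 'Haag'), ('VBZ', 'plays'), ('NNP', 'Elianti'), ('.', '.')]
```

Groups such as "(TOP (S ..." are skipped because the character after the tag's trailing space is another "(", which the second token is not allowed to contain.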

Answer 3

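The functions provided by nltk.tree can both read the parse and extract the pairs of words and part-of-speech tags that you want in the output: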

>>> import nltk.tree
>>> t = nltk.tree.Tree.fromstring("(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))")
>>> t.pos()
[('Ms.', 'NNP'), ('Haag', 'NNP'), ('plays', 'VBZ'), ('Elianti', 'NNP'), ('.', '.')]
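Applied to a whole file, this could look roughly like the sketch below. It assumes NLTK is installed and that every non-empty line holds exactly one complete parse, as the question describes; the function names and the file path argument are hypothetical. Since pos() returns (word, tag) while the question lists (tag word), the pairs are swapped.

```python
from nltk.tree import Tree

def tag_word_pairs(line):
    """Parse one bracketed line and return its leaves as (tag, word) tuples."""
    tree = Tree.fromstring(line)
    # Tree.pos() yields (word, tag); swap to match the order in the question.
    return [(tag, word) for word, tag in tree.pos()]

def read_tagged_sentences(path):
    """Yield one list of (tag, word) tuples per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield tag_word_pairs(line)

line = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
print(tag_word_pairs(line))
# [('NNP', 'Ms.'), ('NNP', 'Haag'), ('VBZ', 'plays'), ('NNP', 'Elianti'), ('.', '.')]
```

Reading the file through a generator avoids holding all 7000+ lines in memory at once, and the result of each sentence can be consumed as soon as it is parsed.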

Cleaning data efficiently in Python
