I have data in the following format:
(TOP (S (PP-LOC (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review)) (PP (IN of) (NP (`` ``) (NP-TTL (DT The) (NN Misanthrope)) ('' '') (PP-LOC (IN at) (NP (NP (NNP Chicago) (POS 's)) (NNP Goodman) (NNP Theatre))))) (PRN (-LRB- -LRB-) (`` ``) (S-HLN (NP-SBJ (VBN Revitalized) (NNS Classics)) (VP (VBP Take) (NP (DT the) (NN Stage)) (PP-LOC (IN in) (NP (NNP Windy) (NNP City))))) (, ,) ('' '') (NP-TMP (NN Leisure) (CC &) (NNS Arts)) (-RRB- -RRB-)))) (, ,) (NP-SBJ-2 (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene)))) (, ,) (VP (VBN played) (NP (-NONE- *)) (PP (IN by) (NP-LGS (NNP Kim) (NNP Cattrall)))) (, ,)) (VP (VBD was) (VP (ADVP-MNR (RB mistakenly)) (VBN attributed) (NP (-NONE- *-2)) (PP-CLR (TO to) (NP (NNP Christina) (NNP Haag))))) (. .)))
(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))
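..... (and 7,000+ more ......)
The data come from a newspaper. Each new line is a new sentence (starting with 'TOP'). From this data I need only the leaf (tag word) pairs of each sentence, without the surrounding brackets: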
(IN In)(DT an) (NNP Oct.) (CD 19) (NN review) (IN of) (`` ``) (DT The) (NN Misanthrope) ('' '') (IN at) (NNP Chicago) (POS 's) (NNP Goodman) (NNP Theatre)(-LRB- -LRB-) (`` ``) (VBN Revitalized) (NNS Classics) (VBP Take) (DT the) (NN Stage) (IN in) (NNP Windy) (NNP City) (, ,) ('' '') (NN Leisure) (CC &) (NNS Arts) (-RRB- -RRB-)(, ,) (DT the) (NN role)(IN of) (NNP Celimene) (, ,) (VBN played) (-NONE- *)(IN by)(NNP Kim) (NNP Cattrall) (, ,) (VBD was) (RB mistakenly)(VBN attributed) (-NONE- *-2) (TO to)(NNP Christina) (NNP Haag) (. .)
(NNP Ms.) (NNP Haag) (VBZ plays)(NNP Elianti)(. .)
I tried the following:
f = open('filename')
data = f.readlines()
f.close()
This part builds an array holding, for each line, a list of tuples (using a regular expression):
tag_word_train = numpy.empty((5000), dtype = 'object')
for i in range(0,5000) :
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)',data[i])
This takes a very long time, so I can't tell whether it is correct.
Do you know an efficient way to do this?
Thanks,
Hada
Answer 0 (score: 2)
The nltk Tree class may fit your needs. In particular, you want the class method nltk.tree.Tree.fromstring:
>>> import nltk.tree
>>> nltk.tree.Tree.fromstring("(S (NP (DT The) (N cat)) (VP (V ran)))")
Tree('S', [Tree('NP', [Tree('DT', ['The']), Tree('N', ['cat'])]), Tree('VP', [Tree('V', ['ran'])])])
Answer 1 (score: 1)
Try this:
import re
import numpy

with open('filename') as f:
    data = f.readlines()

# One slot per input line, so the array fits however many sentences there are
tag_word_train = numpy.empty(len(data), dtype='object')
exp = re.compile(r"\([^()]*\)")
for i, line in enumerate(data):
    # Search the current line, not the whole list of lines
    tag_word_train[i] = exp.findall(line)
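Breaking down the regex:
\(
matches an opening parenthesis
[^()]*
matches zero or more characters that are neither an opening nor a closing parenthesis
\)
matches a closing parenthesis
(I'm assuming that what you want are the terms that do not themselves contain parenthesized terms. If I'm wrong in that assumption, the regex won't do what you want.)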
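As a variation on the pattern above, here is a minimal sketch that uses capture groups so each match comes back as a (tag, word) tuple instead of a bracketed string. The names LEAF and leaf_pairs are made up for this illustration. The character class [^\s()] is used rather than [\w.-] because tags and words such as "," , "``", "&" and "*" contain characters that [\w.-] does not match.

```python
import re

# "(TAG word)": two whitespace-separated parts, neither containing
# whitespace or parentheses, so only innermost (leaf) groups match.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def leaf_pairs(line):
    """Return the (tag, word) tuples of one bracketed sentence (illustrative helper)."""
    return LEAF.findall(line)

example = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
pairs = leaf_pairs(example)
print(pairs)
```

Compiling the pattern once and calling it per line keeps the work linear in the size of the file; applying it to every line is then just [leaf_pairs(line) for line in data].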
Answer 2 (score: 0)
nltk.tree provides functions both to read the parses and to extract the word and part-of-speech tag pairs you want in your output:
>>> import nltk.tree
>>> t = nltk.tree.Tree.fromstring("(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))")
>>> t.pos()
[('Ms.', 'NNP'), ('Haag', 'NNP'), ('plays', 'VBZ'), ('Elianti', 'NNP'), ('.', '.')]
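To apply this to the whole file, something along these lines might work. It is only a sketch: it assumes nltk is installed and that the file really holds one complete bracketed sentence per line, and the helper names tagged_leaves and read_tagged_sentences are invented for this example. Since pos() returns (word, tag), the pairs are swapped to match the (tag word) order asked for in the question.

```python
from nltk.tree import Tree

def tagged_leaves(line):
    """Return the (tag, word) pairs of one bracketed parse, in sentence order."""
    tree = Tree.fromstring(line)
    # Tree.pos() yields (word, tag); swap to the (tag, word) order from the question.
    return [(tag, word) for word, tag in tree.pos()]

def read_tagged_sentences(path):
    """Parse a file with one bracketed sentence per line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines; every other line is assumed to be a full parse.
        return [tagged_leaves(line) for line in f if line.strip()]

example = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
print(tagged_leaves(example))
```

Unlike a regex, this fails on a line whose brackets are unbalanced, which can be useful for spotting damaged input early.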