Cleaning data in Python efficiently

Time: 2014-12-16 07:16:45

Tags: python python-2.7 nlp

I have data in the following format:

(TOP (S (PP-LOC (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review)) (PP (IN of) (NP (`` ``) (NP-TTL (DT The) (NN Misanthrope)) ('' '') (PP-LOC (IN at) (NP (NP (NNP Chicago) (POS 's)) (NNP Goodman) (NNP Theatre))))) (PRN (-LRB- -LRB-) (`` ``) (S-HLN (NP-SBJ (VBN Revitalized) (NNS Classics)) (VP (VBP Take) (NP (DT the) (NN Stage)) (PP-LOC (IN in) (NP (NNP Windy) (NNP City))))) (, ,) ('' '') (NP-TMP (NN Leisure) (CC &) (NNS Arts)) (-RRB- -RRB-)))) (, ,) (NP-SBJ-2 (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene)))) (, ,) (VP (VBN played) (NP (-NONE- *)) (PP (IN by) (NP-LGS (NNP Kim) (NNP Cattrall)))) (, ,)) (VP (VBD was) (VP (ADVP-MNR (RB mistakenly)) (VBN attributed) (NP (-NONE- *-2)) (PP-CLR (TO to) (NP (NNP Christina) (NNP Haag))))) (. .)))

(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))

..... (and more than 7000 others ...)

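The data comes from a newspaper. Each new line is a new sentence (starting with 'TOP'). From this data I need only the bold parts of each sentence, that is, the innermost (tag word) pairs without the brackets that enclose them: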

(IN In)(DT an) (NNP Oct.) (CD 19) (NN review) (IN of) (`` ``) (DT The) (NN Misanthrope)   ('' '')  (IN at)  (NNP Chicago) (POS 's) (NNP Goodman) (NNP Theatre)(-LRB- -LRB-) (`` ``)     (VBN Revitalized) (NNS Classics) (VBP Take) (DT the) (NN Stage)  (IN in)   (NNP Windy) (NNP    City) (, ,) ('' '') (NN Leisure) (CC &) (NNS Arts) (-RRB- -RRB-)(, ,) (DT the) (NN role)(IN of)  (NNP Celimene) (, ,) (VBN played) (-NONE- *)(IN by)(NNP Kim) (NNP Cattrall) (, ,) (VBD was)  (RB mistakenly)(VBN attributed) (-NONE- *-2) (TO to)(NNP Christina) (NNP Haag) (. .)

(NNP Ms.) (NNP Haag) (VBZ plays)(NNP Elianti)(. .)

I tried the following:

f = open('filename')
data = f.readlines()
f.close()

This part creates an array holding a list of tuples for each line (using a regular expression):

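import re
import numpy

# One slot per line; each slot holds the list of (tag, word) tuples found on that line.
tag_word_train = numpy.empty((5000), dtype='object')
for i in range(0, 5000):
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)', data[i])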

This takes a very long time, so I can't tell whether it is correct.

Do you know how to do this efficiently?

Thanks,

Hada

3 answers:

Answer 0: (score: 2)

NLTK's Tree class may fit your needs. In particular, you will want the class method nltk.tree.Tree.fromstring:

>>> import nltk.tree
>>> nltk.tree.Tree.fromstring("(S (NP (DT The) (N cat)) (VP (V ran)))")
Tree('S', [Tree('NP', [Tree('DT', ['The']), Tree('N', ['cat'])]), Tree('VP', [Tree('V', ['ran'])])])
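Once a line has been parsed into a Tree, the (tag word) pairs are the subtrees that sit directly above a leaf. A minimal sketch of pulling them out, assuming NLTK 3 is installed; the helper name tag_word_pairs is made up for illustration:

```python
from nltk.tree import Tree

def tag_word_pairs(line):
    """Return (tag, word) tuples for one bracketed parse (hypothetical helper)."""
    tree = Tree.fromstring(line)
    # A subtree of height 2 has only leaves as children, i.e. it is a
    # part-of-speech node such as (DT The).
    preterminals = tree.subtrees(lambda sub: sub.height() == 2)
    return [(sub.label(), sub[0]) for sub in preterminals]

pairs = tag_word_pairs("(S (NP (DT The) (N cat)) (VP (V ran)))")
print(pairs)
# [('DT', 'The'), ('N', 'cat'), ('V', 'ran')]
```

The pairs come out in sentence order, with the tag first, which is the order the question asks for.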

Answer 1: (score: 1)

Try this:

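import re

# Read one bracketed parse per line.
with open('filename') as f:
    data = f.readlines()

# Match innermost groups: parentheses that contain no nested parentheses.
exp = re.compile(r"\([^()]*\)")

# One list of "(TAG word)" strings per line. A plain list grows with the
# data, so no fixed-size numpy array is needed.
tag_word_train = []
for line in data:
    tag_word_train.append(exp.findall(line))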

Breaking down the regular expression:

\( matches an opening parenthesis

[^()]* matches zero or more characters that are neither an opening nor a closing parenthesis

\) matches a closing parenthesis

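(I am assuming that what you want are the terms that do not themselves contain parenthesized terms. If I am wrong in that assumption, the regex will not do what you want.)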
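If tuples are more convenient than "(TAG word)" strings, the same idea works with two capturing groups. This is only a sketch using the standard library; the names LEAF and leaf_pairs are made up for illustration. Note that the character class [\w.-] from the question cannot match tokens such as `,`, `&`, `*`, `'s` or the quote tags, so pairs like (, ,) and (-NONE- *) would be dropped; excluding only parentheses and whitespace keeps them.

```python
import re

# Innermost "(TAG word)" group: two whitespace-separated tokens, neither of
# which contains parentheses or whitespace.
LEAF = re.compile(r"\(([^()\s]+)\s+([^()\s]+)\)")

def leaf_pairs(line):
    """Return the (tag, word) tuples found in one bracketed parse."""
    return LEAF.findall(line)

sample = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
print(leaf_pairs(sample))
# [('NNP', 'Ms.'), ('NNP', 'Haag'), ('VBZ', 'plays'), ('NNP', 'Elianti'), ('.', '.')]
```

Compiling the pattern once and calling it per line keeps the loop cheap for a file of a few thousand sentences.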

Answer 2: (score: 0)

The functions provided by nltk.tree can both read the parse and extract the word and part-of-speech tag pairs you want in the output:

>>> import nltk.tree
>>> t = nltk.tree.Tree.fromstring("(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))")
>>> t.pos()
[('Ms.', 'NNP'), ('Haag', 'NNP'), ('plays', 'VBZ'), ('Elianti', 'NNP'), ('.', '.')]
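To run this over the whole file, the lines can be parsed one at a time instead of being loaded into a fixed-size array. A rough sketch, assuming NLTK 3 and one complete parse per line; iter_tag_word is a hypothetical helper name, and it swaps the (word, tag) order returned by pos() to the (tag, word) order shown in the question:

```python
from nltk.tree import Tree

def iter_tag_word(lines):
    """Yield one list of (tag, word) tuples per non-blank line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        # Tree.pos() returns (word, tag); swap to match the requested order.
        yield [(tag, word) for word, tag in Tree.fromstring(line).pos()]

lines = [
    "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))\n",
    "\n",
]
for pairs in iter_tag_word(lines):
    print(pairs)
# [('NNP', 'Ms.'), ('NNP', 'Haag'), ('VBZ', 'plays'), ('NNP', 'Elianti'), ('.', '.')]
```

Because the function accepts any iterable of strings, an open file handle can be passed in place of the list, so only one sentence is held in memory at a time.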