I am using the Stanford Sentiment Treebank dataset, and I am trying to extract the leaves and the nodes. The data looks like this:
(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))
What I want is the following:
i) Leaves with their labels (uni-grams):
[(2 The), (2 Rock), (2 is), (2 destined),...]
ii) Upper nodes with their labels (bi-grams):
[(2 (2 The) (2 Rock)), (2 (2 ``) (2 Conan)), (2 (2 Century) (2 's)),..]
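and so on, until I reach the root of the tree.
I tried to do this with regular expressions, but could not get the right output.
My code for uni-grams, with a bi-gram attempt below it: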
import re
import nltk
location = '.../NLP/Standford_Sentiment_Tree_Data_Set/' +\
'trainDevTestTrees_PTB/trees/train.txt'
text = open(location, 'r')
test = text.readlines()[0]
text.close()
uni_regex = re.compile(r'(\([0-4] \w+\))')
temp01 = uni_regex.findall(test)
# bi-gram
bi_regex = re.compile(r'(\([0-4] \([0-4] \w+\) \([0-4] \w+\)\))')
temp02 = bi_regex.findall(test)
The code above outputs:
['(2 The)', '(2 Rock)', '(2 is)', '(2 destined)', '(2 to)', '(2 be)', '(2 the)', '(2 21st)', '(2 Century)', '(3 new)',...]
It fails to capture (2 ``) and (2 ''), and it extracts (2 Jean) instead of (2 Jean-Claud).
The bi-gram output also fails to capture (2 (2 ``) (2 Conan)).
Is there a way, using nltk or some configuration of regex, to get the result I want without missing any tokens?
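I have looked at, and tried to adapt, the solution given in "NLTK tree data structure, finding a node, it's parent or children", but that question seems to be about finding a specific word among the leaves and displaying the tree structure, whereas I need a solution that produces n-gram-like output as shown above.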
Answer 0 (score: 2)
Don't waste your time with regular expressions; this is exactly what tree classes are for. Use nltk's Tree class like this:
mytree = "(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))"
>>> import nltk
>>> t = nltk.Tree.fromstring(mytree)
>>> print(t)
(3
  (2 (2 The) (2 Rock))
  (4
    (3
      (2 is)
      (4
        (2 destined)
        (2
          ...
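You can then extract and enumerate the leaves, and ask for the corresponding "tree position" of each one (the path from the root to that leaf, as a tuple of child indices):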
>>> leafpos = [ t.leaf_treeposition(n) for n, x in enumerate(t.leaves()) ]
>>> print(leafpos[0:3])
[(0, 0, 0), (0, 1, 0), (1, 0, 0, 0)]
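Finally, you can walk up the tree positions to get the units you want: the subtree dominated by the node immediately above each leaf, by the node two steps above each leaf, and so on: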
>>> level1_subtrees = [ t[path[:-1]] for path in leafpos ]
>>> for x in level1_subtrees:
...     print(x, end=" ")
(2 The) (2 Rock) (2 is) (2 destined) (2 to) (2 be) (2 the) ...
>>> level2_subtrees = [ t[path[:-2]] for path in leafpos ]
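But note that the higher-level subtrees will not look the way you imagine. For example, if you go up two levels from leaf 3 (destined), you do not get a "bigram". You end up on the node labeled 4, which dominates most of the rest of the sentence. Perhaps you are really interested in enumerating all subtrees? In that case, just iterate over t.subtrees().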
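One possible way to pick out only the units listed in the question is to filter t.subtrees() by height. The following is a rough sketch, not the only way to do it: it uses a short, made-up tree instead of the full sentence, and the helper name flat is purely illustrative. It relies on Tree.height(), which returns 2 for a label sitting directly on a word, so a subtree of height 3 has only labeled words as children in this data format.

```python
import nltk

# A short, made-up tree in the same bracketed format as the treebank line.
s = "(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (3 (2 very) (3 good))) (2 .)))"
t = nltk.Tree.fromstring(s)

# height() == 2: a label directly above a single word (a labeled leaf).
unigrams = [st for st in t.subtrees() if st.height() == 2]

# height() == 3: a node whose children are all labeled leaves.
bigrams = [st for st in t.subtrees() if st.height() == 3]


def flat(tree):
    """Render a subtree on a single line (hypothetical helper)."""
    return " ".join(str(tree).split())


print([flat(st) for st in unigrams])
print([flat(st) for st in bigrams])
```

Raising the height threshold step by step walks up towards the root, but as noted above, the larger subtrees quickly stop resembling n-grams.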
If that is not what you want, have a look at the Tree API and choose some other way to select the parts you need.
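As for why the regex in the question drops tokens: \w matches only letters, digits and the underscore, so tokens such as ``, '', 's or a hyphenated name never match \w+. If you want to stay with a regex for the leaf level only, here is a minimal sketch, assuming a token is any run of characters that is neither whitespace nor a parenthesis. The input line is a made-up fragment, and the name leaf_regex is just for illustration. Arbitrarily deep nesting still needs a real tree parser.

```python
import re

# Made-up fragment containing the kinds of tokens the original regex missed.
line = "(2 (2 (2 ``) (2 Conan)) (2 (2 Jean-Claud) (2 's)))"

# A leaf is "(label token)"; the token is any run of characters that are
# neither whitespace nor parentheses (an assumption about the data format).
leaf_regex = re.compile(r"\([0-4] [^\s()]+\)")

leaves = leaf_regex.findall(line)
print(leaves)
```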