I am trying to use NLTK to find the context of a word. I have two sentences:
sentences=pd.DataFrame({"sentence": ["The weather was good so I went swimming", "Because of the good food we took desert"]})
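I want to find out what the word "good" refers to. My idea is to chunk the sentences (using the code from the tutorial here) and then check whether the word "good" and a noun are in the same node. If they are not, "good" refers to a noun before or after it.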
First, I build the chunker as in the tutorial:
import nltk
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs; the words themselves are dropped.
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)
    def parse(self, sentence):
        # sentence is a list of (word, POS tag) pairs.
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
NPChunker = ChunkParser(train_sents)
Then I apply it to my sentence:
sentence = sentences["sentence"][0]
tags = nltk.pos_tag(sentence.lower().split())
result = NPChunker.parse(tags)
print(result)
The result looks like this:
(S
  (NP the/DT weather/NN)
  was/VBD
  (NP good/JJ)
  so/RB
  (NP i/JJ)
  went/VBD
  swimming/VBG)
Now I want to find which node the word "good" is in. I have not come up with a better way than counting the words in the nodes and leaves. The word "good" is the fourth word in the sentence (index 3, counting from zero).
structured_sentence = []
for n in range(len(result)):
    structured_sentence.append(list(result[n]))
structure_length = []
for n in result:
    if isinstance(n, nltk.tree.Tree):
        if n.label() == 'NP':
            print(n)
            structure_length.append(len(n))
    else:
        print(str(n) + " is a leaf")
        structure_length.append(1)
By summing up the word counts, I can tell where the word "good" is:
structure_frame = pd.DataFrame({"structure": structured_sentence, "length": structure_length})
structure_frame["cumsum"] = structure_frame["length"].cumsum()
Is there an easier way to determine the node or leaf of a word, and to find out which word "good" refers to?
Best, Alex
Answer 0 (score: 4)
It is easiest to find your word in the list of leaves. You can then translate the leaf index into a tree index, which is a path down the tree. To see what is grouped with good, go up one level and examine the subtree that this picks out.
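First, find out the position of good in your sentence. (You can skip this step if you still have the untagged sentence as a list of tokens.)
words = [w for w, t in result.leaves()]
Now we find the linear position of good and convert it to a tree path:
>>> position = words.index("good")
>>> treeposition = result.leaf_treeposition(position)
>>> print(treeposition)
(2, 0)
A "treeposition" is a path down the tree, expressed as a tuple. (NLTK trees can be indexed with tuples as well as integers.) To see the sisters of good, stop one step before you reach the end of the path:
>>> result[treeposition[:-1]]
Tree('NP', [('good', 'JJ')])
There you have it: a subtree with one leaf, the pair ('good', 'JJ').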
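The steps above can be collected into one small function. What follows is a minimal sketch, not part of the original answer: the helper name chunk_containing is hypothetical, and the tree is rebuilt from hand-written IOB tags (an assumption chosen to reproduce the tree printed in the question) so that it runs without the CoNLL-2000 corpus or a trained chunker.

```python
from nltk.chunk import conlltags2tree

def chunk_containing(tree, word):
    """Return (path, parent) for the first leaf whose word equals `word`.

    Hypothetical helper: `path` is the leaf's tree position and `parent`
    is the subtree one level up. For a word outside any chunk, `parent`
    is the whole sentence tree (label 'S').
    """
    words = [w for w, _tag in tree.leaves()]
    position = words.index(word)             # raises ValueError if absent
    path = tree.leaf_treeposition(position)  # e.g. (2, 0)
    return path, tree[path[:-1]]             # drop the last step to go up

# Hand-written (word, POS, IOB chunk tag) triples mirroring the
# question's output; a real run would take these from the chunker.
conlltags = [
    ("the", "DT", "B-NP"), ("weather", "NN", "I-NP"),
    ("was", "VBD", "O"),
    ("good", "JJ", "B-NP"),
    ("so", "RB", "O"),
    ("i", "JJ", "B-NP"),
    ("went", "VBD", "O"), ("swimming", "VBG", "O"),
]
result = conlltags2tree(conlltags)

path, chunk = chunk_containing(result, "good")
print(path)            # tree position of the leaf
print(chunk.label())   # label of the enclosing node
print(chunk.leaves())  # every (word, tag) pair grouped with "good"

# Tuple indexing and chained integer indexing reach the same leaf.
print(result[2, 0] == result[2][0])
```

If the returned chunk holds only the word itself, as it does here, the chunker has not grouped "good" with any noun, and the neighbouring NP chunks are the next place to look.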