I am currently working on a Java web server project that needs natural language processing, specifically named entity recognition (NER).
I am using OpenNLP for Java because it makes it easy to add custom training data, and it works well.
However, I also need to extract entities nested inside other entities (nested named entity recognition). I tried to do this in OpenNLP but got a parse error, so my guess is that OpenNLP unfortunately does not support nested entities.
Here is an example of what I need to parse:
Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> .
If this is not possible with OpenNLP, is there another Java NLP library that can do it? And if there is no Java library at all, is there an NLP library in any other language that can?
Any help is appreciated. Thanks!
Answer 0 (score: 1)
The short answer is:
Answer 1 (score: 1)
For named entity recognition in Java, I use the following:
https://github.com/merishav/cleartk-tutorials
You can train models for your own use case; I have trained NER models for person, location, date of birth, and profession. ClearTK gives you a wrapper around MalletCRFClassifier.
Answer 2 (score: 0)
Use this Python (Python 3) script: https://gist.github.com/ttpro1995/cd8c60cfc72416a02713bb93dff9ae6f
It generates multiple un-nested versions of your nested data, one per nesting level.
For the input sentence below (the input data must be tokenized first, so that every tag has whitespace around it):
Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> .
it outputs several sentences, each keeping only the tags of one nesting level:
Remind me to give some presents to John and Charlie .
Remind me to <START:reminder> give some presents to John and Charlie <END> .
Remind me to give some presents to <START:contact> John <END> and <START:contact> Charlie <END> .
The full source code is included here for quick copy-and-paste:
import sys

END_TAG = 0
START_TAG = 1
NOT_TAG = -1


def detect_tag(in_token):
    """
    detect tag in token
    :param in_token:
    :return:
    """
    if "<START:" in in_token:
        return START_TAG
    elif "<END>" == in_token:
        return END_TAG
    return NOT_TAG


def remove_nest_tag(in_str):
    """
    với <START:ORGANIZATION> Sở Cảnh sát Phòng cháy , chữa cháy ( PCCC ) và cứu nạn , cứu hộ <START:LOCATION> Hà Nội <END> <END>
    :param in_str:
    :return:
    """
    state = 0
    taglist = []
    tag_dict = dict()
    sentence_token = in_str.split()
    # detect each token's tag and record its nesting depth
    max_nest = 0
    for index, token in enumerate(sentence_token):
        tag = detect_tag(token)
        if tag > 0:
            state += 1
            if max_nest < state:
                max_nest = state
            token_info = (index, state, token)
            taglist.append(token_info)
            tag_dict[index] = token_info
        elif tag == 0:
            token_info = (index, state, token)
            taglist.append(token_info)
            tag_dict[index] = token_info
            state -= 1
    # generate one sentence per nesting level (level 0 is the plain sentence)
    generate_sentences = []
    for state in range(max_nest + 1):
        generate_sentence_token = []
        for index, token in enumerate(sentence_token):
            if detect_tag(token) >= 0:  # is a tag
                token_info = tag_dict[index]
                if token_info[1] == state:
                    generate_sentence_token.append(token)
            elif detect_tag(token) == -1:  # not a tag
                generate_sentence_token.append(token)
        sentence = ' '.join(generate_sentence_token)
        generate_sentences.append(sentence)
    return generate_sentences


def test():
    tstr2 = "Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> ."
    result = remove_nest_tag(tstr2)
    print("-----")
    for sentence in result:
        print(sentence)


if __name__ == "__main__":
    """
    un-nest dataset for opennlp name finder training data
    """
    # test()
    if len(sys.argv) > 1:
        inpath = sys.argv[1]
        infile = open(inpath, 'r')
        outfile = open(inpath + ".out", 'w')
        for line in infile:
            sentences = remove_nest_tag(line)
            for sentence in sentences:
                outfile.write(sentence + "\n")
        infile.close()
        outfile.close()
    else:
        print("usage: python unnest_data.py input.txt")
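Once the data is split this way, you can train one OpenNLP model per level and run each finder over the same tokens at prediction time. As a rough sketch of the reverse step (the `(start, end, label)` token-index span triples are an assumed format I chose for illustration, not OpenNLP's API), the flat spans reported by each level's model could be merged back into nested inline tags:

```python
def insert_tags(tokens, spans):
    """Render token-index spans (start, end_exclusive, label) as
    OpenNLP-style inline tags; overlapping spans come out nested."""
    opens, closes = {}, {}
    for start, end, label in spans:
        opens.setdefault(start, []).append((end, label))
        closes[end] = closes.get(end, 0) + 1
    out = []
    for i in range(len(tokens) + 1):
        # emit one <END> for every span that finishes before token i
        out.extend(["<END>"] * closes.get(i, 0))
        if i < len(tokens):
            # at the same index, open the longer (outer) span first
            for end, label in sorted(opens.get(i, []), key=lambda s: -s[0]):
                out.append("<START:%s>" % label)
            out.append(tokens[i])
    return " ".join(out)

tokens = "Remind me to give some presents to John and Charlie .".split()
# spans as each per-level model might report them (token indices)
level1 = [(3, 10, "reminder")]                     # "give ... Charlie"
level2 = [(7, 8, "contact"), (9, 10, "contact")]   # "John", "Charlie"
print(insert_tags(tokens, level1 + level2))
```

Running this reproduces the original nested annotation from the example above.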