I have a dataset that represents one sentence of parsed text. It looks like this:
[{
"address": 1,
"ctag": "Ne",
"feats": "_",
"head": 6,
"lemma": "Ashraf",
"rel": "SBJ",
"tag": "Ne",
"word": "Ashraf"
}, {
"address": 2,
"ctag": "AJ",
"feats": "_",
"head": 1,
"lemma": "Ghani",
"rel": "NPOSTMOD",
"tag": "AJ",
"word": "Ghani"
}, {
"address": 3,
"ctag": "P",
"feats": "_",
"head": 6,
"lemma": "in",
"rel": "ADV",
"tag": "P",
"word": "in"
}, {
"address": 4,
"ctag": "N",
"feats": "_",
"head": 3,
"lemma": "Kabul",
"rel": "POSDEP",
"tag": "N",
"word": "Kabul"
}, {
"address": 5,
"ctag": "N",
"feats": "_",
"head": 6,
"lemma": "born",
"rel": "NVE",
"tag": "N",
"word": "born"
}, {
"address": 6,
"ctag": "V",
"feats": "_",
"head": 0,
"lemma": "شدشو",
"rel": "ROOT",
"tag": "V",
"word": "شده_است"
}, {
"address": 7,
"ctag": "PUNC",
"feats": "_",
"head": 6,
"lemma": ".",
"rel": "PUNC",
"tag": "PUNC",
"word": "."
}]
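The sentence ends at "address": 7, the token whose "ctag" is "PUNC". My real dataset contains several sentences. First, I want to split the data into sentences, treating each token with ctag PUNC (or the word .) as the end of a sentence. Then, within each sentence, starting with the first, I want to extract two or three specific entities: for example, a token with 'ctag' = 'Ne' followed by a token with 'ctag' = 'N', where the relation between the two is given by the token with 'rel' = 'NVE'. The result should then be stored in a list.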
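The sentence-splitting step described above could be sketched roughly as follows. This is a minimal sketch, assuming that every token tagged PUNC (or whose word is ".") closes a sentence; if the parser also tags commas as PUNC, the check would need to be narrowed. The function name split_sentences and the abbreviated sample list are illustrative and not part of the original data file.

```python
def split_sentences(tokens):
    """Group a flat token list into sentences, closing each one at a PUNC token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        # Assumption: any PUNC token (or a literal ".") ends the sentence.
        if tok["ctag"] == "PUNC" or tok["word"] == ".":
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack closing punctuation
        sentences.append(current)
    return sentences


# Abbreviated copy of the sample sentence above (only the fields used here).
sample = [
    {"word": "Ashraf", "ctag": "Ne", "rel": "SBJ"},
    {"word": "Ghani", "ctag": "AJ", "rel": "NPOSTMOD"},
    {"word": "in", "ctag": "P", "rel": "ADV"},
    {"word": "Kabul", "ctag": "N", "rel": "POSDEP"},
    {"word": "born", "ctag": "N", "rel": "NVE"},
    {"word": "شده_است", "ctag": "V", "rel": "ROOT"},
    {"word": ".", "ctag": "PUNC", "rel": "PUNC"},
]

# Repeat the sample to imitate a file that holds two sentences.
sentences = split_sentences(sample + sample)
print(len(sentences))                            # 2
print([t["word"] for t in sentences[0][:2]])     # ['Ashraf', 'Ghani']
```

Each inner list can then be processed on its own, so entity checks never cross a sentence boundary.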
What I have done so far:
import json

# One list per (ctag, rel) combination of interest
n1, n2, n3, n4, n5, n6, n7 = [], [], [], [], [], [], []

# Read the parsed tokens from the file
with open('../data/parse.txt', 'r', encoding='utf-8') as myfile:
    obj = json.load(myfile)

# Sort each token's word into the list matching its ctag and rel
for w in obj:
    if w['ctag'] == 'Ne' and w['rel'] == 'SBJ':
        n1.append(w['word'])
    if w['ctag'] == 'N' and w['rel'] == 'SBJ':
        n6.append(w['word'])
    if w['ctag'] == 'N' and w['rel'] == 'MOZ':
        n2.append(w['word'])
    if w['rel'] == 'NVE' and w['ctag'] == 'N':
        n3.append(w['word'])
    if w['rel'] == 'MOZ' and w['ctag'] == 'Ne':
        n4.append(w['word'])
    if w['rel'] == 'MOS' and w['ctag'] == 'Ne':
        n5.append(w['word'])
    if w['rel'] == 'OBJ' and w['ctag'] == 'N':
        n7.append(w['word'])
This means that, across 27 addresses, I found the following entities:
rel=SBJ & Ne: ['Ashraf']
rel=MOZ & Ne ['President', 'Capital', 'Lecturer', 'University']
rel=MOS & Ne ['Ashraf', 'Kabul', 'Ahmad']
rel=MOZ & N ['Afghanistan', 'Afghanistan', 'Kabul']
rel=NVE & N ['born']
rel=SBJ & N ['Kabul']
rel=OBJ & N ['Located']
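As a possible simplification, the separate lists n1 to n7 could be kept in a single dictionary keyed by the (rel, ctag) pair, so that new combinations need no extra if branch. This is only a sketch; the function name group_words and the abbreviated sample list are illustrative assumptions, and the sample covers just the one sentence shown earlier.

```python
from collections import defaultdict


def group_words(tokens):
    """Collect words under their (rel, ctag) pair instead of separate lists."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[(tok["rel"], tok["ctag"])].append(tok["word"])
    return groups


# Abbreviated copy of the sample sentence (only the fields used here).
sample = [
    {"word": "Ashraf", "ctag": "Ne", "rel": "SBJ"},
    {"word": "Ghani", "ctag": "AJ", "rel": "NPOSTMOD"},
    {"word": "in", "ctag": "P", "rel": "ADV"},
    {"word": "Kabul", "ctag": "N", "rel": "POSDEP"},
    {"word": "born", "ctag": "N", "rel": "NVE"},
    {"word": "شده_است", "ctag": "V", "rel": "ROOT"},
    {"word": ".", "ctag": "PUNC", "rel": "PUNC"},
]

groups = group_words(sample)
print(groups[("SBJ", "Ne")])   # ['Ashraf']
print(groups[("NVE", "N")])    # ['born']
```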
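What I want:
- It should find the first "." or "PUNC", then check the entities within that first sentence. If a token matches if word['ctag'] == 'Ne' and word['rel'] == 'MOZ': then it is the relation entity, and the remaining ones are named entities, e.g. subject & object.
- Then it should move on to the next "PUNC" and retrieve the Ne and rel tokens there.
-> The output I expect for each sentence: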
(e1, relation, e2)
-> (Kabul, located, Afghanistan)
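One way the (e1, relation, e2) output might be produced for a single sentence is sketched below. The selection rule is an assumption, not something the data confirms: e1 is taken as the first token with rel SBJ, the relation as the first token with rel NVE or OBJ (the rels under which 'born' and 'Located' appear in the output above), and e2 as the first other token tagged N or Ne. The names extract_triple, RELATION_RELS and ENTITY_TAGS are hypothetical, and the rule has only been checked against the one sentence shown in this question.

```python
# Assumed tag sets; adjust them to the tagset of the real parser.
RELATION_RELS = {"NVE", "OBJ"}
ENTITY_TAGS = {"N", "Ne"}


def extract_triple(sentence):
    """Return (e1, relation, e2) for one sentence, or None if a part is missing."""
    subject = next((t for t in sentence if t["rel"] == "SBJ"), None)
    relation = next((t for t in sentence if t["rel"] in RELATION_RELS), None)
    if subject is None or relation is None:
        return None
    # e2: first noun-like token that is neither the subject nor the relation.
    other = next(
        (t for t in sentence
         if t["ctag"] in ENTITY_TAGS and t is not subject and t is not relation),
        None,
    )
    if other is None:
        return None
    return (subject["word"], relation["word"], other["word"])


# Abbreviated copy of the sample sentence (only the fields used here).
sample = [
    {"word": "Ashraf", "ctag": "Ne", "rel": "SBJ"},
    {"word": "Ghani", "ctag": "AJ", "rel": "NPOSTMOD"},
    {"word": "in", "ctag": "P", "rel": "ADV"},
    {"word": "Kabul", "ctag": "N", "rel": "POSDEP"},
    {"word": "born", "ctag": "N", "rel": "NVE"},
    {"word": "شده_است", "ctag": "V", "rel": "ROOT"},
    {"word": ".", "ctag": "PUNC", "rel": "PUNC"},
]

triple = extract_triple(sample)
print(triple)   # ('Ashraf', 'born', 'Kabul')
```

Running this over every sentence and collecting the non-None results would give one triple per sentence. A stricter version could follow the head field to confirm that e1 and e2 actually attach to the relation token.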