Detecting sentences in a JSON file and extracting related entities

Time: 2019-03-05 11:20:59

Tags: python json

I have a dataset that represents one sentence of parsed text. It looks like this:

[{
    "address": 1,
    "ctag": "Ne",
    "feats": "_",
    "head": 6,
    "lemma": "Ashraf",
    "rel": "SBJ",
    "tag": "Ne",
    "word": "Ashraf"
}, {
    "address": 2,
    "ctag": "AJ",
    "feats": "_",
    "head": 1,
    "lemma": "Ghani",
    "rel": "NPOSTMOD",
    "tag": "AJ",
    "word": "Ghani"
}, {
    "address": 3,
    "ctag": "P",
    "feats": "_",
    "head": 6,
    "lemma": "in",
    "rel": "ADV",
    "tag": "P",
    "word": "in"
}, {
    "address": 4,
    "ctag": "N",
    "feats": "_",
    "head": 3,
    "lemma": "Kabul",
    "rel": "POSDEP",
    "tag": "N",
    "word": "Kabul"
}, {
    "address": 5,
    "ctag": "N",
    "feats": "_",
    "head": 6,
    "lemma": "born",
    "rel": "NVE",
    "tag": "N",
    "word": "born"
}, {
    "address": 6,
    "ctag": "V",
    "feats": "_",
    "head": 0,
    "lemma": "شدشو",
    "rel": "ROOT",
    "tag": "V",
    "word": "شده_است"
}, {
    "address": 7,
    "ctag": "PUNC",
    "feats": "_",
    "head": 6,
    "lemma": ".",
    "rel": "PUNC",
    "tag": "PUNC",
    "word": "."
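}]

The sentence ends at "address": 7, the token whose "ctag" is "PUNC". My original dataset contains several sentences. First, I want to detect each sentence by the PUNC token (".") that closes it. Then, within each sentence, I want to extract two or three specific entities: for example, a token with 'ctag' = 'Ne' followed by a token with 'ctag' = 'N', where the relation between these two entities is the token with 'rel' = 'NVE'. The result should be stored in a list.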

What I have done:

import json

# one result list per (ctag, rel) combination of interest
n1, n2, n3, n4, n5, n6, n7 = ([] for _ in range(7))

# read file (UTF-8, since some tokens are not ASCII)
with open('../data/parse.txt', 'r', encoding='utf-8') as myfile:
    obj = json.load(myfile)

# collect words by tag/relation over the whole file, ignoring sentence boundaries
for w in obj:
    if w['ctag'] == 'Ne' and w['rel'] == 'SBJ':
        n1.append(w['word'])
    if w['ctag'] == 'N' and w['rel'] == 'SBJ':
        n6.append(w['word'])
    if w['ctag'] == 'N' and w['rel'] == 'MOZ':
        n2.append(w['word'])
    if w['rel'] == 'NVE' and w['ctag'] == 'N':
        n3.append(w['word'])
    if w['rel'] == 'MOZ' and w['ctag'] == 'Ne':
        n4.append(w['word'])
    if w['rel'] == 'MOS' and w['ctag'] == 'Ne':
        n5.append(w['word'])
    if w['rel'] == 'OBJ' and w['ctag'] == 'N':
        n7.append(w['word'])

This means that, out of 27 addresses, I found the following entities:

rel=SBJ & Ne: ['Ashraf']
rel=MOZ & Ne ['President', 'Capital', 'Lecturer', 'University']
rel=MOS & Ne ['Ashraf', 'Kabul', 'Ahmad']
rel=MOZ & N ['Afghanistan', 'Afghanistan', 'Kabul']
rel=NVE & N ['born']
rel=SBJ & N ['Kabul']
rel=OBJ & N ['Located']

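What I want:
- It should find the first "." (PUNC) and then check the entities in that first sentence. For example, if word['ctag'] == 'Ne' and word['rel'] == 'MOZ', the token is a relation entity; the remaining tokens are named entities such as the subject and object.
- It should then move on to the next PUNC and retrieve the Ne and rel values for that sentence.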
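
The sentence-by-sentence scan described above could be sketched roughly as follows. This is only a minimal sketch: it assumes the file is one flat list of token dicts, and that a sentence closes at a token whose ctag is "PUNC" and whose word is "."; the function name split_sentences and the end_words parameter are hypothetical, not part of any library.

```python
def split_sentences(tokens, end_words=(".",)):
    """Group a flat token list into sentences.

    Assumption: a sentence ends at a token with ctag == "PUNC" whose
    word is in end_words (other PUNC tokens, e.g. commas, do not split).
    """
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.get("ctag") == "PUNC" and tok.get("word") in end_words:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a closing PUNC
        sentences.append(current)
    return sentences


# Made-up sample with only the fields the function reads
tokens = [
    {"word": "Ashraf", "ctag": "Ne"},
    {"word": "born", "ctag": "N"},
    {"word": ".", "ctag": "PUNC"},
    {"word": "Kabul", "ctag": "N"},
    {"word": "located", "ctag": "N"},
    {"word": ".", "ctag": "PUNC"},
]
sentences = split_sentences(tokens)
print([[t["word"] for t in s] for s in sentences])
# [['Ashraf', 'born', '.'], ['Kabul', 'located', '.']]
```

Each inner list can then be inspected on its own, so the tag and relation checks no longer mix tokens from different sentences.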

-> The output I expect for each sentence: (e1, relation, e2) -> (Kabul, located, Afghanistan)
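
One possible way to build such a triple from the tokens of a single sentence is sketched below. The role rules are assumptions made for illustration only: e1 is taken to be the first N/Ne token with rel SBJ, the relation is the first token with rel NVE, and e2 is the first remaining N/Ne token. The function name extract_triple and its parameters are hypothetical, and the rules would need adjusting for sentences such as the "located" example, where the relation word carries a different rel.

```python
def extract_triple(sentence, subject_rels=("SBJ",), relation_rels=("NVE",),
                   entity_ctags=("N", "Ne")):
    """Return (e1, relation, e2) for one sentence, or None if a role is missing.

    Assumed rules (not confirmed by the question):
      e1       = first entity token whose rel is in subject_rels
      relation = first token whose rel is in relation_rels
      e2       = first other entity token in the sentence
    """
    subject = next((t for t in sentence
                    if t["rel"] in subject_rels and t["ctag"] in entity_ctags), None)
    relation = next((t for t in sentence if t["rel"] in relation_rels), None)
    if subject is None or relation is None:
        return None
    other = next((t for t in sentence
                  if t["ctag"] in entity_ctags
                  and t is not subject and t is not relation), None)
    if other is None:
        return None
    return (subject["word"], relation["word"], other["word"])


# The sentence from the question, reduced to the fields used here
sentence = [
    {"word": "Ashraf", "ctag": "Ne", "rel": "SBJ"},
    {"word": "Ghani", "ctag": "AJ", "rel": "NPOSTMOD"},
    {"word": "in", "ctag": "P", "rel": "ADV"},
    {"word": "Kabul", "ctag": "N", "rel": "POSDEP"},
    {"word": "born", "ctag": "N", "rel": "NVE"},
    {"word": ".", "ctag": "PUNC", "rel": "PUNC"},
]
triple = extract_triple(sentence)
print(triple)  # ('Ashraf', 'born', 'Kabul')
```

Calling this once per sentence and appending every non-None result to a list would give one (e1, relation, e2) tuple per sentence.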

0 answers:

No answers yet.