How do I extract elements from each line of a JSON Lines file?

Time: 2019-05-26 14:18:04

Tags: python jsonlines

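I have a jsonl file in which each line contains a sentence and the tokens found in that sentence. I want to extract the tokens from every line of the JSON Lines file, but my loop only returns the tokens from the last line.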

Here is the input.

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is the second sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"second","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}

I tried running the following code:

import jsonlines

with jsonlines.open('path/to/file') as reader:
    for obj in reader:
        data = obj['tokens']  # just extract the tokens
        data = [(i['text'], i['id']) for i in data]  # elements from the tokens

data

Actual result:

[('This', 0), ('is', 1), ('the', 2), ('first', 3), ('sentence', 4), ('.', 5)]

The result I want to get:

(image not available)

Additional question

Some tokens contain a "label" instead of an "id". How can I incorporate that into the code? An example:

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is coded in python.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"coded","id":2},
{"text":"in","id":3},
{"text":"python","label":"Programming"},
{"text":".","id":5}]}

2 answers:

Answer 0: (score: 1)

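import jsonlines

# Write one row per token: sentence number, token text, 1-based token id.
# Note: print() separates the fields with spaces by default.
with open('data.csv', 'w') as f:
    print('Sentence', 'Word', 'ID', file=f)
    with jsonlines.open('path/to/file') as reader:
        for sentence_no, obj in enumerate(reader):
            for token in obj['tokens']:
                print(sentence_no + 1, token['text'], token['id'] + 1, file=f)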
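Because print separates its arguments with spaces, the file produced this way is not comma-separated despite its .csv name. A minimal sketch of the same loop using the standard-library csv module follows. The function name write_token_rows is hypothetical, and a list of already-parsed records stands in for the jsonlines reader.

```python
import csv
import io

def write_token_rows(records, out):
    """Write one CSV row per token: sentence number, word, 1-based id."""
    writer = csv.writer(out)
    writer.writerow(['Sentence', 'Word', 'ID'])
    for sentence_no, obj in enumerate(records, start=1):
        for token in obj['tokens']:
            writer.writerow([sentence_no, token['text'], token['id'] + 1])

# Hypothetical sample standing in for the objects yielded by the reader
sample = [
    {"text": "Hi.", "tokens": [{"text": "Hi", "id": 0}, {"text": ".", "id": 1}]},
]
buffer = io.StringIO()
write_token_rows(sample, buffer)
print(buffer.getvalue())
```

With a real file, open('data.csv', 'w', newline='') can be passed as out and the jsonlines reader as records.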

Answer 1: (score: 1)

Some issues with, and changes to, the code

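  • You reassign the variable data on every iteration of the loop, so you only see the result for the last JSON line, whereas you want to extend a list each time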

  • You want to use enumerate on the reader iterator to get the sentence number as the first item of each tuple
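The first point can be reproduced without any file. In the sketch below, records is a hypothetical stand-in for the objects the reader yields:

```python
# Hypothetical stand-in for the parsed lines of the file
records = [
    {"tokens": [{"text": "first", "id": 3}]},
    {"tokens": [{"text": "second", "id": 3}]},
]

# Reassigning inside the loop keeps only the last line's tokens
for obj in records:
    overwritten = [(t["text"], t["id"]) for t in obj["tokens"]]

# Extending a list created before the loop keeps every line's tokens
collected = []
for obj in records:
    collected.extend((t["text"], t["id"]) for t in obj["tokens"])

print(overwritten)  # [('second', 3)]
print(collected)    # [('first', 3), ('second', 3)]
```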

The code then changes to

import jsonlines

data = []
# Open the JSON Lines file
with jsonlines.open('file.txt') as reader:
    # Iterate over each line of the reader via enumerate
    for idx, obj in enumerate(reader):

        # Append this line's tokens to the result
        data.extend([(idx+1, i['text'], i['id']+1) for i in obj['tokens']])  # elements from the tokens

print(data)

Or it can be made more compact with a double for loop inside the list comprehension itself

import jsonlines

# Open the file, iterate over the tokens and make the tuples
with jsonlines.open('file.txt') as reader:
    result = [(idx+1, i['text'], i['id']+1) for idx, obj in enumerate(reader) for i in obj['tokens']]

print(result)

The output will be

[
(1, 'This', 1), 
(1, 'is', 2), 
(1, 'the', 3), 
(1, 'first', 4), 
(1, 'sentence', 5), 
(1, '.', 6), 
(2, 'This', 1), 
(2, 'is', 2), 
(2, 'the', 3), 
(2, 'second', 4), 
(2, 'sentence', 5), 
(2, '.', 6)
]
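Neither answer covers the additional question, where a token carries "label" instead of "id": the lookup i['id'] raises KeyError for such a token. One possible approach, sketched under the assumption that the label should simply take the place of the id in the tuple, is to check which key is present. The helper name token_rows is hypothetical, and json.loads on an inline string stands in for the jsonlines reader.

```python
import json

def token_rows(records):
    """Return (sentence_no, text, id_or_label) tuples for every token."""
    rows = []
    for sentence_no, obj in enumerate(records, start=1):
        for token in obj['tokens']:
            if 'id' in token:
                value = token['id'] + 1      # 1-based id, as in the answers
            else:
                value = token.get('label')   # None if neither key is present
            rows.append((sentence_no, token['text'], value))
    return rows

# Hypothetical shortened sample line, not the asker's full record
lines = [
    '{"text": "in python", "tokens": [{"text": "in", "id": 3}, {"text": "python", "label": "Programming"}]}',
]
rows = token_rows(json.loads(line) for line in lines)
print(rows)  # [(1, 'in', 4), (1, 'python', 'Programming')]
```

With the jsonlines package, the reader object can be passed to token_rows directly in place of the generator.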