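So I have this dblp dataset and need to exclude conferences from it. Here is my code for converting the JSON to CSV, but I need to change it so that it only copies records that are not conferences. My idea is to look for "conference" in the venue, but the code doesn't work properly.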
import json
import csv
with open('test1.json') as lines, open('data3.csv', 'w',encoding='utf-8') as output:
    output = csv.DictWriter(output, ['abstract','authors','n_citation',"references","title","venue","year",'id'],lineterminator='\n')
    output.writeheader()
    for line in lines:
        line = line.strip()
        if line[0] == '{' and line[-1] == '}':
            if line.find("conference")!=True:
                output.writerow(json.loads(line))
Here is a sample of the JSON:
{"abstract": "AdaBoost algorithm based on Haar-like features can achieves high accuracy (above 95%) in object detection.",
"authors": ["Zheng Xu", "Runbin Shi", "Zhihao Sun", "Yaqi Li", "Yuanjia Zhao", "Chenjian Wu"],
"n_citation": 0,
"references": ["0a11984c-ab6e-4b75-9291-e1b700c98d52", "1f4152a3-481f-4adf-a29a-2193a3d4303c", "3c2ddf0a-237b-4d17-8083-c90df5f3514b", "522ce553-29ea-4e0b-9ad3-0ed4eb9de065", "579e5f24-5b13-4e92-b255-0c46d066e306", "5d0b987d-eed9-42ce-9bf3-734d98824f1b", "80656b4d-b24c-4d92-8753-bdb965bcd50a", "d6e37fb1-5f7e-448e-847b-7d1f1271c574"],
"title": "A Heterogeneous System for Real-Time Detection with AdaBoost",
"venue": "high performance computing and communications",
"year": 2016,
"id": "001eef4f-1d00-4ae6-8b4f-7e66344bbc6e"}
{"abstract": "In this paper, a kind of novel jigsaw EBG structure is designed and applied into conformal antenna array",
"authors": ["Yufei Liang", "Yan Zhang", "Tao Dong", "Shan-wei Lu"],
"n_citation": 0,
"references": [],
"title": "A novel conformal jigsaw EBG structure design",
"venue": "international conference on conceptual structures",
"year": 2016,
"id": "002e0b7e-d62f-4140-b015-1fe29a9acbaa"}
If I delete this line, the code works fine:
if line.find("conference")!=True:
Here is a link to download the sample JSON file:
https://drive.google.com/open?id=1056yrc_Y4Y-tAZT52YUDxPPsWYsLcn48
A smaller JSON file: http://s000.tinyupload.com/?file_id=57175973595937350188
Answer 0 (score: 1)
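The following seems to work. Since each line of the input file contains a complete JSON object, it first calls json.loads() to get a Python dictionary, then checks the dictionary to see whether it has a "venue" key and, if so, whether that key's string value contains the substring "conference".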
Also note that I don't think you really need the:
line = line.strip()
if line[0] == '{' and line[-1] == '}':
part, but since I don't have the whole file, I left it in. It does no harm, but it will slow processing down somewhat.
import csv
import json
fields = 'abstract,authors,n_citation,references,title,venue,year,id'.split(',')
with open('test1.json') as lines, \
     open('data3.csv', 'w', encoding='utf-8') as output:
    output = csv.DictWriter(output, fields, lineterminator='\n')
    output.writeheader()
    for line in lines:
        line = line.strip()
        if line[0] == '{' and line[-1] == '}':
            json_obj = json.loads(line)
            if 'conference' not in json_obj.get('venue', ''):
                output.writerow(json_obj)
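As a side note on why the check in the question filters nothing: str.find returns the index of the first match, or -1 when there is none, never a boolean. The sketch below uses two shortened, made-up lines (assumptions for illustration, not rows from the dataset) to show how that comparison behaves next to the membership test used above.

```python
# Hypothetical sample lines, shortened for illustration.
with_conf = '{"venue": "international conference on conceptual structures"}'
without_conf = '{"venue": "high performance computing and communications"}'

# str.find returns an index (or -1 when nothing is found).
idx_with = with_conf.find("conference")
idx_without = without_conf.find("conference")

# True equals 1 in a comparison, so "!= True" only fails when the match
# happens to start at index 1. Both lines therefore pass the test.
passes_with = idx_with != True
passes_without = idx_without != True

# The membership operator gives the intended boolean.
keep_with = "conference" not in with_conf
keep_without = "conference" not in without_conf

print(idx_with, idx_without)
print(passes_with, passes_without)
print(keep_with, keep_without)
```

In other words, the original condition lets nearly every line through, so no conference rows are excluded.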
Edited to answer (I think) the follow-up question in the comments:
import collections
import csv
import json
from pprint import pprint
fields = 'abstract,authors,n_citation,references,title,venue,year,id'.split(',')
# Added.
venue_citations = collections.defaultdict(int)  # Total number of citations per venue.
with open('test1.json') as lines, \
     open('data3.csv', 'w', encoding='utf-8') as output:
    output = csv.DictWriter(output, fields, lineterminator='\n')
    output.writeheader()
    for line in lines:
        line = line.strip()
        if line[0] == '{' and line[-1] == '}':
            json_obj = json.loads(line)
            venue = json_obj.get('venue', '')
            if 'conference' not in venue:
                output.writerow(json_obj)
                venue_citations[venue] += json_obj["n_citation"]  # Update count.
pprint(dict(venue_citations))
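The same filter-and-tally logic can be tried without any files. This is only a minimal sketch, assuming one JSON object per line as described above. The function name tally_non_conference and the sample records are hypothetical, and it assumes the tally should cover only the records that are kept. It also swaps the line[0] check for startswith/endswith, which do not raise on blank lines.

```python
import collections
import json

def tally_non_conference(lines):
    """Return (kept_records, citations_per_venue) for non-conference venues."""
    kept = []
    citations = collections.Counter()
    for line in lines:
        line = line.strip()
        # startswith/endswith are safe on empty strings, unlike line[0].
        if not (line.startswith('{') and line.endswith('}')):
            continue
        record = json.loads(line)
        venue = record.get('venue', '')
        if 'conference' not in venue:
            kept.append(record)
            citations[venue] += record.get('n_citation', 0)
    return kept, citations

# Made-up records, including a blank line that must be skipped.
sample = [
    '{"venue": "high performance computing and communications", "n_citation": 2}',
    '',
    '{"venue": "international conference on conceptual structures", "n_citation": 5}',
    '{"venue": "high performance computing and communications", "n_citation": 3}',
]
kept, citations = tally_non_conference(sample)
print(len(kept), dict(citations))
```

Passing an open file object instead of the list works the same way, since both yield one line at a time.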
Answer 1 (score: 0)
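You can use the json module to access the different fields easily. If you are able to represent the JSON objects in test1.json as a list, you can call json.load(open('test1.json','r')) to load the data as a list of JSON objects. If that is not possible, you can try the following approach.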
import json
json_objs = list()
# Iterate through the JSON data and create JSON objects.
with open('test.json') as lines:
    s_buffer = list()
    for line in lines:
        s_buffer.append(line)
        if '}' in line:
            json_objs.append(json.loads(''.join(s_buffer)))
            s_buffer = list()
# Check whether each venue is a conference or not.
output_list = list()
for obj in json_objs:
    if 'conference' not in obj['venue']:
        output_list.append(obj)
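Writing to the file while processing the data can hurt performance, so I append the output records to output_list, which can be used later to write the CSV file.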
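Both ideas from this answer, loading a file that holds a single JSON array with json.load and writing the collected list to CSV afterwards, might look roughly like this. It is a sketch under assumptions: io.StringIO stands in for the real files, the two records are made up, and the field list is shortened to three columns.

```python
import csv
import io
import json

fields = ['title', 'venue', 'year']  # shortened field list for the example

# Stand-in for a file whose whole content is one JSON array (made-up records).
source = io.StringIO(
    '[{"title": "A", "venue": "high performance computing and communications", "year": 2016},'
    ' {"title": "B", "venue": "international conference on conceptual structures", "year": 2016}]'
)
records = json.load(source)  # a Python list of dicts

# Keep only the records whose venue does not mention "conference".
output_list = [r for r in records if 'conference' not in r.get('venue', '')]

# Write everything in one go; io.StringIO stands in for the opened CSV file.
target = io.StringIO()
writer = csv.DictWriter(target, fields, lineterminator='\n')
writer.writeheader()
writer.writerows(output_list)
csv_text = target.getvalue()
print(csv_text)
```

With real files, source and target would be replaced by the objects returned from open(); DictWriter.writerows accepts the whole list at once.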