Question

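I have an 11 GB dataset of Twitter handle metadata in JSON, and I can't work out how to identify duplicate Twitter handles and merge their metadata into a single row per handle. Sample data is below: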

[{'hits': 1,
  'curated': 0,
  'created': '2016-02-02T16:20:50.494Z',
  'domain': 'B_E-R',
  'meta_type': 'type:Username',
  'type': 'Username',
  'id': 'P4AS41',
  'name': '@balashovplay'},
 {'hits': 2,
  'display_name': '@AubreyLee91 (Aubrey Lee)',
  'curated': 0,
  'common_names': ['Aubrey Lee'],
  'created': '2017-07-27T01:11:01.413Z',
  'type': 'Username',
  'created_at': '2015-03-05T20:24:51.279Z',
  'domain': 'BV5',
  'alias': ['3063519157'],
  'meta_type': 'type:Username',
  'external_id': '3063519157',
  'id': 'Tu8sq5',
  'name': '@AubreyLee91'},
 {'hits': 7,
  'display_name': '@AhmadKzha (Ahmad Kzha)',
  'curated': 0,
  'common_names': ['Ahmad Kzha'],
  'created': '2014-04-16T01:33:45.107Z',
  'type': 'Username',
  'created_at': '2010-09-17T11:55:03.000Z',
  'domain': 'B_E-R',
  'alias': ['191803649', '@ahmadthekiller'],
  'meta_type': 'type:Username',
  'external_id': '191803649',
  'id': 'K6hgaW',
  'name': '@AhmadKzha'}

I need to merge the JSON objects that have duplicate Twitter handles.

I have managed to read and parse the metadata and to identify the duplicates in a data frame. Can anyone offer guidance on how to proceed, or code that would help?

import io

import ijson
import pandas as pd

json_file_name = 'username_sample.jsonrows'
rows = []  # collect rows in a list first; growing a DataFrame with .loc per line is slow

with open(json_file_name, encoding="UTF-8") as json_file:
    cursor = 0
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "Line", line,
              "at cursor index:", cursor)
        # Each line holds one JSON object, so parse it on its own
        line_as_file = io.StringIO(line)
        json_parser = ijson.parse(line_as_file)
        name = None
        for prefix, event, value in json_parser:
            if (prefix, event) == ('name', 'string'):
                name = value
        rows.append([line_number + 1, name])
        cursor += len(line)
        # Stop after a sample of the file
        if line_number == 50000:
            break

tmp = pd.DataFrame(rows, columns=['Index', 'Name'])
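
One possible way to find the duplicated handles from the names collected above is to count them in a first pass, so that a later pass only has to deal with handles that really occur more than once. The following is a minimal sketch using only the standard library. The function name `find_duplicate_handles` and the sample list are made up for illustration, and comparing handles case-insensitively is an assumption about what should count as a duplicate.

```python
from collections import Counter


def find_duplicate_handles(names):
    """Return the set of handles that occur more than once.

    Handles are lowercased before counting (assumption: '@Foo' and
    '@foo' refer to the same account). Missing names are skipped.
    """
    counts = Counter(name.lower() for name in names if name)
    return {handle for handle, n in counts.items() if n > 1}


# Illustrative input; in practice this would be the 'name' values
# read from the file, one per line.
names = ["@AhmadKzha", "@AubreyLee91", "@ahmadkzha", "@balashovplay", None]
duplicates = find_duplicate_handles(names)
print(duplicates)
```
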

My code should be able to identify duplicates and merge the two JSON objects that contain metadata for the same username.
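
The behaviour described here could be sketched roughly as follows, assuming the file holds one JSON object per line and that each object has a `name` field as in the sample. The merge rules are assumptions, not something the data format dictates: `hits` values are added together, list fields such as `alias` are unioned, and for any other conflicting field the value from the first record seen is kept. The helper names `merge_records` and `merge_by_handle` and the three sample lines are hypothetical.

```python
import json


def merge_records(existing, new):
    """Merge `new` into `existing` in place using the assumed rules."""
    for key, value in new.items():
        if key not in existing:
            existing[key] = value
        elif key == "hits":
            # Assumption: hit counts of duplicate records are summed
            existing[key] += value
        elif isinstance(value, list) and isinstance(existing[key], list):
            # Union list fields, preserving first-seen order
            for item in value:
                if item not in existing[key]:
                    existing[key].append(item)
        # Any other conflicting field keeps the first value seen
    return existing


def merge_by_handle(lines):
    """Return one merged record per handle from JSON-lines input."""
    merged = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: handles are compared case-insensitively
        key = record["name"].lower()
        if key in merged:
            merge_records(merged[key], record)
        else:
            merged[key] = record
    return list(merged.values())


# Illustrative lines only; real input would be the open file object
sample = [
    '{"hits": 1, "name": "@AhmadKzha", "alias": ["191803649"]}',
    '{"hits": 7, "name": "@ahmadkzha", "alias": ["191803649", "@ahmadthekiller"], "domain": "B_E-R"}',
    '{"hits": 2, "name": "@AubreyLee91"}',
]
result = merge_by_handle(sample)
for record in result:
    print(json.dumps(record))
```

Because a file object yields its lines one at a time, `merge_by_handle(open(path, encoding="UTF-8"))` would never hold the whole file as text, but the dictionary still keeps every unique handle in memory. If that does not fit for an 11 GB input, the same merge rule could be applied on top of a disk-backed store instead.
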

Remove duplicate user handles and merge rows in JSON using Python

0 answers: