Question

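I am currently working with a large amount of Twitter data (1.5 million tweets, ~4 GB), and I am trying to do some quick and dirty data exploration. Below is my code; its goal is to find the rows where the 'text' field is not null. As a first step I use df.shape to get a count. However, this fairly straightforward task is taking longer than expected. Do you have any suggestions on how I can improve the code to get results faster? Thanks in advance.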

import json

import pandas as pd


def is_json(myjson):
    # return True if the string parses as JSON, False otherwise
    try:
        json.loads(myjson)
    except ValueError:
        return False
    return True


# open and read the file line by line, skipping lines that are not valid
# JSON according to is_json (note: each valid line is parsed twice here)
with open('json_test_file.txt') as data_file:
    data = pd.DataFrame(json.loads(line) for line in data_file if is_json(line))

# keep only the rows where the 'text' field is not null
df = data[pd.notnull(data['text'])]

print(df.shape)
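
One thing that stands out in the code above is that every valid line goes through json.loads twice (once in is_json, once when building the DataFrame), and a full DataFrame is built even though only a count is needed. Below is a minimal sketch of a single-pass alternative, not a measured benchmark. The helper names iter_valid_records and count_with_text are hypothetical, and the sketch assumes that "not null" means the 'text' key is present and its value is not JSON null.

```python
import json


def iter_valid_records(lines):
    """Yield the parsed object for each valid JSON line, parsing it only once."""
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError; skip bad lines
            continue


def count_with_text(lines):
    """Count records that are JSON objects with a non-null 'text' field."""
    return sum(
        1
        for record in iter_valid_records(lines)
        if isinstance(record, dict) and record.get("text") is not None
    )


# Small in-memory sample standing in for the real file (assumption: one JSON
# document per line, as in the question).
sample = ['{"text": "hello"}', "not json", '{"text": null}', '{"id": 1}']
print(count_with_text(sample))

# With the real file, the same function can consume the file object directly:
# with open('json_test_file.txt') as data_file:
#     print(count_with_text(data_file))
```

Because the generator streams the file, memory use stays roughly constant regardless of file size. If the DataFrame is needed later anyway, iter_valid_records(data_file) can be passed to pd.DataFrame in place of the original generator expression.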

For medium-sized data

0 answers: