Question

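Problem solved: as it turns out, I never actually had a problem. When I counted the records to determine how many I should be importing, the blank space between the .json objects was added to the total record count. On import, however, only the objects that contain content are moved. I'll leave this post here for reference. Thanks to those who contributed.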
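The miscount described above can be reproduced with a small sketch. It assumes the files hold one JSON object per line with empty lines in between; the helper name `count_lines` and the sample data are hypothetical and do not come from the original post.

```python
import io

def count_lines(stream):
    """Return (total_lines, non_blank_lines) for a line-oriented stream."""
    total = 0
    non_blank = 0
    for line in stream:
        total += 1
        if line.strip():  # whitespace-only lines hold no record
            non_blank += 1
    return total, non_blank

# Hypothetical sample: two objects, each followed by a blank line.
sample = io.StringIO('{"id": 1}\n\n{"id": 2}\n\n')
total, non_blank = count_lines(sample)
print(total, non_blank)  # prints: 4 2
```

A naive line count reports twice as many "records" as there are objects. That would be consistent with the numbers in this post, where the expected total of 22,343,770 is exactly twice the 11,171,885 documents imported in the first attempt.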

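I have about 33GB of .JSON files, retrieved from Twitter's streaming API and stored in a local directory. I am trying to import this data into a MongoDB collection. I have made two attempts: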

First attempt: read each file individually (~70 files). This successfully imported 11,171,885 / 22,343,770 documents.

import json
import glob
from pymongo import MongoClient

directory = '/data/twitter/output/*.json'
client = MongoClient("localhost", 27017)
db = client.twitter
collection = db.test

jsonFiles = glob.glob(directory)
for file in jsonFiles:
        f = open(file, 'r')
        for line in f.read().split("\n"):
                if line:
                        try:
                                lineJson = json.loads(line)
                        except (ValueError, KeyError, TypeError) as e:
                                pass
                        else:
                                postid = collection.insert(lineJson)
                                print 'inserted with id: ' , postid

        f.close()

Second attempt: concatenate all the .JSON files into one large file. This successfully imported 11,171,879 / 22,343,770 documents.

import json
import os
from pymongo import MongoClient
import sys

client = MongoClient("localhost", 27017)
db = client.tweets
collection = db.test

script_dir = os.path.dirname(__file__)
file_path = os.path.join(script_dir, '/data/twitter/blob/historical-tweets.json')

with open(file_path, 'r') as f:
        for line in f.read().split("\n"):
                if line:
                        try:
                                lineJson = json.loads(line)
                        except (ValueError, KeyError, TypeError) as e:
                                pass
                        else:
                                postid = collection.insert(lineJson)
                                print 'inserted with id: ' , postid

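The Python script does not raise an error or print a traceback; it simply stops running. Any ideas what could be causing this? Or are there alternative ways to import the data more efficiently? Thanks in advance.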
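On the efficiency question, one possible approach is to iterate over the file lazily instead of calling `f.read()` on the whole file, and to insert documents in batches rather than one at a time. The following is a minimal sketch under assumptions, not a tested import pipeline: the function name `import_json_lines` and the `batch_size` default are hypothetical, and `collection` is assumed to be any object that provides an `insert_many(list_of_dicts)` method, as PyMongo collections do from version 3.0 onward.

```python
import json

def import_json_lines(path, collection, batch_size=1000):
    """Stream a newline-delimited JSON file into a collection in batches.

    Returns a tuple (inserted, skipped_blank, failed).
    """
    inserted = skipped_blank = failed = 0
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # reads one line at a time, not the whole file
            line = line.strip()
            if not line:
                skipped_blank += 1
                continue
            try:
                batch.append(json.loads(line))
            except ValueError:  # raised by json.loads on malformed input
                failed += 1
                continue
            if len(batch) >= batch_size:
                collection.insert_many(batch)
                inserted += len(batch)
                batch = []
    if batch:  # flush the final partial batch
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted, skipped_blank, failed
```

With PyMongo installed, this could be called as `import_json_lines(path, MongoClient("localhost", 27017).twitter.test)`. The returned counts make it possible to check the number of inserted documents against the number of non-blank lines, rather than against a raw line count.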

Answer 1

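You are reading the file one line at a time. Is every line really valid JSON? If not, json.loads will raise an exception, and your pass statement hides the traceback.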
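One way to check this is to record the failures instead of discarding them. The sketch below is illustrative only: the helper name `parse_json_lines` and the sample lines are hypothetical, and it catches `ValueError` because `json.JSONDecodeError` is a subclass of it.

```python
import json

def parse_json_lines(lines):
    """Parse an iterable of text lines, collecting failures instead of hiding them.

    Returns (documents, errors); errors is a list of (line_number, message).
    """
    documents = []
    errors = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank line, nothing to parse
        try:
            documents.append(json.loads(line))
        except ValueError as exc:
            errors.append((number, str(exc)))
    return documents, errors

# Hypothetical input: the third line is truncated JSON.
docs, errors = parse_json_lines(['{"id": 1}', '', '{"id": 2', '{"id": 3}'])
print(len(docs), "parsed;", len(errors), "failed")
for number, message in errors:
    print("line", number, "->", message)
```

If the error list stays empty across all files, invalid lines are not the reason for the difference in counts, and the expected total itself is worth rechecking.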

Unable to import all .JSON files into MongoDB