Question

我从mongodb开始。我有一个包含3400000行数据的文本文件，我想将该数据上传到mongodb数据库。

文本文件如下所示：

360715;157.55.34.97;Mozilla/5.0;/pub/index.asp;NULL
3360714;157.55.32.233;Mozilla/5.0;/pub/index.asp;NULL
....

我希望将它放在具有以下结构的mongodb数据库中：

{'log' : '360715;157.55.34.97;Mozilla/5.0;/pub/index.asp;NULL'}
{'log' : '3360714;157.55.32.233;Mozilla/5.0;/pub/index.asp;NULL'}
....

其实我是这样逐行上传的：

for data_line in records:
    parsed_line = re.sub(r'[^a-zA-Z0-9,.\;():/]', '', data_line) 
    to_insert = unicode(parsed_line, "utf-8")
    db.data_source.insert({'log':to_insert})

有没有办法一次将所有行上传到我想要的格式？ “逐行”方法要慢得多。

以前，我在python中编写了一个脚本来解析文本文件，它需要1秒到100000行，需要65秒到3400000.

我考虑使用mongodb来改进这个过程，但实际上，使用mongodb只需要将100000行上传到我的脚本以执行所有数据解析的数据库。因此，很难说我正在做一些与mongodb严重错误的事情。

如果有人可以帮助我，我很感激。

问候，João

Answer 1

我会尝试将数据批量插入mongodb。现在，您正在为文本文件中的每一行调用insert（）一次，这很昂贵。

如果您运行的是mongoDB版本2.6.x，则可以使用批量插入http://api.mongodb.org/python/current/examples/bulk.html。本教程下面的示例代码：

>>> import pymongo
>>> db = pymongo.MongoClient().bulk_example
>>> db.test.insert(({'i': i} for i in xrange(10000)))
[...]
>>> db.test.count()
10000

在这种情况下，单个insert（）调用将10000个文档插入到mongodb中。

下面的另一个例子：

>>> new_posts = [{"author": "Mike",
...               "text": "Another post!",
...               "tags": ["bulk", "insert"],
...               "date": datetime.datetime(2009, 11, 12, 11, 14)},
...              {"author": "Eliot",
...               "title": "MongoDB is fun",
...               "text": "and pretty easy too!",
...               "date": datetime.datetime(2009, 11, 10, 10, 45)}]
>>> posts.insert(new_posts)
[ObjectId('...'), ObjectId('...')]

http://api.mongodb.org/python/current/tutorial.html

将所有文本文件内容上传到mongodb（使用pymongo）

1 个答案: