Question

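I'm currently working on a project in which I run sentiment analysis on Twitter posts. I'm classifying the tweets with Sentiment140. With that tool I can classify up to 1,000,000 tweets per day, and I have collected about 750,000 tweets, so that should be fine. The only problem is that I can send at most 15,000 tweets at a time to the JSON bulk classification.

My code is all set up and running. The only problem is that my JSON file currently contains all 750,000 tweets.

So my question is: what is the best way to split the JSON into smaller files with the same structure? I'd prefer to do this in Python.

I've thought about iterating over the file. But how do I specify in the code that it should start a new file after, for example, 5,000 elements?

I'd appreciate pointers on the most sensible approach. Thanks!

Edit: This is my current code.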

import itertools
import json
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# Open JSON file
values = open('Tweets.json').read()
#print values

# Adjust formatting of JSON file
values = values.replace('\n', '')    # do your cleanup here
#print values

v = values.encode('utf-8')
#print v

# Load JSON file
v = json.loads(v)
print type(v)

for i, group in enumerate(grouper(v, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)

This writes the following to a file named "outputbatch_0.json":

["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]

Edit 2: This is the structure of the JSON.

{
"data": [
{
"text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
"id": "1"
},
{
"text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u00e2\u20ac\u00a6",
"id": "2"
},
{
"text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
"id": "3"}
]
}

Answer 1

Use an iterating grouper; the itertools module's recipes list includes the following:

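try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2 name

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    # The same iterator repeated n times, so each tuple takes n consecutive items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)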

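This lets you iterate over your tweets in groups of 5000. Note that input_tweets must be the list of tweets itself (with the structure shown in the question, the list stored under the "data" key), not the top-level dict. Iterating over the dict yields only its keys, which is why the output in the question is "data" followed by padding values:

for i, group in enumerate(grouper(input_tweets, 5000)):
    # Drop the None padding that grouper() adds to the last, shorter group
    batch = [tweet for tweet in group if tweet is not None]
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(batch, outputfile)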
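
As a minimal sketch of the same idea for Python 3 (where izip_longest is no longer available), the following uses itertools.islice so that no padding is produced at all, and wraps each batch in a {"data": [...]} object so the output files keep the structure shown in the question. The names chunked and split_tweets, and the size and output_prefix parameters, are illustrative assumptions and not part of the itertools recipes.

```python
import json
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items, with no padding."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def split_tweets(input_path, size=5000, output_prefix='outputbatch_'):
    """Split a {"data": [...]} file into files of the same shape.

    Returns the list of file names written (hypothetical helper).
    """
    with open(input_path, encoding='utf-8') as infile:
        tweets = json.load(infile)['data']  # the list, not the top-level dict

    written = []
    for i, batch in enumerate(chunked(tweets, size)):
        name = '{}{}.json'.format(output_prefix, i)
        with open(name, 'w', encoding='utf-8') as outfile:
            # Re-wrap the batch so each file mirrors the input structure
            json.dump({'data': batch}, outfile)
        written.append(name)
    return written
```

Calling split_tweets('Tweets.json') would then write outputbatch_0.json, outputbatch_1.json and so on, each holding at most 5000 tweets. This still loads the whole input file into memory first, which should be acceptable for a file of this size but is an assumption worth checking.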

Answer 2

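I think your first idea is good. Just iterate over all the tweets, collect them in a temporary array, and keep an index that you increment by one for each tweet. Whenever the current index modulo 5000 equals 0, call a method that converts the collected tweets to a string and saves them to a file whose name contains the index, then empty the temporary array. When you reach the end of the tweets, do the same for whatever remains.

I'm not sure whether this answers your question. If you're looking for something more sophisticated, just search for a Hadoop JSON file splitter.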
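
The steps above could be sketched roughly as follows. This is only one possible reading of the description: the function names write_batch and split_by_counter, the prefix parameter, and the choice to wrap each batch in a {"data": [...]} object (to keep the structure from the question) are all assumptions made for illustration.

```python
import json


def write_batch(tweets, index, prefix='outputbatch_'):
    """Convert one batch to a JSON string and save it under a name containing the index."""
    name = '{}{}.json'.format(prefix, index)
    with open(name, 'w', encoding='utf-8') as outfile:
        outfile.write(json.dumps({'data': tweets}))
    return name


def split_by_counter(tweets, size=5000, prefix='outputbatch_'):
    """Buffer tweets and flush the buffer every `size` items; return the file names."""
    written = []
    buffer = []  # the temporary array
    count = 0    # running index, incremented once per tweet
    for tweet in tweets:
        buffer.append(tweet)
        count += 1
        if count % size == 0:  # buffer is full: write it out and start over
            written.append(write_batch(buffer, count, prefix))
            buffer = []
    if buffer:  # leftover tweets after the last full batch
        written.append(write_batch(buffer, count, prefix))
    return written
```

With this naming scheme the file names carry the index of the last tweet they contain (for example outputbatch_5000.json, outputbatch_10000.json), and the final file holds the remainder.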
