Question

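Sorry, this is a very beginner question. I have a multivariate dataset from reddit (https://files.pushshift.io/reddit/submissions/), but the files are far too big. Is it possible to downsample one of these files to 20% or less, and either save it as a new file (json or csv) or read it directly into a pandas DataFrame? Any help would be greatly appreciated!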

Here is my attempt so far:

import json
import pandas as pd

def load_json_df(filename, num_bytes=-1):
    '''Load roughly the first `num_bytes` of the file and convert each JSON line into a row of a pandas DataFrame.'''
    # With the default num_bytes=-1, readlines() reads every line in the file.
    with open(filename, encoding='utf-8') as fs:
        df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
    return df

january_df = load_json_df('RS_2019-01.json')

january_df.sample(frac=0.2)

However, this gives me a memory error when it tries to open the file. Is there a way to downsample it without opening the entire file?

Answer 1

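The problem is that there is no way to know up front exactly what 20% of the data is. To find out, you first have to read through the whole file; only then do you know how many records 20% amounts to.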
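If an exact 20% is really needed, the "read it all once to learn the length" idea above can be done without keeping the file in memory. The following is only a sketch under some assumptions: the file has one JSON object per line with no blank lines, and the names `count_lines` and `sample_exact_fraction` are made up for illustration. The first pass counts lines, the second pass keeps each line with probability (records still needed) / (lines still remaining), which ends up with exactly the requested number of records.

```python
import json
import random

def count_lines(path):
    """First pass: count lines without storing them."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def sample_exact_fraction(path, frac=0.2, seed=None):
    """Second pass: return exactly int(total * frac) randomly chosen records."""
    rng = random.Random(seed)
    total = count_lines(path)
    needed = int(total * frac)
    records = []
    with open(path, encoding="utf-8") as f:
        for seen, line in enumerate(f):
            remaining = total - seen  # lines not yet examined, including this one
            # Keep this line with probability needed / remaining.
            if needed > 0 and rng.random() * remaining < needed:
                records.append(json.loads(line))
                needed -= 1
    return records
```

Only the selected records are parsed and kept, so peak memory is roughly the size of the 20% sample rather than the whole file. The cost is that the file is read twice.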

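Reading a large file into memory all at once will usually raise this error. You can get around it by reading the file line by line, like this: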

import json

data = []
counter = 0  # number of lines read so far
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
        counter += 1

Then you should be able to do this:

df = pd.DataFrame(data)  # to keep only 20%, pass a slice instead, e.g. data[:counter // 5]
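Note that the loop above still stores every parsed record in `data`, so it may hit the same memory limit on a very large file. One possible variation, sketched here as an assumption rather than as part of the original answer, is to decide line by line whether to keep a record, so that only about 20% of the file is ever parsed. The function name `sample_json_lines` and its parameters are hypothetical.

```python
import json
import random

def sample_json_lines(path, frac=0.2, seed=None):
    """Yield roughly `frac` of the records in a JSON Lines file, one at a time."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Decide before parsing, so skipped lines are never decoded as JSON.
            if line.strip() and rng.random() < frac:
                yield json.loads(line)
```

With pandas imported as `pd`, something like `df = pd.DataFrame(list(sample_json_lines('RS_2019-01.json', frac=0.2)))` would then build the data frame from the sample only. The sample size is approximate (each line is kept independently), which is usually fine for exploratory work.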

Answer 2

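I first downloaded one of the files, namely https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2, decompressed it and looked at the contents. As it happens, it is not proper JSON but JSON Lines: a series of JSON objects, one per line (see http://jsonlines.org/). This means you can simply cut out as many lines as you want with whatever tool you like (a text editor, for example). Alternatively, you can process the file sequentially in a Python script and only consider every fifth line, for example: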

import json

with open('RS_2019-01.json', 'r') as infile:
    for i, line in enumerate(infile):
        if i % 5 == 0:  # keep every fifth line, i.e. 20% of the file
            j = json.loads(line)
            # process the data here
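Since each line is a complete record, the same every-fifth-line idea can also write a smaller file without parsing anything. Here is a rough sketch of that; the function name `downsample_lines` is made up, and it assumes that a source path ending in `.bz2` is a bz2 archive that can be read directly with the standard library's `bz2.open`, so no separate decompression step is needed.

```python
import bz2

def downsample_lines(src, dst, step=5):
    """Copy every `step`-th line of `src` to `dst`; return the number of lines written."""
    # Assumption: a ".bz2" suffix means the file is bz2-compressed text.
    opener = bz2.open if str(src).endswith(".bz2") else open
    written = 0
    with opener(src, "rt", encoding="utf-8") as infile, \
            open(dst, "w", encoding="utf-8") as outfile:
        for i, line in enumerate(infile):
            if i % step == 0:
                outfile.write(line)  # the line is copied as-is, still valid JSON Lines
                written += 1
    return written
```

For example, `downsample_lines('RS_2019-01.json', 'RS_2019-01_sample.json')` would produce a file about one fifth the size (the output filename is just an example). Keep in mind that taking every fifth line is a systematic sample rather than a random one, which may matter if the records are ordered in some periodic way.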

How to downsample a .json file
