Pandas: reading multiple large .bz2 files and appending them together

Date: 2020-01-13 23:15:30

Tags: python json pandas for-loop glob

I need to read 30 .bz2 files. Each file is too large to read in full, so a chunk of x rows from each file is enough. I then want to concatenate all 30 files together.

import pandas as pd
import numpy as np
import glob
import os                                              # needed for os.path.join below
path = r'/content/drive/My Drive/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.bz2"))     # os.path.join keeps the concatenation OS-independent

# Below I read 10,000 lines x 11 from each file because of the RAM limit and append them together.
# How do I make it also append all 30 files together? I made an attempt below.

chunks = (pd.read_json(f, lines=True, chunksize = 1000) for f in all_files)
i = 0
chunk_list = []
for chunk in chunks:
    if i >= 11:
        break
    i += 1
    chunk_list.append(chunk)
    df = pd.concat(chunk_list, sort = True)
#print(df)
df
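The attempt above has a subtle bug: with `chunksize` set, `pd.read_json` returns a chunk *reader* per file, so the loop iterates over 30 readers rather than over DataFrame chunks, and `pd.concat` then fails on non-DataFrame objects. A corrected sketch (not the accepted answer; the per-file chunk cap of 11 and chunk size of 1,000 are carried over from the attempt, and the context-manager form needs a reasonably recent pandas):

```python
import glob
import os

import pandas as pd

path = r'/content/drive/My Drive/'                  # use your path
all_files = glob.glob(os.path.join(path, "*.bz2"))

max_chunks = 11      # cap chunks per file to stay within the RAM limit
chunk_list = []
for f in all_files:
    # With chunksize set, read_json yields DataFrames lazily;
    # compression='infer' (the default) handles the .bz2 extension.
    with pd.read_json(f, lines=True, chunksize=1000) as reader:
        for i, chunk in enumerate(reader):
            if i >= max_chunks:
                break
            chunk_list.append(chunk)

# Concatenate once, after both loops, instead of inside the loop.
df = pd.concat(chunk_list, ignore_index=True, sort=True) if chunk_list else pd.DataFrame()
```

Concatenating once at the end avoids the quadratic cost of calling `pd.concat` on a growing list inside the loop.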

Sample .bz2 data is available at: https://csr.lanl.gov/data/2017.html

1 Answer:

Answer 0 (score: 0)


This seems to work.
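The body of the answer did not survive extraction, so the following is only a plausible reconstruction of the kind of one-liner that "seems to work" here: since pandas 1.1, `read_json` accepts `nrows` with `lines=True`, which caps the lines read per file without managing chunks by hand (the 10,000-line cap is taken from the question's comment):

```python
import glob
import os

import pandas as pd

path = r'/content/drive/My Drive/'      # use your path
frames = [
    # nrows caps the lines read from each file (requires lines=True, pandas >= 1.1);
    # bz2 decompression is inferred from the file extension.
    pd.read_json(f, lines=True, nrows=10_000)
    for f in glob.glob(os.path.join(path, "*.bz2"))
]
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

This reads each file only as far as needed, which keeps memory bounded while still producing a single combined DataFrame.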