Question

I am reading large pickle files into pandas dataframes. I loaded one of them and got it into exactly the shape I need. However, I have a folder with 40 pickle files named imdbnames0.pkl, imdbnames1.pkl, imdbnames2.pkl, ..., imdbnames40.pkl.

I want to load all of them in the same way and merge them into a single pandas dataframe.

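import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

with open("ethnicity_files/imdbnames1.pkl", 'rb') as fh:
    d = pickle.load(fh)
df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
df.head()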



    names         ethnicity        score  best
0   !Gubi Tietie  Asian            0.03   GreaterEuropean
1   !Gubi Tietie  GreaterAfrican   0.01   GreaterEuropean
2   !Gubi Tietie  GreaterEuropean  0.96   GreaterEuropean
3   !Gubi Tietie  British          0.17   WestEuropean
4   !Gubi Tietie  Jewish           0.13   WestEuropean
5   !Gubi Tietie  WestEuropean     0.65   WestEuropean
6   !Gubi Tietie  EastEuropean     0.05   WestEuropean
7   !Gubi Tietie  Nordic           0.00   Italian
8   !Gubi Tietie  Italian          0.69   Italian
9   !Gubi Tietie  Hispanic         0.12   Italian
10  !Gubi Tietie  French           0.16   Italian
11  !Gubi Tietie  Germanic         0.02   Italian
12  $2 Tony       Asian            0.00   GreaterEuropean
13  $2 Tony       GreaterAfrican   0.00   GreaterEuropean
14  $2 Tony       GreaterEuropean  1.00   GreaterEuropean
15  $2 Tony       British          0.00   WestEuropean
16  $2 Tony       Jewish           0.00   WestEuropean
17  $2 Tony       WestEuropean     1.00   WestEuropean
18  $2 Tony       EastEuropean     0.00   WestEuropean
19  $2 Tony       Nordic           0.00   Italian
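
For readers without the file at hand, the flattening step can be pictured in plain Python. This is only a sketch: it assumes each value in d is a list of records shaped like {'scores': [{'ethnicity': ..., 'score': ...}, ...], 'best': ...}, which is inferred from the output columns and not confirmed anywhere in the post. The helper name flatten_scores is made up for illustration.

```python
def flatten_scores(d):
    """Return one row per (name, score entry), mirroring the columns shown above.

    Assumes d maps a name to a list of records, each with a 'scores' list
    and a 'best' label (an inferred layout, not confirmed by the post).
    """
    rows = []
    for name, records in d.items():
        for record in records:
            for entry in record["scores"]:
                rows.append({
                    "names": name,
                    "ethnicity": entry["ethnicity"],
                    "score": entry["score"],
                    "best": record["best"],
                })
    return rows


# Tiny sample in the assumed layout (values copied from the first output rows)
sample = {
    "!Gubi Tietie": [
        {
            "scores": [
                {"ethnicity": "Asian", "score": 0.03},
                {"ethnicity": "GreaterAfrican", "score": 0.01},
                {"ethnicity": "GreaterEuropean", "score": 0.96},
            ],
            "best": "GreaterEuropean",
        },
    ]
}
rows = flatten_scores(sample)
```

If the real files follow this layout, each row corresponds to one line of the table, with the 'best' value repeated for every entry of its 'scores' list.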

One of the files is available here: https://drive.google.com/file/d/10cjsoWFJ46w-2lEsxh6hmuRZlLunatf-/view?usp=sharing

I just want to add all of them to a single pandas dataframe.

Answer 1

I think you need os.listdir():

# Be careful: this might give you a memory error if you
# don't have enough RAM for all your files.
# Also make sure the folder contains only the files you want to read.
import os
import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

files = os.listdir('ethnicity_files/')

list_of_dfs = []
for file in files:
    # pickle.load() expects an open binary file object, not a path string
    with open(os.path.join('ethnicity_files/', file), 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    list_of_dfs.append(df)
big_df = pd.concat(list_of_dfs, ignore_index=True)  # ignore_index resets the index of big_df
big_df.head()
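
If the folder might hold other files, or the load order matters, one option is to filter the directory listing by name before looping. The sketch below assumes the imdbnames<number>.pkl naming scheme from the question; the pattern and the helper name list_pickles are illustrative, not part of the original answer.

```python
import os
import re

# Assumed naming scheme from the question: imdbnames<number>.pkl
PATTERN = re.compile(r"^imdbnames(\d+)\.pkl$")


def list_pickles(folder):
    """Return full paths of matching files, ordered by their numeric suffix."""
    matches = []
    for name in os.listdir(folder):
        m = PATTERN.match(name)
        if m:
            # Keep the number so imdbnames10.pkl sorts after imdbnames2.pkl
            matches.append((int(m.group(1)), os.path.join(folder, name)))
    return [path for _, path in sorted(matches)]
```

The returned paths could then replace the os.listdir() result in the loop, so anything that does not match the pattern is skipped.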

Answer 2

You can use glob.glob to iterate over all files with a specific extension (.pkl in your case) in the current folder.

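import os
import glob
import pickle
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

cd = os.getcwd()
os.chdir('path_to_your_folder')

dfs = []
for file in glob.glob("*.pkl"):
    with open(file, 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    dfs.append(df)  # collect each file's frame instead of overwriting df
os.chdir(cd)
big_df = pd.concat(dfs, ignore_index=True)
print(big_df.head())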
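
The loop above changes the working directory so that the "*.pkl" pattern resolves inside the target folder. A possible alternative, sketched here with the standard pathlib module, globs the folder directly and yields one unpickled object at a time. The helper name iter_pickles is hypothetical.

```python
import pickle
from pathlib import Path


def iter_pickles(folder, pattern="*.pkl"):
    """Yield (filename, unpickled object) for each matching file, one at a time."""
    # sorted() gives a stable, alphabetical order; no os.chdir() is needed
    for path in sorted(Path(folder).glob(pattern)):
        with path.open("rb") as fh:
            yield path.name, pickle.load(fh)
```

Each yielded object could then go through the same json_normalize step as in the answer, with the resulting frames collected in a list and passed to pd.concat once at the end.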

Read multiple files in a folder and create a pandas dataframe

2 answers: